A haplotype, the sequence of alleles along a single chromosome, has important applications in phenotype-genotype association studies and in population genetic analyses. Because haplotypes cannot be experimentally assayed in diploid organisms in a high-throughput fashion, numerous statistical methods have been developed to reconstruct probable haplotypes from genotype data. These methods focus primarily on accurately phasing short genomic regions containing a small number of markers, and their error rates increase rapidly for longer regions. Here we introduce a new phasing algorithm, emphases, which aims to improve long-range phasing accuracy. Using datasets from multiple populations, we found that emphases reduces long-range phasing errors by up to 50% compared to current state-of-the-art methods. In addition to inferring the most likely haplotypes, emphases produces confidence measures, allowing downstream analyses to account for the uncertainty associated with some haplotypes. We anticipate that emphases will offer a powerful tool for analyzing the large-scale data generated in genome-wide association studies (GWAS).
Keywords: Expectation Maximization; graphical model; haplotype; phasing.