Imputation and quality control steps for combining multiple genome-wide datasets

Shefali S Verma; Mariza de Andrade; Gerard Tromp; Helena Kuivaniemi; Elizabeth Pugh; Bahram Namjou-Khales; Shubhabrata Mukherjee; Gail P Jarvik; Leah C Kottyan; Amber Burt; Yuki Bradford; Gretta D Armstrong; Kimberly Derr; Dana C Crawford; Jonathan L Haines; Rongling Li; David Crosslin; Marylyn D Ritchie

doi:10.3389/fgene.2014.00370

Front Genet. 2014 Dec 11;5:370. doi: 10.3389/fgene.2014.00370. eCollection 2014.

Authors

Shefali S Verma¹, Mariza de Andrade², Gerard Tromp³, Helena Kuivaniemi³, Elizabeth Pugh⁴, Bahram Namjou-Khales⁵, Shubhabrata Mukherjee⁶, Gail P Jarvik⁶, Leah C Kottyan⁵, Amber Burt⁶, Yuki Bradford¹, Gretta D Armstrong¹, Kimberly Derr³, Dana C Crawford⁷, Jonathan L Haines⁸, Rongling Li⁹, David Crosslin⁶, Marylyn D Ritchie¹

Affiliations

¹ Department of Biochemistry and Molecular Biology, Center for Systems Genomics, The Pennsylvania State University, Pennsylvania, PA, USA.
² Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
³ The Sigfried and Janet Weis Center for Research, Geisinger Health System, Danville, PA, USA.
⁴ Center for Inherited Disease Research, Johns Hopkins University, Baltimore, MD, USA.
⁵ Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA.
⁶ Department of Medicine, University of Washington, Seattle, WA, USA.
⁷ Center for Human Genetics Research, Vanderbilt University, Nashville, TN, USA; Department of Epidemiology and Biostatistics, Case Western University, Cleveland, OH, USA.
⁸ Department of Epidemiology and Biostatistics, Case Western University, Cleveland, OH, USA.
⁹ Division of Genomic Medicine, National Human Genome Research Institute, Bethesda, MD, USA.

Abstract

The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic datasets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.

Keywords: eMERGE; electronic health records; genome-wide association; imputation.
