Population Stratification and Underrepresentation of Indian Subcontinent Genetic Diversity in the 1000 Genomes Project Dataset

Dhriti Sengupta; Ananyo Choudhury; Analabha Basu; Michèle Ramsay

doi:10.1093/gbe/evw244

Genome Biol Evol. 2016 Dec 31;8(11):3460-3470. doi: 10.1093/gbe/evw244.

Authors

Dhriti Sengupta¹, Ananyo Choudhury¹, Analabha Basu², Michèle Ramsay³,⁴

Affiliations

¹ Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa.
² National Institute of Biomedical Genomics, Kalyani, India. ab1@nibmg.ac.in
³ Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa. michele.ramsay@wits.ac.za
⁴ Division of Human Genetics, School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa.

Abstract

Genomic variation in Indian populations is of great interest due to the diversity of ancestral components, social stratification, endogamy and complex admixture patterns. With an expanding population of 1.2 billion, India is also a treasure trove for cataloguing innocuous as well as clinically relevant rare mutations. Recent studies have revealed four dominant ancestries in populations from mainland India: Ancestral North-Indian (ANI), Ancestral South-Indian (ASI), Ancestral Tibeto-Burman (ATB) and Ancestral Austro-Asiatic (AAA). The 1000 Genomes Project (KGP) Phase-3 data include about 500 genomes from five linguistically defined Indian-Subcontinent (IS) populations (Punjabi, Gujarati, Bengali, Telugu and Tamil), some of whom are recent migrants to the USA or UK. Comparative analyses show that, despite the distinct geographic origins of the KGP-IS populations, the ANI component is predominantly represented in this dataset. Previous studies demonstrated population substructure in the HapMap Gujarati population, and we found evidence for additional substructure in the Punjabi and Telugu populations. These substructured populations show significant, characteristic differences in heterozygosity and inbreeding coefficients. Moreover, we demonstrate that the substructure is better explained by differences in the proportions of ancestral components and by endogamy-driven social structure than by invoking a novel ancestral component. Therefore, using language and/or geography as a proxy for an ethnic unit is inadequate for many of the IS populations. This highlights the necessity for more nuanced sampling strategies or corrective statistical approaches, particularly for biomedical and population genetics research in India.
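
The abstract compares heterozygosity and inbreeding coefficients between subgroups without stating how they were computed. As a rough illustration only, and not the authors' pipeline, the sketch below uses one common textbook estimator, F = 1 - H_obs / H_exp, where H_exp is the heterozygosity expected under Hardy-Weinberg proportions. The function name, the 0/1/2 genotype coding, and the assumption that population allele frequencies are already known are all hypothetical choices made for this example.

```python
def inbreeding_coefficient(genotypes, alt_freqs):
    """Estimate F = 1 - H_obs / H_exp for one individual.

    genotypes: alternate-allele counts per site (0, 1 or 2); assumed coding.
    alt_freqs: population alternate-allele frequency at the same sites.
    """
    if len(genotypes) != len(alt_freqs):
        raise ValueError("genotypes and alt_freqs must have the same length")

    # Observed number of heterozygous sites (exactly one alternate allele).
    observed_het = sum(1 for g in genotypes if g == 1)

    # Expected number of heterozygous sites under Hardy-Weinberg: sum of 2p(1-p).
    expected_het = sum(2.0 * p * (1.0 - p) for p in alt_freqs)
    if expected_het == 0:
        raise ValueError("no polymorphic sites: expected heterozygosity is zero")

    return 1.0 - observed_het / expected_het
```

Under this estimator, an individual with fewer heterozygous sites than expected gets a positive F, which is the direction one would anticipate for members of a strongly endogamous subgroup; negative values indicate an excess of heterozygous sites.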

Keywords: 1000 Genomes Project; Indian genomic diversity; ancestry; population structure; social stratification.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Bias
Datasets as Topic / standards*
Genome, Human*
Human Genome Project
Humans
India
Polymorphism, Genetic*
Population / genetics*

Grants and funding

U54 HG006938/HG/NHGRI NIH HHS/United States