Towards a reference genome that captures global genetic diversity

Nat Commun. 2020 Oct 30;11(1):5482. doi: 10.1038/s41467-020-19311-w.

Abstract

The current human reference genome is predominantly derived from a single individual and does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality assemblies from genetically divergent human populations to identify sequences missing from the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4,781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.
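
To put the reported insertion counts in perspective, the short Python sketch below derives two summary figures from the numbers in the abstract: the mean insertion length and the share of the genome the insertions would represent. The abstract reports neither value. The genome size of about 3.1 Gbp is an assumption made here for illustration and does not come from the article, and the names `mean_insertion_length` and `ASSUMED_GENOME_BP` are hypothetical.

```python
# Figures taken directly from the abstract.
N_INSERTIONS = 127_727          # recurrent non-reference unique insertions
TOTAL_INSERTED_BP = 18_048_877  # total span of those insertions, in bp

# ASSUMPTION (not from the article): approximate size of the human
# reference genome, used only to express the insertions as a fraction.
ASSUMED_GENOME_BP = 3_100_000_000


def mean_insertion_length(total_bp: int, n_insertions: int) -> float:
    """Return the arithmetic mean insertion length in base pairs."""
    if n_insertions <= 0:
        raise ValueError("n_insertions must be positive")
    return total_bp / n_insertions


mean_len = mean_insertion_length(TOTAL_INSERTED_BP, N_INSERTIONS)
genome_fraction = TOTAL_INSERTED_BP / ASSUMED_GENOME_BP

print(f"Mean insertion length: {mean_len:.1f} bp")
print(f"Share of assumed genome size: {genome_fraction:.2%}")
```

The mean is only a summary value: the abstract gives no length distribution, so it says nothing about how many insertions are short or long.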

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Chromosome Mapping
  • Computational Biology
  • Gene Expression
  • Genetic Variation*
  • Genome, Human*
  • Genomics
  • Genotyping Techniques
  • Humans
  • Molecular Sequence Annotation
  • Population / genetics*
  • RNA-Seq
  • Sequence Analysis, DNA
  • Transcriptome
  • Whole Genome Sequencing