Abstract
Structural variation (SV) is typically defined as variation within the human genome that exceeds 50 base pairs (bp). SV may be copy number neutral, or it may involve duplications, deletions, and complex rearrangements. Recent studies have shown SV to be associated with many human diseases. However, studies of SV have been challenging due to technological constraints. With the advent of third generation (long-read) sequencing technology, it has become possible to explore longer stretches of DNA that were previously difficult to examine. In the present study, we used long-read sequencing to examine SV in the EGFR landscape of four haplotypes derived from two human samples. We analyzed the EGFR gene and its landscape (+/- 500,000 bp) using this approach and identified a region of non-coding DNA with over 90% similarity to the most common activating EGFR mutation in non-small cell lung cancer. Based on previously published Alu-element genome instability algorithms, we propose a molecular mechanism to explain how this non-coding region of DNA may be interacting with and impacting the stability of the EGFR gene and potentially generating this cancer-driver gene. With these techniques, we also identified previously hidden structural variation in the four haplotypes and in the human reference genome (hg38). We applied previously published algorithms to compare the relative stabilities of these five EGFR gene landscape haplotypes (the four sample haplotypes and hg38) and to estimate their relative potentials to generate the canonical 15 bp deletion in EGFR exon 19. To our knowledge, the present study is the first to use the differences in genomic architecture between targeted cancer-linked phased haplotypes to estimate their relative potentials to form a common cancer-linked driver mutation.
MeSH terms
- Carcinoma, Non-Small-Cell Lung / genetics
- Computer Simulation
- Genes, erbB-1 / genetics*
- Genetic Variation*
- Genome, Human / genetics*
- Genomic Instability*
- Haplotypes
- High-Throughput Nucleotide Sequencing*
- Humans
- Lung Neoplasms / genetics
- Sequence Analysis, DNA
Grants and funding
Please note that several of the co-authors are employed by commercial entities: Roche Sequencing Solutions [GFM, CM, DR, DLB], Pacific Biosciences [WJR, CL, KE, JG, PB], Lion Elastomers [JTF], Nouryon Polymer Chemicals [HDH] and Sentry Genomics [GWC]. Roche Sequencing Solutions provided material and technical resources for the development of the long-read sequencing library preparation and the custom targeted capture protocol designed specifically for this study. Pacific Biosciences provided material and technical resources for the long-read SMRT sequencing of captured samples and the bioinformatics assembly of sequencing results. Both of these commercial entities provided support in the form of salaries for authors [GFM, CM, DR, DLB, WJR, CL, KE, JG, PB], but did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript (other than providing their respective technical protocols in the materials and methods section of the manuscript). John T. Fussell, an employee of Lion Elastomers, assisted in the preparation and review of the statistical analyses within the manuscript. Heath D. Herbold, an employee of Nouryon Polymer Chemicals, assisted in the preparation of the figures for the manuscript. Each of these authors participated in the preparation of this manuscript as an independent researcher, and their contributions were made wholly outside of their respective employments. All authors were given the opportunity to review, provide feedback on, and approve the manuscript. The specific roles of all authors are articulated in the ‘author contributions’ section.