Academic journal article Genetics

An Incomplete Understanding of Human Genetic Variation

Article excerpt

UNCOVERING the genetic basis of human disease and phenotype requires an understanding of the nature and pattern of human genetic variation. This includes not only variant discovery and accurate genotyping but a resolution of the haplotype structure and the mutational properties that have shaped our genome. The completion of phase 3 of the 1000 Genomes Project (Auton et al. 2015) was an important landmark in this regard. More than 2500 "normal genomes" were sequenced from 26 different human populations, revealing an impressive 84.7 million single-nucleotide variants (SNVs), 3.6 million insertion/deletion (indel) variants, and >60,000 structural variants (SVs). The latter are distinguished from indels based on event sizes greater than or equal to 50 bp in length (Sudmant et al. 2015b). While there are other similar population-based genome sequencing projects that have been recently completed (Genome of the Netherlands Consortium 2014; Sudmant et al. 2015a) or are underway (e.g., UK10K Consortium et al. 2015), most are smaller in scale and/or have more restrictions with respect to data access and use. As a result, the 1000 Genomes Project variants serve as one of the most powerful resources for understanding the normal pattern of human genetic variation.

There are two current limitations with this catalog of human genetic variation. First, it is derived from relatively sparse genome sequence data (six- to sevenfold sequence coverage). The decision to sequence genomes at this level of coverage was only partially an economic one. It was driven largely by population genetic theory where most of the common genetic variation (>1% allele frequency) could be resolved by imputation as a result of linkage disequilibrium (1000 Genomes Project Consortium 2010). As a result, more genomes were strategically sequenced rather than sequencing fewer genomes more deeply. The project exceeded expectations, detecting an estimated 75% of SNVs with an allele frequency of >0.1%. This approach had limited power to detect rare variants (<0.1% frequency) and SVs (irrespective of their allele frequency). For diseases where rare variants or SVs are known to play an important role (e.g., epilepsy, intellectual disability, autism, and schizophrenia) (Hoischen et al. 2014), larger and deeper datasets, such as the Exome Aggregation Consortium (ExAC) database for SNV mutations within coding sequence (Song et al. 2015) and SV databases developed from thousands of population controls (Coe et al. 2014; MacDonald et al. 2014), are critically important.
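The trade-off between coverage and rare-variant detection can be illustrated with a deliberately simplified sketch. It assumes that the number of reads carrying the alternate allele at a heterozygous site follows a Poisson distribution with mean equal to half the sequence coverage, and that a variant seen in only one individual needs at least two supporting reads to be called. The function name, the two-read threshold, and the Poisson model are illustrative assumptions, not the calling procedure used by the 1000 Genomes Project, and the sketch ignores sequencing error, mapping bias, and multi-sample calling.

```python
import math


def prob_alt_reads_at_least(mean_coverage, min_alt_reads=2, alt_fraction=0.5):
    """Probability that at least `min_alt_reads` reads carry the alternate allele.

    Assumes alt-supporting reads ~ Poisson(mean_coverage * alt_fraction),
    i.e. a heterozygous site with no sequencing error (hypothetical model).
    """
    lam = mean_coverage * alt_fraction
    # Sum the Poisson probabilities of observing fewer than min_alt_reads reads.
    p_fewer = sum(
        math.exp(-lam) * lam ** k / math.factorial(k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Compare sparse (about sevenfold) coverage with a deeper 30-fold design.
    for coverage in (7, 30):
        print(coverage, round(prob_alt_reads_at_least(coverage), 4))
```

Under these assumptions, roughly 14% of heterozygous sites private to a single sevenfold-coverage genome would have fewer than two supporting reads, whereas at 30-fold coverage almost none would. For common variants the same allele is sampled across many individuals, which is why sparse coverage combined with imputation performs well there.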

The second limitation is that not all genetic variation has been equally ascertained even after conditioning on the allele frequency. Detailed targeted sequencing of regions of the human genome suggests that indels should occur at approximately one-tenth of the frequency of SNVs (Bhangale et al. 2005), suggesting that the current catalog may be missing at least 30-40% of all indels. Detection of indels associated with short tandem repeat (STR) sequences is particularly challenging and specialized methods have been developed to discover and accurately genotype these from next-generation sequencing datasets (Karakoc et al. 2012; Narzisi et al. 2014; Willems et al. 2014; Chaisson et al. 2015a). Sensitivity for indel variant discovery is generally much lower than for SNVs. A comparison of 170 genomes sequenced to high coverage by an orthogonal sequencing platform (Complete Genomics) suggests that less than 75% of indels with an allele frequency of 0.5% were detected. The sensitivity of indel detection drops precipitously as the allele frequency dips below 0.3% (Auton et al. 2015).

The situation for SVs is, in fact, much bleaker with respect to sensitivity and specificity. This stems from the fact that discovery of these variants is largely indirect, depending on mapping short-read sequencing data using read-depth or read-pair detection methods. Thus, unlike SNVs where discovery and sequence resolution occur simultaneously, deletions, duplications, and inversions are often inferred based on specific signatures, with breakpoint resolution occurring post hoc. …
