Academic journal article Genetics

A Perfect Match Genomic Landscape Provides a Unified Framework for the Precise Detection of Variation in Natural and Synthetic Haploid Genomes

Article excerpt

At the heart of genomics lies the precise determination of an organism's DNA sequence. Genome projects typically generate large amounts of sequence reads, which constitute a fragmented and unordered representation of genetic information. Sequence reads are subsequently assembled either de novo (Zerbino and Birney 2008) or through comparison with the ordered genetic information of a reference genome (Metzker 2010). Reference genomes exist for different species, from bacteria (Blattner et al. 1997) to human (International Human Genome Sequencing Consortium 2004). The central position of reference genomes as platforms for uncovering the nucleotide sequence of related genomes underscores the importance of their continuous refinement according to conceptual and methodological advances (Goodwin et al. 2016). Genomic studies typically contrast the variation profiles of genomes of interest in a broad set of contexts, ranging from experimental evolution (Tenaillon et al. 2016) to personalized medicine (Abrahams and Eck 2016). The generation of both high-quality reference genomes and precise variation profiles between genomes is therefore of utmost importance.

Most current algorithms for detecting genome variation are based on mapping sequence reads from the query genome to the reference genome to infer their corresponding locations. The problem of aligning sequence reads to a reference genome is central to genomics (Pfeifer 2017), as it is the basis for such procedures as whole genome sequencing (1000 Genomes Project Consortium 2012), exome (Teer and Mullikin 2010) and transcriptome sequencing (Wang et al. 2009), and chromatin immunoprecipitation sequencing (Park 2009), among others. A plethora of programs exist to solve this problem computationally (Mardis 2013; Goodwin et al. 2016). Most methods index the reference genome into highly optimized data structures (Kurtz et al. 2004; R. Q. Li et al. 2008; H. Li et al. 2008; Li and Durbin 2009; Chaisson and Tesler 2012; Langmead and Salzberg 2012; Holt and McMillan 2014), generating a variety of specialized algorithms (Schbath et al. 2012). Due to experimental error (Yang et al. 2013) or true variance between the query genome and the reference genome, most sequence reads do not match exactly with the reference genome. Thus, all aligners ultimately try to solve the "approximate string matching" problem (Reinert et al. 2015) using some arbitrary measure of "acceptable inexactness." The optimal placement of sequence reads is therefore reported in conjunction with some measure of reliability. Consequently, the discovered variants and the resulting query genome sequence are likewise statistical in nature (McKenna et al. 2010; Li 2011; Koboldt et al. 2012; Rimmer et al. 2014).
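To make the "approximate string matching" problem concrete, the following is a minimal sketch of a textbook dynamic-programming approach (semi-global edit distance). It is not the method of any aligner cited above, which rely on indexed data structures for speed. The function names and the `max_edits` threshold are hypothetical and introduced here for illustration only; the threshold stands in for the arbitrary measure of acceptable inexactness.

```python
from typing import Tuple


def best_approximate_match(read: str, reference: str) -> Tuple[int, int]:
    """Return (edit_distance, end_position) of the best placement of
    `read` anywhere in `reference`.

    Illustrative only: runs in O(len(read) * len(reference)) time,
    whereas practical aligners index the reference first.
    """
    # Row 0 is all zeros so that a match may start at any reference position.
    prev = [0] * (len(reference) + 1)
    for i in range(1, len(read) + 1):
        curr = [i] + [0] * len(reference)
        for j in range(1, len(reference) + 1):
            cost = 0 if read[i - 1] == reference[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # read base absent from the reference
                curr[j - 1] + 1,     # reference base absent from the read
            )
        prev = curr
    # The smallest value in the last row is the best placement.
    best_end = min(range(len(prev)), key=prev.__getitem__)
    return prev[best_end], best_end


def is_acceptable(read: str, reference: str, max_edits: int) -> bool:
    """Hypothetical acceptance rule: keep a read if its best placement
    needs at most `max_edits` edits."""
    distance, _ = best_approximate_match(read, reference)
    return distance <= max_edits
```

A read that occurs exactly in the reference is placed with distance 0, while a read carrying one substitution is placed with distance 1 and is accepted or rejected depending solely on the chosen threshold, which is why the resulting placements are reported with a measure of reliability.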

We have conceptualized the analysis of genome variation from a different perspective, decomposing it into two independent processes: first, finding where the query genome and the reference genome are not identical, and second, revealing the nature of the underlying variants. The precise location of genome sites affected by variation is directly determined from a genome-wide identity landscape, or Perfect Match Genomic Landscape (PMGL). Variant characterization can thus be conducted locally, and solutions can be validated in a qualitative manner. We have previously reported the potential to precisely locate single nucleotide variants by individualizing regions of the reference genome (Reyes et al. 2011), a step that is incorporated into the PMGL strategy. Most interestingly, a recent study has developed an algorithm that is based on a principle similar to that of the PMGL strategy (Audano et al. 2017). Their algorithm also reduces the variant search space by first identifying regions that differ between the query genome and the reference genome. In addition to determining the variation profiles of both natural and synthetic query genomes, the non-statistical nature of the PMGL strategy is particularly suited for refining reference genomes. …
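The idea of locating variation from an identity landscape can be illustrated with a toy sketch. This is not the authors' PMGL implementation, whose details lie outside this excerpt; it assumes error-free reads, a single strand, and a reference without repeated k-mers. The function names and the parameter `k` are hypothetical. Each reference position is scored by the number of exactly matching read-derived k-mers that cover it, and positions with no perfect-match support mark where the two genomes differ.

```python
from collections import Counter
from typing import List, Tuple


def perfect_match_landscape(reference: str, reads: List[str], k: int) -> List[int]:
    """Per-position count of read k-mers that match the reference exactly.

    Toy model: ignores reverse complements, sequencing errors, and repeats.
    """
    # Count every k-mer observed in the reads.
    kmer_counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    # Each reference k-mer with read support adds to the positions it spans.
    coverage = [0] * len(reference)
    for start in range(len(reference) - k + 1):
        support = kmer_counts.get(reference[start:start + k], 0)
        if support:
            for pos in range(start, start + k):
                coverage[pos] += support
    return coverage


def zero_coverage_regions(coverage: List[int]) -> List[Tuple[int, int]]:
    """Half-open (start, end) intervals with no perfect-match support."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth == 0 and start is None:
            start = pos
        elif depth != 0 and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions


reference = "AAACCCGGGTTT"
query_reads = ["AAACCTGGGTTT"]  # one substitution at 0-based position 5
landscape = perfect_match_landscape(reference, query_reads, k=3)
print(zero_coverage_regions(landscape))  # [(5, 6)]
```

In this example the landscape drops to zero only at the substituted position, so the nature of the variant can then be worked out locally, within that interval, rather than across the whole genome.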
