Academic journal article Genetics

Predicting Discovery Rates of Genomic Features

Article excerpt

ABSTRACT Successful sequencing experiments require judicious sample selection. However, this selection must often be performed on the basis of limited preliminary data. Predicting the statistical properties of the final sample based on preliminary data can be challenging, because numerous uncertain model assumptions may be involved. Here, we ask whether we can predict "omics" variation across many samples by sequencing only a fraction of them. In the infinite-genome limit, we find that a pilot study sequencing 5% of a population is sufficient to predict the number of genetic variants in the entire population within 6% of the correct value, using an estimator agnostic to demography, selection, or population structure. To reach similar accuracy in a finite genome with millions of polymorphisms, the pilot study would require ~15% of the population. We present computationally efficient jackknife and linear programming methods that exhibit substantially less bias than the state of the art when applied to simulated data and subsampled 1000 Genomes Project data. Extrapolating based on the National Heart, Lung, and Blood Institute Exome Sequencing Project data, we predict that 7.2% of sites in the capture region would be variable in a sample of 50,000 African Americans and 8.8% in a European sample of equal size. Finally, we show how the linear programming method can also predict discovery rates of various genomic features, such as the number of transcription factor binding sites across different cell types.

(ProQuest: ... denotes formulae omitted.)

PREDICTING the genetic makeup of a large population sample based on a small subsample serves two distinct purposes. First, it can facilitate study design by providing the expected number of samples needed to achieve a given discovery goal, be it enough markers for a custom array design or enough rare variants to perform a well-powered burden test. Second, such predictions serve as a useful test for our statistical and evolutionary hypotheses about the population. Because evolutionary experiments for long-lived organisms are extremely difficult, predictions about evolution are hard to falsify. By contrast, predictions about the outcome of sequencing experiments can be easily tested, due to the rapid advances in sequencing technology. This opportunity to test our models should be taken advantage of. Here, we show that such predictions can be easily generated to high accuracy and in a way that is robust to many model assumptions such as mating patterns, selection, and population structure.

We are interested in predicting the number of sites that are variable for some "omic" feature across samples. Features may be of different types (SNPs, indels, binding sites, epigenetic markers, etc.), and samples may be cells, cell types, whole organisms, or even entire populations or species. For definiteness, we focus primarily on predicting the discovery rate of genetic variants (SNPs or indels) in a population. Because variant discovery is central to many large-scale sequencing efforts, many methods have been proposed to predict the number of variants discovered as a function of sample size in a given population. Some methods require explicit modeling of complex evolutionary scenarios, fitting parameters to existing data (Eberle and Kruglyak 2000; Durrett and Limic 2001; Gutenkunst et al. 2009; Gravel et al. 2011; Lukić et al. 2011). These approaches enable model testing, but they are complex and computationally intensive. The interpretation of model parameters can also be challenging (Myers et al. 2008). Ionita-Laza et al. (2009) pointed out a similarity between the variant discovery problem and a well-studied species counting problem in ecology (Pollock et al. 1990), and this led to the development of tractable heuristic approaches that rely on simple assumptions about underlying distributions of allele frequencies (Ionita-Laza et al. 2009; Ionita-Laza and Laird 2010; Gravel et al. …
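
As a concrete illustration of "the number of variants discovered as a function of sample size", the following minimal sketch computes the expected number of variable sites in a random subsample of an already sequenced sample, using the standard hypergeometric projection of a site frequency spectrum. It covers only the easy direction (interpolating to smaller samples); it is not the jackknife or linear programming method of the article, which addresses extrapolation to larger samples. The function name, the input layout, and the toy spectrum are assumptions made for illustration and do not come from the article.

```python
from math import comb


def expected_variable_sites(sfs, m):
    """Expected number of polymorphic sites in a random subsample.

    sfs[i - 1] is the number of sites at which exactly i of the n sampled
    chromosomes carry the derived allele, for i = 1 .. n - 1 (hypothetical
    input layout). m is the subsample size, with 1 <= m <= n.
    """
    n = len(sfs) + 1
    if not 1 <= m <= n:
        raise ValueError("subsample size m must satisfy 1 <= m <= n")

    total = comb(n, m)  # number of ways to choose m of the n chromosomes
    expected = 0.0
    for i, count in enumerate(sfs, start=1):
        # The site is monomorphic in the subsample if all m draws are
        # ancestral or all m draws are derived (hypergeometric sampling).
        p_all_ancestral = comb(n - i, m) / total
        p_all_derived = comb(i, m) / total
        expected += count * (1.0 - p_all_ancestral - p_all_derived)
    return expected


# Toy spectrum for n = 4 chromosomes: 6 singletons, 3 doubletons, 2 tripletons.
toy_sfs = [6, 3, 2]
discovery_curve = {m: expected_variable_sites(toy_sfs, m) for m in range(1, 5)}
```

Evaluating the function over a range of `m` traces the discovery curve up to the observed sample size; predicting the curve beyond that size requires additional assumptions or estimators such as those the article develops.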
