Academic journal article Genetics

Recombination and the Properties of Tajima's D in the Context of Approximate-Likelihood Calculation

Article excerpt


I show that Tajima's D, a commonly used summary of the site-frequency spectrum for single-nucleotide polymorphism data, is a biased summary of the site-frequency spectrum. Under neutral models, this bias depends on the population recombination rate. This bias of D in summarizing the data makes inference of demographic parameters sensitive to assumptions about recombination rates.

THE complexity of population-genetic data provides serious challenges when making statistical inferences about the demographic and selective histories of populations. Because full-likelihood methods are either intractable or overly computationally intensive for many models of interest, inferences tend to be obtained on the basis of summaries of the data, rather than on the full data themselves. Broadly speaking, there are two types of summaries of single-nucleotide polymorphism (SNP) data. The first is summaries of the site-frequency spectrum (i.e., the distribution of SNP frequencies in the sample), of which Tajima's D (TAJIMA 1989) is the best known. The second class is summaries of the associations between SNPs in the sample (linkage disequilibrium) (e.g., WALL 1999).

In Equation 1, ε is a tolerance (or set of tolerances) that represents a trade-off between accuracy and required computational time. As ε decreases, Equation 1 converges to a more precise estimate of the likelihood of the data, at the cost of having to simulate more replicates to accurately estimate P(T>\&) (see BEAUMONT et al. 2002 for a detailed discussion). Simulating over a grid of θ provides an estimate of the likelihood surface.
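The role of the tolerance ε can be illustrated with a minimal sketch. Equation 1 itself is not shown in this excerpt, so the code assumes a generic rejection form: for each value of θ on a grid, the likelihood is approximated by the fraction of simulated replicates whose summary statistic falls within ε of the observed value. The summary used here (the number of segregating sites), the toy neutral coalescent simulator without recombination, and all function names are illustrative assumptions, not the procedure of any cited study.

```python
import random


def poisson_draw(mean, rng):
    """Poisson variate: count unit-rate arrivals that occur before `mean`."""
    count = 0
    arrival = rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(n, theta, rng):
    """Toy simulator (assumption): number of segregating sites S for n
    sequences under the standard neutral coalescent, no recombination."""
    total_length = 0.0
    for k in range(2, n + 1):
        # Time with k lineages, in units of 2N generations.
        t_k = rng.expovariate(k * (k - 1) / 2.0)
        total_length += k * t_k
    # Mutations fall on the tree at rate theta/2 per unit of branch length.
    return poisson_draw(theta * total_length / 2.0, rng)


def approx_likelihood(s_obs, n, theta, eps, reps, rng):
    """Fraction of replicates with |S_sim - S_obs| <= eps."""
    hits = 0
    for _ in range(reps):
        if abs(simulate_segregating_sites(n, theta, rng) - s_obs) <= eps:
            hits += 1
    return hits / reps


def likelihood_surface(s_obs, n, theta_grid, eps, reps, seed=1):
    """Approximate likelihood at each grid value of theta."""
    rng = random.Random(seed)
    return {theta: approx_likelihood(s_obs, n, theta, eps, reps, rng)
            for theta in theta_grid}
```

Calling `likelihood_surface(10, 10, [1.0, 3.5, 20.0], eps=1, reps=2000)` returns one acceptance fraction per grid value. Shrinking `eps` tightens the match to the observed summary but lowers the acceptance rate, so more replicates are needed for a stable estimate.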

While the use of summary statistics results in a loss of information, it is possible to develop inference procedures that perform well (e.g., WALL 2000; BEAUMONT et al. 2002; PRZEWORSKI 2003). The performance of estimators depends in part on the choice of both which summary statistic to use and how many different summaries to use (BEAUMONT et al. 2002). This note emphasizes a third point that affects accuracy of inference, namely how accurately the summary statistic summarizes the data. Several recent articles have applied the SSL approach in an attempt to distinguish the effects of demography from natural selection in natural or domesticated populations (GLINKA et al. 2003; AKEY et al. 2004; TENAILLON et al. 2004). In general, the idea is to find a demographic model that fits the data and then identify outlier loci that are putative targets of recent selection. When the interest is in using the SSL approach to make inference about demographic models, it is common to conduct the coalescent simulations without recombination (GLINKA et al. 2003; AKEY et al. 2004) (see TENAILLON et al. 2004 for an exception). The rationale for this is twofold. First, the appropriate value of the population recombination rate to use in the simulations is often unclear, and both genetic map-based and population genetic-based estimates have their drawbacks. Second, it is generally argued that the expectation of many summary statistics does not depend on the recombination rate, but rather that the variance decreases as ρ increases. These arguments appear to have been interpreted to imply that point estimates obtained via simulation without recombination will be correct, but that the size of confidence intervals will be overestimated.

Here I focus on a widely used summary of the site-frequency spectrum, TAJIMA's (1989) D. D is defined as the standardized difference between two estimators of θ, the population mutation rate. The numerator of the statistic is $\hat{\theta}_\pi - \hat{\theta}_W$, where $\hat{\theta}_\pi$ is the mean number of pairwise differences between individuals in the sample (TAJIMA 1983), and $\hat{\theta}_W$ is WATTERSON'S (1975) moment estimator. The denominator of D, which we label here as k, is an estimate of the standard deviation of the numerator, $\sqrt{V(\hat{\theta}_\pi - \hat{\theta}_W)}$, and is calculated as a function of the number of segregating sites in the sample (Equation 38 in TAJIMA 1989). First, I show that the expectation of Tajima's D is a biased summary of the expected site-frequency spectrum. …
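The definition of D can be made concrete with a short sketch. It follows the standard published form of the statistic (TAJIMA 1989): the numerator is the mean pairwise difference minus Watterson's estimator, and the denominator is the square root of a variance estimate that depends only on the sample size and the number of segregating sites. The function name and the input format (one allele count per segregating site) are assumptions made for illustration.

```python
import math


def tajimas_d(allele_counts, n):
    """Tajima's D from per-site allele counts in a sample of n sequences.

    allele_counts: one entry per segregating site, giving the number of
    sequences (1..n-1) that carry one of the two alleles at that site.
    """
    if n < 4:
        # The variance estimate is zero for n = 3, so D is undefined there.
        raise ValueError("Tajima's D requires n >= 4")
    s = len(allele_counts)
    if s == 0:
        return math.nan  # D is undefined without segregating sites

    # Mean number of pairwise differences.
    pairs = n * (n - 1) / 2.0
    theta_pi = sum(c * (n - c) for c in allele_counts) / pairs

    # Watterson's moment estimator.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = s / a1

    # Constants of the variance estimate.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    denominator = math.sqrt(e1 * s + e2 * s * (s - 1))
    return (theta_pi - theta_w) / denominator
```

A sample in which every segregating site is a singleton, for example `tajimas_d([1] * 5, 10)`, gives a negative D, whereas sites at intermediate frequency, for example `tajimas_d([5] * 5, 10)`, give a positive D.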
