Maximum-Likelihood Estimation of Site-Specific Mutation Rates in Human Mitochondrial DNA from Partial Phylogenetic Classification
Rosset, Saharon; Wells, R. Spencer; Soria-Hernanz, David F.; Tyler-Smith, Chris; Royyuru, Ajay K.; Behar, Doron M. Genetics
The mitochondrial DNA hypervariable segment I (HVS-I) is widely used in studies of human evolutionary genetics, and therefore accurate estimates of mutation rates among nucleotide sites in this region are essential. We have developed a novel maximum-likelihood methodology for estimating site-specific mutation rates from partial phylogenetic information, such as haplogroup association. The resulting estimation problem is a generalized linear model with a nonstandard link function. We develop inference and bias correction tools for our estimates and a hypothesis-testing approach for site independence. We demonstrate our methodology using 16,609 HVS-I samples from the Genographic Project. Our results suggest that mutation rates among nucleotide sites in HVS-I are highly variable. The 16,400-16,500 region exhibits significantly lower rates compared to other regions, suggesting potential functional constraints. Several loci identified in the literature as possible termination-associated sequences (TAS) do not yield statistically slower rates than the rest of HVS-I, casting doubt on their functional importance. Our tests do not reject the null hypothesis of independent mutation rates among nucleotide sites, supporting the use of the site-independence assumption for analyzing HVS-I. Potential extensions of our methodology include its application to estimation of mutation rates in other genetic regions, such as Y chromosome short tandem repeats.
It has long been known that different regions in the genome mutate at vastly different rates (Tamura and Nei 1993). In particular, for the mitochondrial DNA (mtDNA) two hypervariable segments (HVS) have been identified and named HVS-I and HVS-II. Even within these segments, the mutation rates of the various sites are not fixed. Tamura and Nei (1993) showed that there is strong statistical support for a Gamma "prior" distribution of mutation rates across the mtDNA control region (which contains both HVS-I and HVS-II), with a shape parameter α = 0.1, implying many orders of magnitude difference in rates between the fastest and slowest mutating sites in these segments. Yang (1993, 1994) described methodologies for integrating this Gamma prior into maximum-likelihood (ML) phylogeny estimation.
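To make the "many orders of magnitude" implication concrete, the following sketch computes the 5th and 95th percentiles of a Gamma distribution with shape α = 0.1 and reports how many powers of ten separate them. It is an illustration only and not part of the original analysis: the function names, the normalization to mean rate 1 (scale = 1/α), and the choice of percentiles are assumptions made here. The spread in orders of magnitude does not depend on the scale.

```python
import math

def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series of the regularized lower incomplete gamma."""
    z = x / scale
    term = 1.0 / shape                  # n = 0 term of sum z^n / (a (a+1) ... (a+n))
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (shape + n)
        total += term
        n += 1
    # P(a, z) = z^a * exp(-z) / Gamma(a) * total, evaluated in log space
    return math.exp(shape * math.log(z) - z - math.lgamma(shape)) * total

def gamma_quantile(p, shape, scale):
    """Invert the CDF by bisection on log10(x), since quantiles span many decades."""
    lo, hi = -300.0, 3.0                # assumed bracket for log10(x)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(10.0 ** mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return 10.0 ** (0.5 * (lo + hi))

ALPHA = 0.1                             # shape parameter quoted in the text
SCALE = 1.0 / ALPHA                     # assumption: normalize the mean rate to 1

q05 = gamma_quantile(0.05, ALPHA, SCALE)
q95 = gamma_quantile(0.95, ALPHA, SCALE)
orders_of_magnitude = math.log10(q95 / q05)
print(f"5th percentile rate:  {q05:.3e}")
print(f"95th percentile rate: {q95:.3e}")
print(f"spread: {orders_of_magnitude:.1f} orders of magnitude")
```

Under these assumptions the slow and fast tails of the rate distribution differ by more than ten orders of magnitude, which is what makes a single average rate a poor summary for these segments.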
Beyond the distribution of mutation rates, the next step is to estimate site-specific mutation and/or substitution rates. These are potentially important for understanding the functionality of various genetic regions, as different functions are likely to impose selection or sequence constraints, and these can be inferred through a good estimation methodology for site-specific rates. For example, in mtDNA HVS-I several termination-associated sequences (TAS) have been identified, on the basis of sequence properties and conservation indexes. These are suspected to play a central role in regulation between replication termination and elongation of the mtDNA (Falkenberg et al. 2007). If these suspicions are well founded, we would expect strong structural constraints to apply to these sequences and hence expect them to be subject to purifying selection. Although mutations might occur at a similar rate to the rest of HVS-I, the resulting variants would be selected against. In the presence of selection, neutral theory assumptions made by practically every estimation approach, including ours below, are violated, but the reduced diversity due to selection is still expected to lead to lower estimates. Thus, the task of identifying (or verifying) the functionality of such regions can be addressed in a hypothesis-testing framework for the "null" hypothesis of neutrality (under which the statistical model is valid and the rates should be "average") against the alternative of slower rates.
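The one-sided logic of such a test can be illustrated with a deliberately simple stand-in, which is not the maximum-likelihood method developed in this article. Assume, purely for illustration, that under the null every site has the same opportunity to mutate, so that each observed mutation event falls in the candidate region with probability equal to the fraction of sites the region covers. The number of events in the region is then binomial, and the lower tail gives a P-value against the alternative of slower rates. The function name and all counts below are hypothetical.

```python
from math import comb

def one_sided_lower_pvalue(k_region, n_total, site_fraction):
    """P(X <= k_region) for X ~ Binomial(n_total, site_fraction).

    A small value suggests fewer events in the region than expected if
    its sites mutated at the average rate (the null hypothesis).
    """
    p = site_fraction
    return sum(
        comb(n_total, i) * p**i * (1.0 - p) ** (n_total - i)
        for i in range(k_region + 1)
    )

# Hypothetical counts, chosen only to exercise the function:
# 200 inferred mutation events overall, a region covering 10% of the
# sites, and 8 of the events falling inside that region.
p_value = one_sided_lower_pvalue(k_region=8, n_total=200, site_fraction=0.10)
print(f"one-sided P-value: {p_value:.4g}")
```

With these made-up counts the null expectation is 20 events in the region, so observing 8 yields a small lower-tail probability. A real analysis would need to account for unequal rates among sites and for the phylogenetic structure of the sample, which this sketch ignores.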
Numerous approaches have been developed for estimating site-specific mutation rates. One flavor (e.g., Yang 1995; Siepel and Haussler 2005) is based on analyzing the mutation rates as a Markov process and hence identifying their sequential correlation. …