Academic journal article Genetics

Inference of Gene Flow in the Process of Speciation: An Efficient Maximum-Likelihood Method for the Isolation-with-Initial-Migration Model

Article excerpt

(ProQuest: ... denotes formulae omitted.)

The two-deme, isolation-with-migration (IM) model is a population genetic model in which, at some point in the past, an ancestral population divided into two subpopulations. After the division, these subpopulations exchanged migrants at a constant rate until the present. The IM model has become one of the most popular probabilistic models used to study genetic diversity under gene flow and population structure. Although the model is applicable to populations within species, many researchers use it to detect gene flow between diverging populations and to investigate the role of gene flow in the process of speciation. A meta-analysis of published research articles that used the IM model in the context of speciation can be found in Pinho and Hey (2010).

Several authors have developed computational methods to fit IM models to real DNA data. Some of the most-used programs are aimed at data sets consisting of a large number of sequences from a small number of loci. This is the case of MDIV (Nielsen and Wakeley 2001), IM (Hey and Nielsen 2004; Hey 2005), IMa (Hey and Nielsen 2007), and IMa2 (Hey 2010), which rely on Bayesian Markov chain Monte Carlo (MCMC) methods to estimate the model parameters and are computationally very intensive.

In the past decade, the availability of large data sets spanning the entire genome has increased significantly. However, the MCMC-based implementations of the IM model referred to above are computationally expensive even for small numbers of loci, and their running times increase linearly with the number of loci (Wang and Hey 2010). Fitting an IM model also provides a rather simplified picture of the divergence process, which for some research purposes is clearly insufficient (for example, if one wishes to know whether a process of sympatric speciation has been completed, or whether gene flow occurred due to secondary contact). In addition, Becquet and Przeworski (2009) and Strasburg and Rieseberg (2010) showed that inference based on the programs IM and IMa can become unreliable if any of the assumptions made about population structure, recombination, or linkage is severely violated. For these reasons, there has been a significant increase in the demand for methods that not only scale well to genome-sized data, but are also able to estimate increasingly realistic models.

To improve efficiency and scalability, one possible strategy is to work with summary statistics rather than full data patterns. The MCMC-based program MIMAR of Becquet and Przeworski (2007, 2009) uses the four summary statistics studied by Wakeley and Hey (1997) to fit the IM model, and drops the assumption of no intralocus recombination. Gutenkunst et al. (2009) introduced a method based on the joint sample frequency spectrum (JSFS) that is able to fit a range of demographic models incorporating multiple populations, periods of migration and admixture, splits and joins of populations, and changes in population sizes. Based on the same type of data, the more recent implementation of Kamm et al. (2016) can already deal with a large number of individuals and populations, but does not yet include gene flow.
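To make the JSFS concrete, the following minimal Python sketch tabulates it for two populations. It is only an illustration of the summary statistic itself, not the inference method of Gutenkunst et al. (2009) or Kamm et al. (2016). The function name `joint_sfs`, the encoding of haplotypes as strings of '0' (ancestral) and '1' (derived), and the toy data are all assumptions made for this example.

```python
def joint_sfs(pop1, pop2):
    """Tabulate the joint sample frequency spectrum of two samples.

    pop1, pop2: lists of haplotypes, each a string of '0'/'1'
    (hypothetical encoding: 0 = ancestral allele, 1 = derived allele).
    Returns a (len(pop1) + 1) x (len(pop2) + 1) table whose entry [i][j]
    is the number of sites with i derived copies in pop1 and j in pop2.
    """
    n_sites = len(pop1[0])
    if any(len(h) != n_sites for h in pop1 + pop2):
        raise ValueError("all haplotypes must cover the same sites")
    n1, n2 = len(pop1), len(pop2)
    sfs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for site in range(n_sites):
        i = sum(h[site] == "1" for h in pop1)  # derived count in pop1
        j = sum(h[site] == "1" for h in pop2)  # derived count in pop2
        sfs[i][j] += 1
    return sfs


# Toy data: three haplotypes from population 1, two from population 2,
# four sites each.
pop1 = ["0101", "0100", "0001"]
pop2 = ["0110", "0100"]
jsfs = joint_sfs(pop1, pop2)
for row in jsfs:
    print(row)
```

The table discards which sites fall in which class and the linkage between them, which is what makes JSFS-based methods cheap to evaluate on genome-scale data.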

Genome-scale data sets, even when stemming from just a few individuals, tend to be more informative than data sets consisting of many individuals but covering only a relatively short genomic region. In fact, as the sample size for a single locus increases, the probability that an extra sequence adds a deep (i.e., informative) branch to the coalescent tree quickly becomes negligible (see for example Hein et al. 2005, pp. 28-29). Data sets of a small number of individuals per locus are also more suitable for likelihood-based inference: if at each locus the observation consists only of a few sequences, the coalescent process of these sequences is relatively simple and can more easily be used to derive the likelihood for the locus concerned.
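The diminishing return from extra sequences at one locus can be illustrated with a standard result for the neutral coalescent in a single panmictic population: while k lineages remain, the expected waiting time to the next coalescence is 2/(k(k-1)) coalescent time units, so the expected tree height for n sequences is 2(1 - 1/n). The sketch below uses this as a proxy for the argument above; it assumes the basic single-population coalescent rather than the IM model, and the function names are hypothetical.

```python
from fractions import Fraction


def expected_tree_height(n):
    """Expected time to the most recent common ancestor of n sequences
    under the standard neutral coalescent, in coalescent time units.
    Sums the expected waiting times 2 / (k (k - 1)) for k = 2, ..., n."""
    return sum(Fraction(2, k * (k - 1)) for k in range(2, n + 1))


def height_gain(n):
    """Expected increase in tree height when an (n + 1)-th sequence
    is added to a sample of n sequences."""
    return expected_tree_height(n + 1) - expected_tree_height(n)


for n in (2, 5, 10, 50):
    print(n, float(expected_tree_height(n)), float(height_gain(n)))
```

The expected height is already half of its limiting value of 2 with only two sequences, and the gain from one additional sequence is 2/(n(n+1)), which shrinks quadratically in n. Sequencing further loci, by contrast, adds independent genealogies.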

Among the methods designed for whole-genome sequence data of only a few individuals are those of Mailund et al. …
