Academic journal article Genetics

A Bayesian Approach to Inferring the Phylogenetic Structure of Communities from Metagenomic Data

Article excerpt

ABSTRACT Metagenomics provides a powerful new tool set for investigating evolutionary interactions with the environment. However, an absence of model-based statistical methods means that researchers are often not able to make full use of this complex information. We present a Bayesian method for inferring the phylogenetic relationship among related organisms found within metagenomic samples. Our approach exploits variation in the frequency of taxa among samples to simultaneously infer each lineage haplotype, the phylogenetic tree connecting them, and their frequency within each sample. Applications of the algorithm to simulated data show that our method can recover a substantial fraction of the phylogenetic structure even in the presence of high rates of migration among sample sites. We provide examples of the method applied to data from green sulfur bacteria recovered from an Antarctic lake, plastids from mixed Plasmodium falciparum infections, and virulent Neisseria meningitidis samples.

(ProQuest: ... denotes formulae omitted.)

METAGENOMICS, the purification and sequencing of DNA from environmental samples without any culturing step, represents an important new tool for investigating how microbes interact with, mold, and adapt to their environments (Tyson et al. 2004; Allen and Banfield 2005; Gill et al. 2006; Preidis and Versalovic 2009). Metagenomics can also be applied to any situation where genetic variability exists within a sample, such as microbiomes, mixed infections, and cancer. Many metagenomic analyses relate the overall DNA content of samples to environmental phenotypes (Tringe et al. 2005; Kurokawa et al. 2007). We take up a different problem: the reconstruction of organismal composition for each sample. Overall DNA content provides useful information on overall community function, but many physiological and evolutionary processes may be understood only at the organismal level (Partida-Martinez and Hertweck 2005; Martinez et al. 2009).

Recent improvements in sequencing technology allow the collection of large numbers (>10^6) of short reads of DNA sequence (40-100 bp) from within a sample (Schmeisser et al. 2007; Bentley et al. 2008). For notational clarity we refer to each sample as a pool. The simplest approach to inferring composition is in terms of the frequency of known sequences within each sample (von Mering et al. 2007; Chaffron et al. 2010). This approach typically works well for assessing variation at broad scales when individual reads can be mapped onto the nearest reference genome within the tree of life (Matsen et al. 2010; Berger et al. 2011; Berger and Stamatakis 2011; Löytynoja et al. 2012). However, at finer scales, and in particular if one is interested in the evolution taking place within the samples themselves, the structure of relationships among organisms will generally not be known in advance and so must be inferred from data.

Figure 1, left, illustrates the evolutionary scenario that we assume underlies the data. The phylogeny's tips correspond to individual cells, and color indicates the pool of origin. Since individual reads are typically short and will thus contain limited phylogenetic information, it is not feasible to reconstruct a resolved tree where each read corresponds to a single taxon. We therefore attempt to infer a simplified phylogeny in which the terminal nodes represent groups of related organisms or lineages (Figure 1, right). Each lineage defines a haplotype of allele states for the single-nucleotide polymorphisms (SNPs) within the data and makes up a proportion of the organisms within a pool, shown by the colored bar. As indicated by the shaded cones, the SNP pattern of organisms within a lineage may vary, due to recent, low-frequency variation or sequencing errors.
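To make the relationship between lineages, haplotypes, and pools concrete, the following minimal Python sketch computes the fraction of reads expected to show a given allele at each SNP in each pool, and simulates read counts from it. It is an illustration under stated assumptions, not the authors' algorithm: the biallelic 0/1 coding, the symmetric `error_rate`, the toy values, and all function and variable names are hypothetical.

```python
import random

# Hypothetical toy inputs: 3 lineages, 4 biallelic SNPs, 2 pools.
# haplotypes[k][s] is the allele (0 or 1) carried by lineage k at SNP s.
haplotypes = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
# frequencies[p][k] is the proportion of pool p made up of lineage k.
frequencies = [
    [0.5, 0.25, 0.25],
    [0.0, 0.5, 0.5],
]


def expected_allele_frequency(pool_freqs, haplotypes, error_rate=0.0):
    """Expected fraction of reads showing allele 1 at each SNP in one pool."""
    n_snps = len(haplotypes[0])
    expected = []
    for s in range(n_snps):
        # Mix the lineage haplotypes according to their share of the pool.
        p = sum(f * h[s] for f, h in zip(pool_freqs, haplotypes))
        # Assumed symmetric error: an observed allele is flipped with
        # probability error_rate.
        expected.append(p * (1 - error_rate) + (1 - p) * error_rate)
    return expected


def simulate_read_counts(pool_freqs, haplotypes, depth, error_rate=0.01, rng=None):
    """Draw, per SNP, the number of reads (out of depth) showing allele 1."""
    rng = rng or random.Random(0)
    probs = expected_allele_frequency(pool_freqs, haplotypes, error_rate)
    return [sum(rng.random() < p for _ in range(depth)) for p in probs]


pool_expectations = [expected_allele_frequency(f, haplotypes) for f in frequencies]
counts = simulate_read_counts(frequencies[0], haplotypes, depth=100)
```

Because the two pools mix the same haplotypes in different proportions, their expected allele frequencies differ from SNP to SNP. It is this variation among pools that carries information about the underlying lineages.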

One similar, but easier, problem is phasing in diploid organisms. In this case, the goal is to reconstruct haplotypes (i.e., the sequences of the two copies of each chromosome) given the genotypes at each diploid locus. …
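As a small illustration of why phasing is ambiguous, the Python sketch below enumerates every unordered pair of haplotypes compatible with one diploid genotype. It is a brute-force sketch, not a practical phasing method. The genotype coding (0, 1, or 2 copies of the alternate allele per locus) and the function name are assumptions made for this example.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a diploid genotype.

    genotype[i] is the number of alternate alleles (0, 1, or 2) at locus i.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        # Homozygous loci are fixed: 0 -> allele 0 on both copies, 2 -> allele 1.
        hap_a = [g // 2 for g in genotype]
        hap_b = list(hap_a)
        # At each heterozygous locus the two copies carry opposite alleles.
        for i, b in zip(het, bits):
            hap_a[i] = b
            hap_b[i] = 1 - b
        # Sort so that (a, b) and (b, a) count as the same pair.
        pairs.add(tuple(sorted((tuple(hap_a), tuple(hap_b)))))
    return sorted(pairs)


pairs = compatible_haplotype_pairs([1, 2, 1, 0])
```

A genotype with h heterozygous loci (h ≥ 1) admits 2^(h-1) distinct haplotype pairs, so the example genotype, with two heterozygous loci, has two possible phasings.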
