Fast and Robust Characterization of Time-Heterogeneous Sequence Evolutionary Processes Using Substitution Mapping
et al. (2012) Fast and Robust Characterization of Time-Heterogeneous Sequence Evolutionary
Processes Using Substitution Mapping. PLoS ONE 7(3): e33852. doi:10.1371/journal.pone.0033852
Jonathan Romiguier
Emeric Figuet
Nicolas Galtier
Emmanuel J. P. Douzery
Bastien Boussau
Julien Y. Dutheil
Vincent Ranwez
David Liberles, University of Wyoming, United States of America
1 Institut des Sciences de l'Evolution de Montpellier, CNRS-Université Montpellier 2, Montpellier, France, 2 Laboratoire de Biométrie et Biologie Evolutive, CNRS-Université Lyon 1, Villeurbanne, France, 3 Unité Mixte de Recherche Amélioration génétique et adaptation des plantes méditerranéennes et tropicales, Montpellier SupAgro, Montpellier, France, 4 Department of Integrative Biology, University of California, Berkeley, California, United States of America
Genes and genomes do not evolve similarly in all branches of the tree of life. Detecting and characterizing the heterogeneity in time, and between lineages, of the nucleotide (or amino acid) substitution process is an important goal of current molecular evolutionary research. This task is typically achieved through the use of non-homogeneous models of sequence evolution, which, being highly parametrized and computationally demanding, are not appropriate for large-scale analyses. Here we investigate an alternative methodological option based on probabilistic substitution mapping. The idea is first to reconstruct the substitutional history of each site of an alignment under a homogeneous model of sequence evolution, and then to characterize variations in the substitution process across lineages based on substitution counts. Using simulated and published datasets, we demonstrate that probabilistic substitution mapping is robust, in that it typically provides an accurate reconstruction of sequence ancestry even when the true process is heterogeneous but a homogeneous model is adopted. Consequently, we show that the new approach is essentially as efficient as existing methods and far faster (up to 25,000 times), thus paving the way for a systematic survey of substitution process heterogeneity across genes and lineages.
These authors contributed equally to this work.
Mapping the history of nucleotide or amino-acid changes onto
the evolutionary history of a gene, as depicted by a phylogenetic
tree, is of central interest to researchers in molecular evolution. This
procedure, called mutation or substitution mapping, is useful for
characterizing the molecular evolutionary processes of DNA and
protein sequences, and their variations across sites and lineages.
Substitution mapping has been successfully applied to study various
aspects of molecular evolution, including coevolution [1], [2],
selective constraints in proteins [3], deviations from the molecular
clock hypothesis [4], and changes in selective regimes [5]. Beyond
this, substitution mapping has also enabled the implementation of a
number of models that were otherwise intractable [6], [7].
Over the past 10 years, several inference methods have been
developed to achieve substitution mapping. Formally, the problem
is to identify, for every site in a sequence alignment, the kinds of
character changes that occurred, and their location in the
underlying phylogeny. A substitution mapping method therefore
takes an alignment and a tree as input and returns, as output, an
estimate of the number and nature of substitutions that have occurred,
for each site of the alignment and each branch of the tree. The
naive substitution mapping procedure [8] involves first
reconstructing all ancestral sequences at each node of the
phylogenetic tree. Secondly, for each site, one substitution is
mapped on a branch when two different states are observed for
this site at the two extremities of the branch. The main drawback
of such an approach is that it overlooks the uncertainty of the
ancestral sequence inference.
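
The naive procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is given as a list of (parent, child) branches, the ancestral sequences at internal nodes are assumed to have been reconstructed beforehand by some other method, and the function name and toy data are hypothetical and do not come from [8].

```python
def naive_substitution_map(branches, states):
    """Map one substitution per (branch, site) when the end states differ.

    branches: list of (parent, child) node-name pairs describing the tree.
    states:   dict mapping node name -> sequence string (observed at tips,
              previously reconstructed at internal nodes), equal lengths.
    Returns a dict {(parent, child): [0 or 1 for each site]}.
    """
    counts = {}
    for parent, child in branches:
        counts[(parent, child)] = [
            int(a != b) for a, b in zip(states[parent], states[child])
        ]
    return counts


# Hypothetical toy data: three tips (A, B, C) and two internal nodes.
branches = [("root", "n1"), ("root", "C"), ("n1", "A"), ("n1", "B")]
states = {
    "root": "ACGT",  # reconstructed
    "n1": "ACGA",    # reconstructed
    "A": "ACGA",
    "B": "TCGA",
    "C": "ACGT",
}

naive_map = naive_substitution_map(branches, states)
per_branch_totals = {branch: sum(sites) for branch, sites in naive_map.items()}
```

Because each internal sequence is treated as known, a site whose ancestral state is poorly supported contributes exactly as much as a well-supported one, which is the drawback noted above.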
Two improved mapping methods have been proposed:
Bayesian Mutational Mapping (BMM, [9]) and Probabilistic
Substitution Mapping (PSM, [1], [10]). They both use Markov
chains to model the substitution process and account for the
uncertainty in the ancestral states [11], [12]. BMM is a procedure
that generates a substitution scenario compatible with the data,
together with its associated likelihood. This procedure was not
designed to produce human-readable substitution maps, but rather
to integrate a statistic of interest over the set of possible substitution
maps. Because it is a sampling procedure, BMM is computationally
expensive, although more stable or efficient samplers have
recently been proposed [13], [14], [15]. PSM is an
analytical procedure that computes the probability distribution
of the number of substitutions that occurred at each site of the
alignment and on each branch of the phylogenetic tree. Dutheil et al.
(2005) [1] report how to compute the mean number of total
substitutions per branch and site, but it is also possible to compute
higher-order moments of the distribution, or to distinguish between
different types of substitutions ([14], [15] and the present study).
PSM is a maximum likelihood solution of BMM for some
particular statistics (the mean of the branch- and site-specific
distributions of the expected number of substitutions in the case of
Dutheil et al. (2005) [1]) and is therefore quite fast to compute for a
given tree and substitution model, which is a significant advantage
given the increasing amount of molecular data produced
by high-throughput sequencing. In addition, the relative simplicity
and computational efficiency of substitution mapping procedures have
promoted their use in several analyses (e.g. [16]). They have
been shown to facilitate parameter estimation of complex models
when used within expectation-maximization procedures [17].
Suitable statistics based on substitution maps could therefore
serve as straightforward descriptors of molecular evolution that
can be used as proxies for more complex ones.
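
To illustrate the kind of quantity that an analytical mapping works with, the sketch below computes the mean number of substitutions on a single branch, conditional on the states at both ends. It is a simplified illustration and not the implementation of [1]: the Jukes-Cantor (JC69) model, the midpoint-rule numerical integration, and all function names are assumptions made here. It relies on the standard Markov chain identity that the expected number of jumps, jointly with ending in state b, is the integral over the substitution time s of P_ai(s) q_ij P_jb(t - s), summed over all i != j. A full mapping would additionally average over the possible ancestral states at each node, weighted by their probabilities.

```python
import math

STATES = "ACGT"


def jc69_prob(i, j, t):
    """JC69 transition probability from i to j over a branch of length t
    (t measured in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def joint_expected_count(a, b, t, steps=2000):
    """E[N * 1{end state is b} | start state is a] on a branch of length t.

    Midpoint-rule integration of P_ai(s) * q_ij * P_jb(t - s) over s,
    summed over i != j, with q_ij = 1/3 under the normalized JC69 model.
    """
    rate = 1.0 / 3.0
    h = t / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        for i in STATES:
            p_ai = jc69_prob(a, i, s)
            for j in STATES:
                if i != j:
                    total += p_ai * rate * jc69_prob(j, b, t - s)
    return total * h


def expected_substitutions(a, b, t):
    """Mean number of substitutions on the branch given both end states."""
    return joint_expected_count(a, b, t) / jc69_prob(a, b, t)


t = 0.1
same = expected_substitutions("A", "A", t)  # identical end states
diff = expected_substitutions("A", "C", t)  # different end states
```

Unlike the naive count, the conditional mean is not restricted to 0 or 1: it is slightly above 1 when the end states differ (hidden multiple substitutions) and slightly above 0 when they are identical. Summing the joint terms over all end states recovers the branch length t, which gives a convenient sanity check.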
One of the major advantages of substitution mapping is its power
to detect and characterize time-heterogeneous processes, i.e.
processes that vary across branches of the tree. Such variations,
when identified, can be linked to variations in selective pressure (e.g.
[18]) and mutation/fixation biases (e.g. [19]), or linked to
macroscopic features of species such as effective population size
(e.g. [20], [21]), ecological preferences [22] or life-history traits [23].
To detect heterogeneous processes, explicit models of
nonhomogeneous sequence evolution have been implemented in the
maximum-likelihood or Bayesian frameworks [22], [24], [25], [26].
However, these parameter-rich models can lead to
overparametrization issues and are computationally demanding, so
their usage is limited to relatively small subsets of the large amounts
of (...truncated)