Evolutionary Distances in the Twilight Zone—A Rational Kernel Approach
et al. (2010) Evolutionary Distances in the Twilight Zone-A Rational Kernel Approach. PLoS
ONE 5(12): e15788. doi:10.1371/journal.pone.0015788
Roland F. Schwarz
William Fletcher
Frank Förster
Benjamin Merget
Matthias Wolf
Jörg Schultz
Florian Markowetz
Wayne Delport, Prognosys Biosciences, United States of America
1 Cancer Research UK Cambridge Research Institute, University of Cambridge, Cambridge, United Kingdom, 2 Department of Genetics, Evolution and Environment and Centre for Mathematics and Physics in the Life Sciences and Experimental Biology, University College London, London, United Kingdom, 3 Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.
Funding: This work was partially funded by Cancer Research UK. Further support was provided by the Deutsche Forschungsgemeinschaft (DFG) grant (Mu-2831/1-1). WF is financially supported by an Engineering and Physical Sciences Research Council/Medical Research Council Doctoral Training Centre studentship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
State-of-the-art phylogenetic reconstruction methods are
currently being challenged. For a long time, multiple sequence
alignments followed by maximum-likelihood (ML) tree
reconstruction have been seen as the computationally expensive gold
standard for phylogenetic analyses [1,2]. Distance approaches that
base their inference on summary statistics have traditionally been
seen as a fast but less precise alternative [3]. However, recent
results point out that the gap between ML and distance methods
may be less pronounced than previously thought. For example, the
expected required sequence length for the reconstructed tree to
converge to the true phylogeny is not worse in distance-based
approaches than in ML [4]. Additionally, the quality of the
multiple sequence alignment heavily affects reconstruction
accuracy, a situation worsened by the NP-hardness of the alignment
problem and the heuristics used to cope with it [5–9]. The
problem of alignment errors arises especially on large-scale
phylogenies with many taxa that span a broad divergence range
[10], where many homologies lie in the twilight-zone of sequence
alignments [11].
In the light of these findings, alignment-free distance-based
reconstruction methods deserve special attention, as they
circumvent potential pitfalls of the multiple alignment approach,
especially with respect to divergent sequences, and can be
advantageous in speed possibly without sacrificing reconstruction
accuracy. Unfortunately, many purely alignment-free approaches
[12,13] lack unique biological motivation (for a comparison see
also [14]). Joint estimation of trees and alignments is
computationally expensive and relies heavily on heuristics and/or sampling
approaches [15–19]. The question of reconstructing phylogenies
directly without multiple alignment has only recently been tackled
[20] with promising results. We follow the basic principles of this
approach but here wish to present the phylogenetic reconstruction
problem in a different light.
Since there exists a one-to-one relationship between binary trees
and additive metrics [21], the phylogenetic problem of finding the
true tree is equivalent to finding the true additive dissimilarity
matrix. Finding additive distances is hard, so distance-based
approaches usually aim at finding a distance which is as close as
possible to the true additive one, so that the tree reconstruction
process which turns these non-additive distances into additive trees
finds the true tree as often as possible. Metrics in general, including
additive distances, can be thought of as being induced by a dot
product $\langle \cdot , \cdot \rangle$ in some Hilbert space of possibly infinite
dimension [22]. Key to phylogenetic reconstruction is constructing
a Hilbert space and associated dot product such that distances
between sequences are indeed a measure of evolutionary
divergence. Doing this explicitly is impossible if the space is of
infinite dimension. However, it can be achieved implicitly by
applying the so-called kernel trick [22]: a positive-definite (pd)
kernel function $k(\cdot , \cdot)$ in the input space (i.e. directly on the
sequences in our case) computes the scalar value of the dot
product in the Hilbert space without explicitly constructing it.
The kernel trick has been applied successfully in a variety of
different fields, including natural language processing, face
recognition, speech recognition and computational biology. Here
we extend its use to the problem of phylogenetic reconstruction.
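As a minimal sketch of how a kernel induces a distance, the squared distance between two sequences x and y follows from the dot product as d(x,y)^2 = k(x,x) - 2 k(x,y) + k(y,y). The Python example below uses a simple k-mer spectrum kernel purely as a stand-in. It is not the rational kernel derived in this article, and the function names and toy sequences are illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def spectrum_features(seq, k=2):
    """Explicit feature map: counts of every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=2):
    """Stand-in pd kernel: dot product of the k-mer count vectors of x and y."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    return sum(count * fy[kmer] for kmer, count in fx.items())


def kernel_distance(x, y, kernel=spectrum_kernel):
    """Distance induced by a pd kernel: sqrt(k(x,x) - 2 k(x,y) + k(y,y))."""
    squared = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return sqrt(max(squared, 0.0))  # clamp small negatives caused by rounding


# Hypothetical toy sequences; a real analysis would use homologous sequences.
seqs = ["ACGTACGT", "ACGTTCGT", "TTTTGGGG"]
matrix = [[kernel_distance(a, b) for b in seqs] for a in seqs]
```

The resulting matrix is symmetric with a zero diagonal and could be passed to any distance-based tree reconstruction method. Only the kernel function needs to change to obtain a different feature space, which is the point of the kernel trick.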
The major challenge here is finding the right pd kernel. The
pairwise similarity measure between sequences must map
sequences to an evolutionary feature space ruled by the
modification of sequences in terms of insertions, deletions and
substitutions. The natural distance in this space should then come
as close as possible to the true evolutionary distance on the
sequences.
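Whether a candidate distance matrix is additive, that is, whether it corresponds exactly to path lengths on some tree, can be tested with the classical four-point condition: for every four taxa, the two largest of the three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l) and d(i,l)+d(j,k) must be equal. The following Python sketch illustrates the check; the function name, the tolerance and the example matrices are assumptions made for illustration and do not come from this article.

```python
from itertools import combinations


def is_additive(dist, tol=1e-9):
    """Return True if the symmetric matrix dist meets the four-point condition."""
    n = len(dist)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            dist[i][j] + dist[k][l],
            dist[i][k] + dist[j][l],
            dist[i][l] + dist[j][k],
        ])
        # On a tree, the two largest of the three pairings must tie.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Path lengths on a hypothetical four-taxon tree: pendant edges of length
# 1, 2, 1, 4 for A, B, C, D and an internal edge of length 3 separating
# {A, B} from {C, D}.
tree_like = [
    [0, 3, 5, 8],
    [3, 0, 6, 9],
    [5, 6, 0, 5],
    [8, 9, 5, 0],
]

# The same matrix with the A-C distance perturbed, as estimation noise might do.
perturbed = [row[:] for row in tree_like]
perturbed[0][2] = perturbed[2][0] = 6
```

Here `is_additive(tree_like)` holds while `is_additive(perturbed)` does not. Distances estimated from real sequences are rarely exactly additive, which is why the aim is a distance that comes as close as possible to the additive one.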
In this article we derive such a kernel. Making use of classical
results from global pairwise alignment, we show how a different
formulation of the alignment problem maps sequences to a
feature space of insertions, deletions and substitutions and gives
rise to a pd kernel. We evaluate how accurately this similarity
measure reconstructs the topology of phylogenetic trees from
simulated and real data. We sh (...truncated)