Evolutionary Distances in the Twilight Zone—A Rational Kernel Approach (pdf)

The article PDF is available at:

http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0015788&type=printable

Citation: Schwarz RF, et al. (2010) Evolutionary Distances in the Twilight Zone-A Rational Kernel Approach. PLoS ONE 5(12): e15788. doi:10.1371/journal.pone.0015788

Authors: Roland F. Schwarz, William Fletcher, Frank Förster, Benjamin Merget, Matthias Wolf, Jörg Schultz, Florian Markowetz

Editor: Wayne Delport, Prognosys Biosciences, United States of America

Affiliations: 1 Cancer Research UK Cambridge Research Institute, University of Cambridge, Cambridge, United Kingdom, 2 Department of Genetics, Evolution and Environment and Centre for Mathematics and Physics in the Life Sciences and Experimental Biology, University College London, London, United Kingdom, 3 Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany

Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.

Funding: This work was partially funded by Cancer Research UK. Further support was provided by the Deutsche Forschungsgemeinschaft (DFG) grant (Mu-2831/1-1). WF is financially supported by an Engineering and Physical Sciences Research Council/Medical Research Council Doctoral Training Centre studentship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

State-of-the-art phylogenetic reconstruction methods are currently being challenged. For a long time, multiple sequence alignments followed by maximum-likelihood (ML) tree reconstruction have been seen as the computationally expensive gold standard for phylogenetic analyses [1,2]. Distance approaches that base their inference on summary statistics have traditionally been seen as a fast but less precise alternative [3]. However, recent results point out that the gap between ML and distance methods may be less pronounced than previously thought. For example, the expected sequence length required for the reconstructed tree to converge to the true phylogeny is not worse in distance-based approaches than in ML [4]. Additionally, the quality of the multiple sequence alignment heavily affects reconstruction accuracy, a situation worsened by the NP-hardness of the alignment problem and the heuristics used to cope with it [5-9]. The problem of alignment errors arises especially in large-scale phylogenies with many taxa that span a broad divergence range [10], where many homologies lie in the twilight zone of sequence alignments [11].
In the light of these findings, alignment-free distance-based reconstruction methods deserve special attention, as they circumvent potential pitfalls of the multiple alignment approach, especially with respect to divergent sequences, and can be advantageous in speed, possibly without sacrificing reconstruction accuracy. Unfortunately, many purely alignment-free approaches [12,13] lack unique biological motivation (for a comparison see also [14]). Joint estimation of trees and alignments is computationally expensive and relies heavily on heuristics and/or sampling approaches [15-19]. The question of reconstructing phylogenies directly without multiple alignment has only recently been tackled [20], with promising results. We follow the basic principles of this approach but here wish to present the phylogenetic reconstruction problem in a different light.

Since there exists a one-to-one relationship between binary trees and additive metrics [21], the phylogenetic problem of finding the true tree is equivalent to finding the true additive dissimilarity matrix. Finding additive distances is hard; distance-based approaches therefore usually aim at finding a distance which is as close as possible to the true additive one, so that the tree reconstruction process, which turns these non-additive distances into additive trees, finds the true tree as often as possible.

Metrics in general, including additive distances, can be thought of as being induced by a dot product ⟨·,·⟩ in some Hilbert space of possibly infinite dimension [22]. Key to phylogenetic reconstruction is constructing a Hilbert space and associated dot product such that distances between sequences are indeed a measure of evolutionary divergence. Doing this explicitly is impossible if the space is of infinite dimension. However, it can be achieved implicitly by applying the so-called kernel trick [22]: a positive-definite (pd) kernel function k(·,·) in the input space (i.e. directly on the sequences in our case) computes the scalar value of the dot product in the Hilbert space without explicitly constructing it. The kernel trick has been applied successfully in a variety of different fields, including natural language processing, face recognition, speech recognition and computational biology. Here we extend its use to the problem of phylogenetic reconstruction.

The major challenge here is finding the right pd kernel. The pairwise similarity measure between sequences must map sequences to an evolutionary feature space ruled by the modification of sequences in terms of insertions, deletions and substitutions. The natural distance in this space should then come as close as possible to the true evolutionary distance on the sequences. In this article we derive such a kernel. Making use of classical results from global pairwise alignment, we show how a different formulation of the alignment problem can map sequences to a feature space of insertions, deletions and substitutions and gives rise to a pd kernel. We study this similarity measure in its topological reconstruction accuracy of phylogenetic trees from simulated and real data. We sh (...truncated)
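As an illustration of the kernel trick described above, the following minimal sketch shows how a pd kernel on sequences induces a distance without the feature space ever being constructed. It assumes a simple k-mer spectrum kernel as a stand-in; this is not the rational kernel derived in the article, and the names `spectrum_kernel`, `kernel_distance` and the example sequences are hypothetical. The only ingredient taken from the text is the general relation d(x, y)² = k(x, x) + k(y, y) - 2 k(x, y).

```python
from collections import Counter
from math import sqrt


def spectrum_kernel(x: str, y: str, k: int = 2) -> int:
    """Toy pd kernel: dot product of the k-mer count vectors of x and y.

    Assumption: a stand-in for illustration only, not the article's kernel.
    """
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # Only k-mers present in both sequences contribute to the dot product.
    return sum(cx[m] * cy[m] for m in cx.keys() & cy.keys())


def kernel_distance(kernel, x: str, y: str) -> float:
    """Distance induced by a pd kernel: the norm of phi(x) - phi(y)."""
    squared = kernel(x, x) + kernel(y, y) - 2 * kernel(x, y)
    # Clamp at zero to guard against small negative values from rounding.
    return sqrt(max(squared, 0.0))


# Hypothetical example sequences; any strings over one alphabet would do.
seqs = ["ACGTACGT", "ACGTTCGT", "TTGTACAA"]
dist = [[kernel_distance(spectrum_kernel, a, b) for b in seqs] for a in seqs]

for row in dist:
    print(["%.3f" % d for d in row])
```

The resulting matrix `dist` is symmetric with a zero diagonal and could be passed to any distance-based tree reconstruction method. Such a matrix is in general not additive, which is why the choice of kernel matters for how often the true tree is recovered.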