Recco: recombination analysis using cost optimization

Article PDF: https://bioinformatics.oxfordjournals.org/content/22/9/1064.full.pdf

Jochen Maydt and Thomas Lengauer
Max-Planck-Institut für Informatik, Saarbrücken, Germany

Motivation: Recombination plays an important role in the evolution of many pathogens, such as HIV or malaria. Despite substantial prior work, there is still a pressing need for efficient and effective methods of detecting recombination and analyzing recombinant sequences.

Results: We introduce Recco, a novel fast method that, given a multiple sequence alignment, scores the cost of obtaining one of the sequences from the others by mutation and recombination. The algorithm comes with an illustrative visualization tool for locating recombination breakpoints. We analyze the sequence alignment with respect to all choices of the parameter α, which weights recombination cost against mutation cost. The analysis of the resulting cost curve yields additional information as to which sequence might be recombinant. On random genealogies, Recco is comparable to the algorithm Geneconv (Sawyer, 1989) in its power to detect recombination. For specific relevant recombination scenarios, Recco significantly outperforms Geneconv.

Availability: Recco is available at http://bioinf.mpi-inf.mpg.de/recco/

© The Author 2006. Published by Oxford University Press. All rights reserved.

1 INTRODUCTION

In comparison with tree-based phylogenetic analysis procedures, procedures for analyzing recombination are immature. Recent power studies of recombination detection methods showed that it can be hard even to decide whether there is recombination in a set of aligned sequences (Posada and Crandall, 2001; Wiuf et al., 2001). The questions of which sequence is the recombinant and where the recombination breakpoints lie are even more challenging.
Methods for analyzing recombination in molecular sequences fall into at least four categories: (1) recombination detection methods, (2) methods for deriving bounds on the number of recombination events (Song and Hein, 2005; Song et al., 2005), (3) network methods such as SplitsTree (Huson, 1998) and (4) inference methods based on the coalescent (Kuhner et al., 2000; Fearnhead and Donnelly, 2001). Each category is appropriate for different problem settings and can provide independent information on the recombination signal contained in a dataset. As recombination detection programs are conceptually closest to our approach, we limit our comments to this category in the following paragraphs.

In the past 20 years, more than 20 methods have been developed for detecting the presence of recombination in a sequence dataset. Links to implementations are on the website of D. Robertson (http://bioinf.man.ac.uk/~robertson/recombination/). More details and an evaluation of their accuracy can be found in recent power studies (Brown et al., 2001; Posada and Crandall, 2001; Wiuf et al., 2001) and in book chapters dealing with recombination detection (Salminen, 2003; Husmeier and Wright, 2004).

The earliest methods use statistical tests to check whether substitutions in the alignment are non-uniformly distributed, i.e. whether substitutions are significantly clustered owing to recombination, back mutation or other effects. Popular methods using this principle are the maximum χ² test (Smith, 1992; Spencer, 2003) and Geneconv (Sawyer, 1989). Geneconv searches for the longest conserved fragment between two sequences and determines whether its length is significant. Extensions also allow mutations within the fragments. Despite lacking any explicit model of evolution, substitution distribution methods are quite competitive (Posada and Crandall, 2001), with Geneconv performing as one of the best methods.
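The core of the longest-conserved-fragment idea can be sketched in a few lines. The following is only a rough illustration, not Geneconv's actual procedure: it finds the longest run of identical columns between two aligned sequences and omits the significance test and the extension that tolerates mutations. The function name `longest_identical_run` is hypothetical and chosen for this example.

```python
from typing import Tuple


def longest_identical_run(seq_a: str, seq_b: str) -> Tuple[int, int]:
    """Return (start, length) of the longest run of identical aligned columns.

    Hypothetical helper for illustration; ties are resolved in favor of the
    leftmost run. Both sequences must come from the same alignment.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    best_start, best_len = 0, 0
    run_start = 0  # first column of the current run of identical columns
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            # A substitution ends the current run; the next run starts after it.
            run_start = i + 1
            continue
        run_len = i - run_start + 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return best_start, best_len
```

An unusually long identical run between two otherwise divergent sequences is the kind of signal such methods then assess for significance, for example against a null distribution; that step is left out here.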
Many other methods detect a change of the phylogenetic distance signal in adjacent areas of the alignment. Some popular methods are PLATO (Grassly and Holmes, 1997), TOPAL (McGuire and Wright, 2000), PhyPro (Weiller, 1998) and SimPlot (Lole et al., 1999). These methods use either a global reference tree or a sliding window to detect local changes in the topology of the phylogenetic distance signal. A global reference tree is problematic for strong recombination signals, because a dominant phylogenetic tree can no longer be identified and artifacts are introduced into the tree (Schierup and Hein, 2000). A fixed window size determines the trade-off between localizing the breakpoints accurately and the ability to correctly infer recombination. Even though all these approaches use a model of (tree-like) evolution, they lack a model for recombination. Consequently, they only look for indirect evidence of recombination and thus may falsely detect recombination.

Another group of methods is more closely related to the coalescent framework, as they infer a restricted version of the evolutionary history subject to recombination. RecPars (Hein, 1993) allows for a different tree-like evolutionary history at each position and tries to heuristically minimize the cost of substitutions at each position under the associated tree as well as the cost of topology changes along the sequence of trees. Thus, RecPars does away with the window-size parameter, but adds recombination and mutation cost parameters. Husmeier and McGuire (2003) translate the idea of RecPars into a statistical framework and maximize the likelihood of topology changes and mutations. Although this approach is accurate, its major drawback is the high computational effort, which makes it inapplicable to datasets with more than a few sequences.
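To make the window-size trade-off mentioned above concrete, here is a minimal sketch of a sliding-window similarity scan between a query and one reference sequence. It uses plain fractional identity per window, which is a simplification and not the scoring used by any of the tools cited above; the function name `window_identity` and its parameters are assumptions made for this example.

```python
from typing import List, Tuple


def window_identity(
    query: str, reference: str, window: int, step: int = 1
) -> List[Tuple[int, float]]:
    """Return (window_start, fraction_identical) for each window position.

    Hypothetical helper for illustration. Both sequences must come from the
    same alignment and therefore have equal length.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    if window <= 0 or step <= 0 or window > len(query):
        raise ValueError("window and step must be positive; window must fit")

    # 1 where the aligned characters agree, 0 where they differ.
    matches = [int(q == r) for q, r in zip(query, reference)]

    profile = []
    for start in range(0, len(query) - window + 1, step):
        identical = sum(matches[start:start + window])
        profile.append((start, identical / window))
    return profile
```

With a large window the profile is smooth but a breakpoint is smeared over roughly one window length; with a small window the breakpoint is sharper but each value rests on few sites and is noisier.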
We present Recco, a fast and simple method, based on cost minimization and dynamic programming, for detecting recombination in a set of sequences and locating putative recombination breakpoints. The basic idea is to construct each sequence in the alignment in turn (temporarily considering it the putative recombinant) from the other sequences in the alignment, using only the mutation and recombination operators. The output of Recco bears some resemblance to that of SimPlot (Lole et al., 1999) and represents the local sequence similarity of the putative recombinant to the other sequences. In contrast to SimPlot, there is no need for a sliding window and hence no limitation of spatial resolution. The minimum cost solution identifies the best recombination breakpoints and also the parental sequences. Recco has only two tunable parameters, recombination cost and mutation cost. In practice, we only need to consider a single parameter α representing the cost of mutation relative to recombination. We present an approach for finding the values of α at which the solution changes structurally. This can be condensed further into a single indicator for the presence of recombination in the alignment.

The dynamic program we use is based on the work of Kececioglu and Gusfield (1998). It employs insertion, deletion, recombination and mutation for producing a single sequence from two other sequences. Our method is a restric (...truncated)