Recco: recombination analysis using cost optimization
Jochen Maydt and Thomas Lengauer
Max-Planck Institut für Informatik, Saarbrücken, Germany
Motivation: Recombination plays an important role in the evolution of many pathogens, such as HIV or malaria. Despite substantial prior work, there is still a pressing need for efficient and effective methods of detecting recombination and analyzing recombinant sequences.
Results: We introduce Recco, a novel fast method that, given a multiple sequence alignment, scores the cost of obtaining one of the sequences from the others by mutation and recombination. The algorithm comes with an illustrative visualization tool for locating recombination breakpoints. We analyze the sequence alignment with respect to all choices of the parameter α weighting recombination cost against mutation cost. The analysis of the resulting cost curve yields additional information as to which sequence might be recombinant. On random genealogies Recco is comparable in its power of detecting recombination with the algorithm Geneconv (Sawyer, 1989). For specific relevant recombination scenarios Recco significantly outperforms Geneconv.
Availability: Recco is available at http://bioinf.mpi-inf.mpg.de/recco/
© The Author 2006. Published by Oxford University Press. All rights reserved.
1 INTRODUCTION
In comparison with tree-based phylogenetic analysis procedures,
procedures for analyzing recombination are immature. Recent
power studies of recombination detection methods showed
that it can be hard even to decide whether there is recombination
in a set of aligned sequences (Posada and Crandall, 2001; Wiuf
et al., 2001). Determining which sequence is the recombinant
and where the recombination breakpoints lie is even more
challenging.
Methods for analyzing recombination in molecular sequences fall
into at least four categories: (1) recombination detection methods,
(2) methods for deriving bounds on the number of recombination
events (Song and Hein, 2005; Song et al., 2005), (3) network
methods such as SplitsTree (Huson, 1998) and (4) inference methods
based on the coalescent (Kuhner et al., 2000; Fearnhead and
Donnelly, 2001). Each category is appropriate for different problem
settings and can provide independent information on the
recombination signal contained in a dataset. As recombination detection
programs are conceptually closest to our approach, we limit our
comments to this category in the following paragraphs.
In the past 20 years, more than 20 methods have been developed
for detecting the presence of recombination in a sequence dataset.
More details and an evaluation of their accuracy can be found in
recent power studies (Brown et al., 2001; Posada and Crandall,
2001; Wiuf et al., 2001) and in book chapters (Salminen, 2003;
Husmeier and Wright, 2004) dealing with recombination detection.
Links to implementations of these methods are on the website of D. Robertson
(http://bioinf.man.ac.uk/~robertson/recombination/).
The earliest methods use statistical tests for checking whether
substitutions in the alignment are non-uniformly distributed, i.e.
whether substitutions are significantly clustered owing to
recombination, back mutation or other effects. Popular methods using
this principle are the maximum χ²-test (Smith, 1992; Spencer,
2003) and Geneconv (Sawyer, 1989). Geneconv searches for the
longest conserved fragment between two sequences and determines
whether it is significant. Extensions also allow for including
mutations in the fragments. Despite the lack of any explicit model of
evolution, substitution distribution methods are quite competitive
(Posada and Crandall, 2001), with Geneconv performing as one of
the best methods.
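The fragment search described above can be illustrated with a minimal Python sketch. It only finds the longest run of identical columns between two aligned sequences; the significance assessment that Geneconv performs and its extension to fragments containing mutations are omitted. The function name `longest_identical_run` is hypothetical and is not part of Geneconv.

```python
def longest_identical_run(seq_a: str, seq_b: str) -> tuple[int, int]:
    """Return (start, length) of the longest run of identical aligned columns.

    Illustrative only: this is the fragment search, not a significance test.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    best_start, best_len = 0, 0
    run_start = 0  # first column of the current run of identical columns
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            # A mismatch ends the current run; the next run starts after it.
            run_start = i + 1
            continue
        run_len = i - run_start + 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return best_start, best_len
```

For example, `longest_identical_run("ACGTACGT", "ACCTACGA")` reports the four-column fragment starting at column 3 (0-based).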
Many other methods detect a change of the phylogenetic distance
signal in adjacent areas of the alignment. Some popular methods
are PLATO (Grassly and Holmes, 1997), TOPAL (McGuire and
Wright, 2000), PhyPro (Weiller, 1998) and SimPlot (Lole et al.,
1999). These methods either use a global reference tree or a sliding
window to detect local changes in the topology of the phylogenetic
distance signal. A global reference tree is problematic for strong
recombination signals, as a dominant phylogenetic tree cannot
be identified anymore and artifacts are introduced into the tree
(Schierup and Hein, 2000). A fixed window size determines
the trade-off between localizing the breakpoints accurately and
the ability to correctly infer recombination. Even though all these
approaches use a model of (tree-like) evolution, they lack a model for
recombination. Consequently, they only look for indirect evidence
of recombination and thus may falsely detect recombination.
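The window-size trade-off mentioned above can be made concrete with a small Python sketch. It assumes a simplification that none of the cited tools necessarily uses: similarity is plain per-column identity between a query and one reference sequence, averaged over a fixed window. The function name `window_similarity` and its parameters are hypothetical.

```python
def window_similarity(
    query: str, reference: str, window: int, step: int = 1
) -> list[tuple[int, float]]:
    """Return (window_start, fraction_identical) for each sliding window."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    if window <= 0 or window > len(query):
        raise ValueError("window must be between 1 and the alignment length")
    if step <= 0:
        raise ValueError("step must be positive")
    # 1 where the two sequences agree in a column, 0 otherwise.
    matches = [1 if q == r else 0 for q, r in zip(query, reference)]
    profile = []
    for start in range(0, len(query) - window + 1, step):
        identical = sum(matches[start:start + window])
        profile.append((start, identical / window))
    return profile
```

A large `window` smooths the profile and blurs the position at which similarity drops, whereas a small `window` localizes the drop but makes each estimate noisier.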
Another group of methods is more closely related to the
coalescent framework as they infer a restricted version of the evolutionary
history subject to recombination. RecPars (Hein, 1993) allows for a
different tree-like evolutionary history at each position and tries to
heuristically minimize the cost for substitution at each position
under the associated tree as well as the cost for topology changes
along the sequence of trees. Thus, RecPars does away with the
window-size parameter, but adds recombination and mutation
cost parameters. Husmeier and McGuire (2003) translate the idea
of RecPars into a statistical framework and maximize the likelihood
of topology changes and mutations. Although this approach is
accurate, a major drawback is its high computational cost, which
makes it inapplicable to datasets with more than a
few sequences.
We present Recco, a fast and simple method for detecting
recombination in a set of sequences and locating putative
recombination breakpoints that is based on cost minimization and dynamic
programming. The basic idea is to construct each sequence in the
alignment (temporarily considered the putative recombinant) in
turn, from the other sequences in the alignment using only the
mutation and recombination operators. The output of Recco
bears some resemblance to SimPlot (Lole et al., 1999) and
represents local sequence similarity of the putative recombinant with
the other sequences. In contrast to SimPlot, there is no need for
a sliding window and hence no limitation of spatial resolution.
The minimum cost solution identifies the best recombination
breakpoints and also the parental sequences. Recco has only two tunable
parameters, recombination and mutation cost. In practice, we only
need to consider a single parameter α representing the cost of
mutation relative to recombination. We present an approach for
finding the values of α at which the solution changes structurally.
This can be condensed further into a single indicator for the
presence of recombination in the alignment.
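The basic idea can be sketched as a small dynamic program in Python. This is a simplified illustration under stated assumptions, not the Recco implementation: gaps are treated like ordinary characters, every mismatch between the query and the copied sequence costs a fixed `mutation_cost`, and every switch to a different parental sequence costs a fixed `recombination_cost`. The function name and both parameters are hypothetical.

```python
def min_cost_reconstruction(
    query: str,
    references: list[str],
    mutation_cost: float = 1.0,
    recombination_cost: float = 3.0,
) -> tuple[float, list[int]]:
    """Return (total_cost, path); path[i] is the reference copied at column i."""
    if not references:
        raise ValueError("at least one reference sequence is required")
    n = len(query)
    if any(len(ref) != n for ref in references):
        raise ValueError("all sequences must be aligned to the same length")
    if n == 0:
        return 0.0, []
    m = len(references)

    def site_cost(k: int, i: int) -> float:
        # Copying column i from reference k is free if it matches the query.
        return 0.0 if references[k][i] == query[i] else mutation_cost

    # cost[k]: cheapest way to explain columns 0..i when column i is copied from k.
    cost = [site_cost(k, 0) for k in range(m)]
    back = []  # back[i - 1][k]: reference used at column i - 1 on that cheapest way
    for i in range(1, n):
        best_prev = min(range(m), key=cost.__getitem__)
        switch = cost[best_prev] + recombination_cost
        new_cost, pointers = [], []
        for k in range(m):
            # Either keep copying from k, or recombine from the best predecessor.
            if cost[k] <= switch:
                prev, base = k, cost[k]
            else:
                prev, base = best_prev, switch
            new_cost.append(base + site_cost(k, i))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    # Trace back the minimum-cost path of parental sequences.
    k = min(range(m), key=cost.__getitem__)
    total = cost[k]
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return total, path
```

With the query `"AAAACCCC"` and the references `["AAAAGGGG", "TTTTCCCC"]`, the default costs yield one breakpoint between columns 3 and 4, while a `recombination_cost` of 10 makes four mutations cheaper than one recombination and the breakpoint disappears. Re-running the sketch over a range of cost ratios is one way to see where the solution changes structurally.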
The dynamic program we use is based on the work of Kececioglu
and Gusfield (1998). It employs insertion, deletion, recombination
and mutation for producing a single sequence from two other
sequences. Our method is a restric (...truncated)