Stepwise detection of recombination breakpoints in sequence alignments
BIOINFORMATICS
ORIGINAL PAPER
Vol. 21 no. 5 2005, pages 589–595
doi:10.1093/bioinformatics/bti040
Sequence analysis
Jinko Graham¹,∗, Brad McNeney¹ and Françoise Seillier-Moiseiwitsch²
¹Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada V5A 1S6
and ²Division of Biostatistics and Bioinformatics, Lombardi Cancer Center, Georgetown University,
Washington, DC 20057, USA
ABSTRACT
Motivation: We propose a stepwise approach to identify recombination breakpoints in a sequence alignment. The approach can be
applied to any recombination detection method that uses a permutation test and provides estimates of breakpoints.
Results: We illustrate the approach by analyses of a simulated dataset
and alignments of real data from HIV-1 and human chromosome 7.
The presented simulation results compare the statistical properties of
one-step and two-step procedures. More breakpoints are found with a
two-step procedure than with a single application of a given method,
particularly for higher recombination rates. At higher recombination
rates, the additional breakpoints were located at the cost of only a
slight increase in the number of falsely declared breakpoints. However,
a large proportion of breakpoints still go undetected.
Availability: A makefile and C source code for phylogenetic profiling
and the maximum χ² method, tested with the gcc compiler on Linux
and Windows XP, are available at http://stat-db.stat.sfu.ca/stepwise/
Contact:
INTRODUCTION
Recombination leads to different evolutionary histories for different
sites within samples of sequences from a population. The multiple
correlated histories that result provide more evolutionary information
than a single common history. Thus, the presence of recombination
can improve the estimation and testing of genetic parameters in population biology. For example, genomic regions with recombination
are preferred for detecting geographic subdivision when migration
between subpopulations is relatively low (Hudson et al., 1992).
Per se, locating recombination breakpoints plays a role in understanding gene genealogies (e.g. DuBose et al., 1988) and haplotype
structure within populations (e.g. Daly et al., 2001). Locating breakpoints is also essential to assessing the possibility of an individual
being infected by two genetically diverse viral strains (that have subsequently recombined). For instance, in the case of HIV-1, there is
evidence of recombination between strains from the same subtype (e.g.
Groenink et al., 1992) and different subtypes (e.g. Leitner et al.,
1995; Fang et al., 2004).
∗ To whom correspondence should be addressed.
The strength of signal left by a recombination event varies and is
affected by factors such as the mutation rate, the level of divergence
of the parental sequences that gave rise to the recombinant, how
far back in time the recombination event occurred and the relative
numbers of descendants of the recombinant and parental sequences
in the alignment (e.g. Weiller, 1998; Posada et al., 2002). Many
recombination events have little or no impact on the data and so are
difficult or impossible to detect (Hudson and Kaplan, 1985; Myers
and Griffiths, 2003). Likelihood methods for inference of recombination rates (e.g. Griffiths and Marjoram, 1996; Kuhner et al., 2000;
Nielsen, 2000; Fearnhead and Donnelly, 2001) can take into account
such undetectable events.
A variety of methods have been developed to detect recombination
within alignments. Posada and Crandall (2001) provide a review and
comparison (see also Brown et al., 2001; Wiuf et al., 2001). Several
of these methods also estimate the location of breakpoints within
the alignment and are therefore useful for locating breakpoints not
proposed in advance.
Since some recombination events would leave stronger signals
than others, conditioning on previously found breakpoints can reduce
the unexplained variability in the data and improve a method’s ability
to find further breakpoints. We introduce such a stepwise approach.
The approach can be applied with any permutation-based method
for detecting recombination that also identifies breakpoint locations. Examples of such methods include phylogenetic profiling
(Phylpro) (Weiller, 1998) and the maximum χ² (MaxChi) method
(Smith, 1992), as implemented by Posada and Crandall (2001) and
Wiuf et al. (2001), Chimaera (Posada, 2002) and the Geneconv
method (Sawyer, 1989). We illustrate the approach with analyses
of a simulated dataset and alignments of HIV-1 env gene sequences
and single nucleotide polymorphisms (SNPs) in a 150 kb region of
human chromosome 7. Following this, we present simulation results comparing statistical properties of the one-step and two-step
procedures.
SYSTEMS AND METHODS
For detecting recombination breakpoints that are not proposed in advance,
several methods may be used in conjunction with permutation tests. Loosely
speaking, each possible breakpoint or fragment with different ancestry within
the alignment is considered, and the strength of its recombination signal is
summarized by some site- or fragment-specific measure. The set of rank-ordered measures may then be considered in permutation tests. Assuming
that sites have independent mutation processes with identically distributed
outcomes, their permutations are equally likely outcomes of the same random evolutionary process, under the null hypothesis that all sites share the
same ancestry (no recombination). A null distribution for the rank-ordered
measures can thus be obtained by permuting sites in the alignment. The
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email:
Received on May 18, 2004; revised on July 29, 2004; accepted on September 3, 2004
Advance Access publication September 23, 2004
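The permutation test described above, in which sites are permuted under the null hypothesis of no recombination, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the representation of an alignment as a list of equal-length strings, and in particular toy_breakpoint_signal, a hypothetical statistic that stands in for a real measure such as MaxChi or Phylpro. The sketch also reduces the rank-ordered set of measures to its maximum, which is a simplification.

```python
import random

def permute_sites(alignment, rng):
    """Shuffle the columns (sites) of an alignment, keeping each column intact."""
    order = list(range(len(alignment[0])))
    rng.shuffle(order)
    return ["".join(seq[i] for i in order) for seq in alignment]

def toy_breakpoint_signal(alignment):
    """Hypothetical statistic (not MaxChi or Phylpro): for the first two
    sequences, the largest absolute difference in mismatch rate between the
    sites to the left and to the right of any candidate breakpoint."""
    a, b = alignment[0], alignment[1]
    mismatch = [int(x != y) for x, y in zip(a, b)]
    n = len(mismatch)
    total = sum(mismatch)
    best, left = 0.0, 0
    for k in range(1, n):  # candidate breakpoint between sites k-1 and k
        left += mismatch[k - 1]
        best = max(best, abs(left / k - (total - left) / (n - k)))
    return best

def permutation_p_value(alignment, statistic, n_perm=999, seed=0):
    """Monte Carlo p-value under the null that all sites share one ancestry."""
    rng = random.Random(seed)
    observed = statistic(alignment)
    hits = sum(
        statistic(permute_sites(alignment, rng)) >= observed
        for _ in range(n_perm)
    )
    # Count the observed arrangement as one of the permutations.
    return (hits + 1) / (n_perm + 1)

# Two sequences that agree on the left half and differ on the right half.
example = ["A" * 40, "A" * 20 + "C" * 20]
p = permutation_p_value(example, toy_breakpoint_signal)
```

Because whole columns are shuffled, each permuted alignment keeps the same set of site patterns as the data and only the order of sites changes, which is what the null hypothesis of a shared ancestry treats as exchangeable.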
A new stepwise procedure
Permutation tests assume that sites with the same ancestry are independent outcomes of the same random evolutionary process. Known breakpoints
define segments of the alignment whose ancestries may differ. Hence, permutation of sites is appropriate within these segments but not between them.
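As an illustration of permuting sites within segments but not between them, the following Python sketch shuffles columns separately inside each segment defined by a set of known breakpoints. The function name, the list-of-strings alignment format and the convention that a breakpoint b (0-based) places the boundary immediately before site b are assumptions made for this example, not details taken from the authors' C implementation.

```python
import random

def permute_within_segments(alignment, breakpoints, rng):
    """Shuffle sites separately inside each segment delimited by breakpoints.

    Assumed convention: a breakpoint b puts a segment boundary immediately
    before the 0-based site index b. With no breakpoints, this reduces to
    permuting all sites in the alignment.
    """
    n_sites = len(alignment[0])
    bounds = [0] + sorted(breakpoints) + [n_sites]
    order = []
    for start, end in zip(bounds, bounds[1:]):
        block = list(range(start, end))
        rng.shuffle(block)  # sites never leave their own segment
        order.extend(block)
    return ["".join(seq[i] for i in order) for seq in alignment]

# Eight sites with one known breakpoint before site 4.
example = ["ABCDEFGH", "abcdefgh"]
shuffled = permute_within_segments(example, [4], random.Random(0))
```

Substituting a function like this for a whole-alignment shuffle gives a null distribution that conditions on the supplied breakpoints, since any difference in ancestry between the segments is preserved in every permuted alignment.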
The idea of permuting sites within segments may be used in a stepwise procedure that, at each step, conditions on breakpoints declared at earlier stages of the
analysis. Specifically, in the first step, the null hypothesis is that there are no
recombination breakpoints and the permutation null distribution is obtained
by permuting all sites in the alignment. If any breakpoints are declared, we
proceed to a second step in which the null hypothesis is that there are no
additional recombination breakpoints. At the second step, the null distribution is constructed by permuting sites within segments of the alignment with
common ancestries given the breakpoints declared at the first stage. Conditioning on previously found breakpoints reduces vari (...truncated)