Stepwise detection of recombination breakpoints in sequence alignments

BIOINFORMATICS ORIGINAL PAPER
Vol. 21 no. 5 2005, pages 589–595
doi:10.1093/bioinformatics/bti040

Sequence analysis

Jinko Graham¹,∗, Brad McNeney¹ and Françoise Seillier-Moiseiwitsch²

¹ Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada V5A 1S6 and ² Division of Biostatistics and Bioinformatics, Lombardi Cancer Center, Georgetown University, Washington, DC, WA 20057, USA

ABSTRACT

Motivation: We propose a stepwise approach to identify recombination breakpoints in a sequence alignment. The approach can be applied to any recombination detection method that uses a permutation test and provides estimates of breakpoints.

Results: We illustrate the approach by analyses of a simulated dataset and alignments of real data from HIV-1 and human chromosome 7. The presented simulation results compare the statistical properties of one-step and two-step procedures. More breakpoints are found with a two-step procedure than with a single application of a given method, particularly for higher recombination rates. At higher recombination rates, the additional breakpoints were located at the cost of only a slight increase in the number of falsely declared breakpoints. However, a large proportion of breakpoints still go undetected.

Availability: A makefile and C source code for phylogenetic profiling and the maximum χ² method, tested with the gcc compiler on Linux and WindowsXP, are available at http://stat-db.stat.sfu.ca/stepwise/

Contact:

INTRODUCTION

Recombination leads to different evolutionary histories for different sites within samples of sequences from a population. The multiple correlated histories that result provide more evolutionary information than a single common history. Thus, the presence of recombination can improve the estimation and testing of genetic parameters in population biology.
For example, genomic regions with recombination are preferred for detecting geographic subdivision when migration between subpopulations is relatively low (Hudson et al., 1992). Locating recombination breakpoints is also of interest in its own right: it plays a role in understanding gene genealogies (e.g. DuBose et al., 1988) and haplotype structure within populations (e.g. Daly et al., 2001). Locating breakpoints is also essential to assessing the possibility of an individual being infected by two genetically diverse viral strains (that have subsequently recombined). For instance, in the case of HIV-1, there is evidence of recombination of strains from the same subtype (e.g. Groenink et al., 1992) and different subtypes (e.g. Leitner et al., 1995; Fang et al., 2004).

The strength of signal left by a recombination event varies and is affected by factors such as the mutation rate, the level of divergence of the parental sequences that gave rise to the recombinant, how far back in time the recombination event occurred and the relative numbers of descendants of the recombinant and parental sequences in the alignment (e.g. Weiller, 1998; Posada et al., 2002). Many recombination events have little or no impact on the data and so are difficult or impossible to detect (Hudson and Kaplan, 1985; Myers and Griffiths, 2003). Likelihood methods for inference of recombination rates (e.g. Griffiths and Marjoram, 1996; Kuhner et al., 2000; Nielsen, 2000; Fearnhead and Donnelly, 2001) can take into account such undetectable events.

∗ To whom correspondence should be addressed.

A variety of methods have been developed to detect recombination within alignments. Posada and Crandall (2001) provide a review and comparison (see also Brown et al., 2001; Wiuf et al., 2001). Several of these methods also estimate the location of breakpoints within the alignment and are therefore useful for locating breakpoints not proposed in advance.
Since some recombination events leave stronger signals than others, conditioning on previously found breakpoints can reduce the unexplained variability in the data and improve a method’s ability to find further breakpoints. We introduce such a stepwise approach. The approach can be applied with any permutation-based method for detecting recombination that also identifies breakpoint locations. Examples of such methods include phylogenetic profiling (Phylpro) (Weiller, 1998) and the maximum χ² (MaxChi) method (Smith, 1992), as implemented by Posada and Crandall (2001) and Wiuf et al. (2001), Chimaera (Posada, 2002) and the Geneconv method (Sawyer, 1989). We illustrate the approach with analyses of a simulated dataset and alignments of HIV-1 env gene sequences and single nucleotide polymorphisms (SNPs) in a 150 kb region of human chromosome 7. Following this, we present simulation results comparing statistical properties of the one-step and two-step procedures.

SYSTEMS AND METHODS

For detecting recombination breakpoints that are not proposed in advance, several methods may be used in conjunction with permutation tests. Loosely speaking, each possible breakpoint or fragment with different ancestry within the alignment is considered, and the strength of its recombination signal is summarized by some site- or fragment-specific measure. The set of rank-ordered measures may then be considered in permutation tests. Assuming that sites have independent mutation processes with identically distributed outcomes, their permutations are equally likely outcomes of the same random evolutionary process, under the null hypothesis that all sites share the same ancestry (no recombination). A null distribution for the rank-ordered measures can thus be obtained by permuting sites in the alignment. The

© The Author 2004. Published by Oxford University Press. All rights reserved.
For Permissions, please email:

Received on May 18, 2004; revised on July 29, 2004; accepted on September 3, 2004
Advance Access publication September 23, 2004

A new stepwise procedure

Permutation tests assume that sites with the same ancestry are independent outcomes of the same random evolutionary process. Known breakpoints define segments of the alignment whose ancestries may differ. Hence, permutation of sites is appropriate within these segments but not between them. The idea of permuting sites within segments may be used in a stepwise procedure that, at each step, conditions on breakpoints declared at earlier stages of the analysis.

Specifically, in the first step, the null hypothesis is that there are no recombination breakpoints and the permutation null distribution is obtained by permuting all sites in the alignment. If any breakpoints are declared, we proceed to a second step in which the null hypothesis is that there are no additional recombination breakpoints. At the second step, the null distribution is constructed by permuting sites within segments of the alignment with common ancestries given the breakpoints declared at the first stage. Conditioning on previously found breakpoints reduces vari (...truncated)
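
The within-segment permutation described under "A new stepwise procedure" can be sketched as follows. This is a minimal Python illustration, not the authors' C implementation. The function names `segment_bounds` and `permute_sites`, the representation of an alignment as a list of equal-length strings, and the convention that a breakpoint b separates column b - 1 from column b are all assumptions made for this example. With no breakpoints the function permutes all sites, as in the first step; with declared breakpoints it permutes sites only within segments, as in the second step.

```python
import random


def segment_bounds(n_sites, breakpoints):
    """Return half-open (start, end) column ranges delimited by breakpoints.

    Assumed convention (not from the paper): a breakpoint b separates
    column b - 1 from column b. Breakpoints outside 1..n_sites-1 are ignored.
    """
    cuts = sorted({b for b in breakpoints if 0 < b < n_sites})
    edges = [0] + cuts + [n_sites]
    return list(zip(edges[:-1], edges[1:]))


def permute_sites(alignment, breakpoints=(), rng=None):
    """Shuffle alignment columns within each segment, never across segments.

    alignment   -- list of equal-length strings, one per sequence (assumed format)
    breakpoints -- column positions declared at an earlier step; empty for step one
    rng         -- optional random.Random instance, for reproducibility
    """
    if rng is None:
        rng = random.Random()
    n_sites = len(alignment[0])
    order = []
    for start, end in segment_bounds(n_sites, breakpoints):
        block = list(range(start, end))
        rng.shuffle(block)  # permute column indices inside this segment only
        order.extend(block)
    # Apply the same column order to every sequence so that sites stay intact.
    return ["".join(seq[i] for i in order) for seq in alignment]
```

Under these assumptions, a null distribution would be built by recomputing the chosen recombination statistic on many alignments returned by `permute_sites`, passing the breakpoints declared at the first step when carrying out the second.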