The Power to Detect Recombination Using the Coalescent
Mol. Biol. Evol.
0737-4038
Celeste J. Brown 0 1 3 5
Ethan C. Garner 0 1 3 4
A. Keith Dunker 0 1 3
Paul Joyce 1 2 3 5
0 School of Molecular Biosciences, Washington State University
1 School of Molecular Biosciences, Washington State University, Pullman, Washington 99164-4660
2 Department of Mathematics, University of Idaho
3 Abbreviation: LRT, likelihood ratio test
4 Present address: Department of Biochemistry and Biophysics, University of California at San Francisco
5 Division of Statistics, University of Idaho
There is a wide variety of models for estimating the phylogenetic relationships among amino acid and nucleotide sequences sampled from organisms at the population, species, and kingdom levels (reviewed by Swofford et al. 1996). One of the assumptions of these models is that recombination has not occurred in the history of the sampled sequences. When recombination is present, the relationships may be illustrated by a bifurcating graph (fig. 1), which is a network of relationships between parts of all sequences (Griffiths and Marjoram 1996). Most commonly used phylogeny methods do not construct such a network, and other measures, such as construction of separate phylogenies, must be taken to find relationships using these methods. Recombination may also be a problem for tandemly repeated genes, where intragenic recombination, also known as gene conversion, may occur at a substantial rate (Wiuf 2000). Before investigators use techniques that assume there is no recombination, they must be convinced that recombination is not present. Even low levels of recombination may have profound effects on phylogeny reconstruction and on the conclusions one draws, so it is important to use a method that has a good chance of detecting low levels of recombination. Statistical power analysis can be used to compare methods and identify the best one. Once a researcher is convinced that a method for detecting recombination has substantial power, that method may be used with confidence. Our purpose is to show that rigorous statistical methods exist for evaluating programs that detect recombination.
Power is the probability that a statistical test will reject the null hypothesis. This probability is calculated using a statistical model that includes both the null and alternative hypotheses. Since the calculation depends on the parameters of the model, power is not a single probability, but rather a series of probabilities, one for each parameter value.
Key words: recombination; statistical power; gene conversion; hypothesis testing; likelihood ratio
For tests of recombination, the parameter of interest is the recombination rate, r (recombinations per mutation per site per generation), and the null hypothesis is that there is no recombination in the history of the sample (r = 0). The power of a test depends
on many factors: the level of significance or type I error,
the sample size, and how far the alternative hypothesis
is from the null hypothesis. The α-level, or type I error,
that is chosen is the probability that the null hypothesis
will be rejected when it is true. This level should not be
too high, since looking for recombination that does not
exist can be an aggravating task. A power analysis uses
data simulated under the null hypothesis and various
alternative hypotheses to determine how often the
method rejects the null hypothesis for a specific type I error
rate. The power should increase toward 100% as r
increases; the method whose power increases the fastest
is the most powerful.
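The simulation-based procedure just described can be sketched in a few lines of Python. This is only an illustrative outline, not the analysis used here: `toy_simulate` and `toy_statistic` are hypothetical stand-ins (Gaussian data whose mean shifts with r) for a real sequence simulator and a real recombination test, and the critical value is taken as an empirical quantile of the statistic under the null hypothesis.

```python
import random

def critical_value(null_stats, alpha):
    """Empirical (1 - alpha) quantile of the test statistic under the null."""
    ordered = sorted(null_stats)
    k = min(len(ordered) - 1, int((1.0 - alpha) * len(ordered)))
    return ordered[k]

def estimate_power(simulate, statistic, r_values, alpha=0.05, n_reps=2000, seed=1):
    """Estimate power at each r as the fraction of simulated data sets rejected.

    simulate(r, rng) returns one data set; statistic(data) returns a number
    for which large values count as evidence against r = 0.
    """
    rng = random.Random(seed)
    # Calibrate the rejection threshold on data simulated with r = 0.
    null_stats = [statistic(simulate(0.0, rng)) for _ in range(n_reps)]
    crit = critical_value(null_stats, alpha)
    power = {}
    for r in r_values:
        rejections = sum(statistic(simulate(r, rng)) > crit for _ in range(n_reps))
        power[r] = rejections / n_reps
    return power

# Hypothetical stand-ins, NOT a recombination model: 50 Gaussian draws
# whose mean equals r, tested with the sample mean.
def toy_simulate(r, rng, n=50):
    return [rng.gauss(r, 1.0) for _ in range(n)]

def toy_statistic(data):
    return sum(data) / len(data)

power_curve = estimate_power(toy_simulate, toy_statistic, [0.0, 0.1, 0.2, 0.4, 0.8])
for r, p in sorted(power_curve.items()):
    print(f"r = {r:.1f}  estimated power = {p:.3f}")
```

At r = 0 the estimated rejection rate should sit near the chosen type I error rate, and it should rise toward 1 as r grows. Comparing two detection methods amounts to running the same loop with each method's statistic on the same simulated data sets.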
Other researchers have investigated the power of
various methods for detecting gene conversion. PLATO
is a program that detects spatial phylogenetic
heterogeneity and can be used to detect recombination (Grassly
and Holmes 1997). Grassly and Holmes (1997) tested
the power of PLATO to detect recombination by
simulating phylogenies with different topologies under a
single coalescent model and then manually recombining
sequences between phylogenies. They also compared
the results from PLATO with a well-documented
example of recombination among the argF genes of
Neisseria. Drouin et al. (1999) also used well-documented
examples of gene conversion to test various programs
for detecting gene conversion. Neither method is
optimal for conducting a power analysis in which thousands
to tens of thousands of data sets generated under various
alternative hypotheses are needed.
The purpose of this note is to increase awareness
that coalescent theory can be used to make direct,
statistically valid comparisons among recombination
detection methods. The coalescent with recombination
(Hudson 1983) is a description of ancestral relationships and
can be described as follows. As one traces the ancestry
of a sample into the past, three possible evolutionary
events may arise: a mutation may have occurred in one
of the ancestral lines, two individuals may have a
common ancestor, or the genetic material in question may
have arisen as a result of recombination (see fig. 1). In
this last case, the genetic material from a given
individual has two common ancestors, where one piece of the
DNA came from one ancestor and another piece came
from a different ancestor. As the histories of the
individuals in the sample are traced back into the past,
eventually a common ancestor of all of the individuals in the
sample will emerge.
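The backward-in-time process described above can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the text: time is measured in coalescent units, k lineages coalesce at rate k(k - 1)/2 and recombine at rate k·rho/2, where `rho` is a scaled recombination parameter introduced only for this example. The sketch tracks the number of ancestral lineages alone; it omits mutations and does not record which part of the sequence each lineage carries, so it stops at the common ancestor of the whole graph.

```python
import random

def simulate_arg_events(n, rho, rng):
    """Trace n sampled lineages back in time until one ancestor remains.

    Returns (elapsed_time, n_coalescences, n_recombinations).
    A coalescence merges two lineages into one; a recombination splits
    one lineage into two parental lineages.
    """
    k = n
    elapsed = 0.0
    n_coal = 0
    n_rec = 0
    while k > 1:
        coal_rate = k * (k - 1) / 2.0
        rec_rate = k * rho / 2.0
        total = coal_rate + rec_rate
        # Waiting time to the next event is exponential with the total rate.
        elapsed += rng.expovariate(total)
        if rng.random() < coal_rate / total:
            k -= 1
            n_coal += 1
        else:
            k += 1
            n_rec += 1
    return elapsed, n_coal, n_rec

rng = random.Random(42)
elapsed, n_coal, n_rec = simulate_arg_events(6, 1.0, rng)
print(elapsed, n_coal, n_rec)
```

Because every recombination adds a lineage that must later be removed by a coalescence, the number of coalescences always equals n - 1 plus the number of recombinations. With `rho = 0` the sketch reduces to the ordinary coalescent with exactly n - 1 coalescences.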
FIG. 1. Coalescent phylogeny with recombination. The network
on the left represents a phylogeny with both loci A and B: thin lines
have only A, dotted lines have only B, and thick lines have both and
result from recombination events. Trees on the right are for A and B
separately.
The program TREEVOLVE, version 1.32 (http://evolve.zoo.ox.ac.uk/),
by Grassly and Rambaut,
generates sequences according to the coalescent model of
evolution with recombination (Hudson 1983). This
method generates a network of relationships with
mutations, recombinations, and coalescences randomly
assigned according to parameters determined by the user.
A random sequence is then produced and evolved along
this network. TREEVOLVE was used to generate
sequences with and without recombination. The
substitution model used by TREEVOLVE was implemented
using the F84 option, but with parameter settings that
emulated the Kimura (1980) model: a
transition/transversion ratio of 2, equal nucleotide frequencies, equal rates
of substitution, and a Wright-Fisher model of evolution
with no population subdivision.
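For readers unfamiliar with the Kimura (1980) two-parameter model, the following Python sketch computes its substitution probabilities along a branch. It is an illustration only and does not reproduce TREEVOLVE's code. The parameter `kappa` (the transition rate divided by the rate of each transversion type) is a parameterization assumed for this example; how a reported transition/transversion ratio of 2 maps onto `kappa` depends on the program's definition of that ratio, which is not assumed here.

```python
import math

def k2p_probabilities(d, kappa):
    """Return (p_same, p_transition, p_transversion) under the Kimura
    two-parameter model with equal base frequencies.

    d     -- expected number of substitutions per site along the branch
    kappa -- transition rate divided by the rate of each transversion type
    p_transversion is the probability of ONE specific transversion;
    each base has two possible transversions.
    """
    beta_t = d / (kappa + 2.0)   # rate of each transversion type, times time
    alpha_t = kappa * beta_t     # transition rate, times time
    e1 = math.exp(-4.0 * beta_t)
    e2 = math.exp(-2.0 * (alpha_t + beta_t))
    p_transition = 0.25 + 0.25 * e1 - 0.5 * e2
    p_transversion = 0.25 - 0.25 * e1
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    return p_same, p_transition, p_transversion

p_same, p_ts, p_tv = k2p_probabilities(0.1, 4.0)
print(p_same, p_ts, p_tv)
```

Setting `kappa = 1` makes transitions and each transversion equally likely, which recovers the Jukes-Cantor model. For very long branches all four bases approach probability 1/4.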
Other parameter settings for our simulated data
were motivated by a recent paper by Cheung et al.
(1999), one of many demonstrating the importance of
detecting gene conversion. Cheung et al. (1999) show
that gene conversion has occurred at least once among
the ADH genes of Old World monkeys and at least once
among the ADH genes of humans. The total length of
their sequences was 1,125 bp, and the number of
sequences sampled was 6. RECOMBINE (Kuhner,
Yamato, and Felsenstein 200 (...truncated)