The Power to Detect Recombination Using the Coalescent

Full-text PDF: https://mbe.oxfordjournals.org/content/18/7/1421.full.pdf

Mol. Biol. Evol. (ISSN 0737-4038)

Celeste J. Brown, Ethan C. Garner, A. Keith Dunker, and Paul Joyce

School of Molecular Biosciences, Washington State University, Pullman, Washington 99164-4660; Department of Mathematics, University of Idaho; Division of Statistics, University of Idaho

Present address (Ethan C. Garner): Department of Biochemistry and Biophysics, University of California at San Francisco

Abbreviation: LRT, likelihood ratio test.

There is a wide variety of models for estimating the phylogenetic relationships among amino acid and nucleotide sequences sampled from organisms at the population, species, and kingdom levels (reviewed by Swofford et al. 1996). One of the assumptions of these models is that recombination has not occurred in the history of the sampled sequences. When recombination is present, the relationships may be illustrated by a bifurcating graph (fig. 1), which is a network of relationships between parts of all sequences (Griffiths and Marjoram 1996). Most commonly used phylogeny methods do not construct such a network, and other measures, such as construction of separate phylogenies, must be taken to find relationships using these methods. Recombination may also be a problem for tandemly repeated genes, where intragenic recombination, also known as gene conversion, may occur at a substantial rate (Wiuf 2000).

Before investigators use techniques that assume there is no recombination, they must be convinced that recombination is not present. Even low levels of recombination may have profound effects on phylogeny reconstruction and on the conclusions one draws, so it is important to use a method that has a good chance of detecting low levels of recombination. Statistical power analysis can be used to compare candidate methods and identify the best one.
Key words: recombination; statistical power; gene conversion; hypothesis testing; likelihood ratio

Once a researcher is convinced that a method for detecting recombination has substantial power, that method may be used with confidence. Our purpose is to show that rigorous statistical methods exist for evaluating programs that detect recombination.

Power is the probability that a statistical test will reject the null hypothesis. This probability is calculated using a statistical model that includes both the null and alternative hypotheses. Because the calculation depends on the parameters of the model, power is not a single probability but a series of probabilities, one for each parameter value. For tests of recombination, the parameter of interest is the recombination rate, r (recombinations per mutation per site per generation), and the null hypothesis is that there is no recombination in the history of the sample (r = 0). The power of a test depends on many factors: the level of significance (the type I error rate), the sample size, and how far the alternative hypothesis is from the null hypothesis. The α-level, or type I error rate, is the probability that the null hypothesis will be rejected when it is true. This level should not be too high, since looking for recombination that does not exist can be an aggravating task. A power analysis uses data simulated under the null hypothesis and under various alternative hypotheses to determine how often the method rejects the null hypothesis at a specific type I error rate. The power should increase toward 100% as r increases; the method whose power increases the fastest is the most powerful.

Other researchers have investigated the power of various methods for detecting gene conversion. PLATO is a program that detects spatial phylogenetic heterogeneity and can be used to detect recombination (Grassly and Holmes 1997).
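The power analysis described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure or software used in this study: the function names (`critical_value`, `estimate_power`, `toy_statistic`) are hypothetical, and `toy_statistic` is a deliberately simple stand-in (a normal variate whose mean shifts with r) for the real step of simulating a data set at recombination rate r and computing a detection method's test statistic. The rejection cutoff is calibrated from data simulated under the null hypothesis (r = 0), and power is then estimated as the rejection frequency at each alternative value of r.

```python
import random


def critical_value(null_stats, alpha):
    """Return a cutoff c taken from the null statistics such that the
    fraction of null statistics strictly greater than c is at most alpha."""
    ordered = sorted(null_stats)
    n = len(ordered)
    allowed = int(alpha * n)          # number of null values allowed above c
    k = max(0, n - 1 - allowed)
    return ordered[k]


def estimate_power(simulate_statistic, r_values, alpha=0.05, n_reps=2000, seed=1):
    """Monte Carlo power curve.

    simulate_statistic(r, rng) must simulate one data set at recombination
    rate r and return the test statistic (larger = more evidence for r > 0).
    """
    rng = random.Random(seed)
    # Calibrate the rejection cutoff under the null hypothesis, r = 0.
    null_stats = [simulate_statistic(0.0, rng) for _ in range(n_reps)]
    cutoff = critical_value(null_stats, alpha)
    # Power at each r is the fraction of fresh replicates that are rejected.
    power = {}
    for r in r_values:
        rejections = sum(simulate_statistic(r, rng) > cutoff for _ in range(n_reps))
        power[r] = rejections / n_reps
    return cutoff, power


def toy_statistic(r, rng):
    """ASSUMPTION: placeholder statistic, not a recombination test.
    A real analysis would simulate sequences at rate r and run the method."""
    return rng.gauss(r, 1.0)


if __name__ == "__main__":
    cutoff, power = estimate_power(toy_statistic, [0.0, 1.0, 3.0])
    for r, p in power.items():
        print(f"r = {r}: estimated power = {p:.3f}")
```

At r = 0 the estimated rejection rate should sit near the chosen α-level, which is a useful check that the cutoff is calibrated; comparing two methods amounts to running the same loop with each method's statistic and seeing whose curve rises faster.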
Grassly and Holmes (1997) tested the power of PLATO to detect recombination by simulating phylogenies with different topologies under a single coalescent model and then manually recombining sequences between phylogenies. They also compared the results from PLATO with a well-documented example of recombination among the argF genes of Neisseria. Drouin et al. (1999) also used well-documented examples of gene conversion to test various programs for detecting gene conversion. Neither approach is optimal for conducting a power analysis, which requires thousands to tens of thousands of data sets generated under various alternative hypotheses. The purpose of this note is to increase awareness that coalescent theory can be used to make direct, statistically valid comparisons among recombination detection methods.

The coalescent with recombination (Hudson 1983) is a description of ancestral relationships and can be described as follows. As one traces the ancestry of a sample into the past, three possible evolutionary events may arise: a mutation may have occurred in one of the ancestral lines, two individuals may have a common ancestor, or the genetic material in question may have arisen as a result of recombination (see fig. 1). In this last case, the genetic material from a given individual has two ancestors: one piece of the DNA came from one ancestor, and another piece came from a different ancestor. As the histories of the individuals in the sample are traced back into the past, eventually a common ancestor of all of the individuals in the sample will emerge.

The program TREEVOLVE, version 1.32 (http://evolve.zoo.ox.ac.uk/), by Grassly and Rambaut, generates sequences according to the coalescent model of evolution with recombination (Hudson 1983). This method generates a network of relationships with mutations, recombinations, and coalescences randomly assigned according to parameters determined by the user. A random sequence is then produced and evolved along this network.

FIG. 1. Coalescent phylogeny with recombination. The network on the left represents a phylogeny with both loci A and B: thin lines have only A, dotted lines have only B, and thick lines have both and result from recombination events. Trees on the right are for A and B separately.

TREEVOLVE was used to generate sequences with and without recombination. The substitution model used by TREEVOLVE was implemented using the F84 option, but with parameter settings that emulated the Kimura (1980) model: a transition/transversion ratio of 2, equal nucleotide frequencies, and equal rates of substitution. Sequences were evolved under a Wright-Fisher model with no population subdivision. Other parameter settings for our simulated data were motivated by a recent paper by Cheung et al. (1999), one of many demonstrating the importance of detecting gene conversion. Cheung et al. (1999) show that gene conversion has occurred at least once among the ADH genes of Old World monkeys and at least once among the ADH genes of humans. The total length of their sequences was 1,125 bp, and the number of sequences sampled was 6. RECOMBINE (Kuhner, Yamato, and Felsenstein 200 (...truncated)