On the Optimization Principle in Phylogenetic Analysis and the Minimum-Evolution Criterion

https://mbe.oxfordjournals.org/content/17/3/401.full.pdf

Olivier Gascuel

Département Informatique Fondamentale et Applications, LIRMM, Montpellier, France

This paper discusses the optimization principle in phylogenetic analysis, in the case of distance data. We argue that the use of this principle cannot be called into question, except for computing time reasons. We show that the minimum-evolution criterion is not perfectly suited for distance data estimated from sequences, and we present another approach, implemented in the BIONJ algorithm, which allows the data features to be taken into account, while being less demanding in computing time. Simulations show that BIONJ significantly outperforms NJ.

Recently, Nei, Kumar, and Takahashi (1998) published a paper entitled "The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small." They simulated the evolution of sequences along a model tree and compared, for each replicate data set, the observed performance of this (true) tree with that of trees inferred by three usual phylogenetic reconstruction approaches, based, respectively, on the maximum-parsimony (MP), minimum-evolution (ME), and maximum-likelihood (ML) criteria. The observed performance of a tree refers to its value for a given criterion (MP, ME, or ML) when considering the data set at hand.

The optimization of observed performance is the basis of most phylogenetic reconstruction approaches. For example, in parsimony, we search (among all possible trees) for the tree that requires the minimum number of mutational changes to explain the evolutionary change of the studied sequences. Note that the observed performances of trees, and hence the inferred tree, usually vary from one data set to another, especially with short sequences.
Moreover, since the number of possible trees is extremely large, even for a moderate number of taxa, it is usually not possible to use an exact algorithm, and we rely on approximate algorithms that find near-optimal trees within a reasonable amount of computation time, e.g., DNAPARS (Felsenstein 1993) in the case of the MP criterion, or NJ (Saitou and Nei 1987) in the case of the ME criterion.

The results presented in the paper of Nei, Kumar, and Takahashi (1998) can be summarized as follows. When using an exact algorithm, the observed performance of the true tree is always worse than or identical to the performance of the optimal tree. This is not surprising, since the true tree is one tree among all possible trees, and therefore it cannot be better than the best tree. Moreover, with short sequences, reconstruction algorithms often fail to discover the true tree, and consequently the true tree appears to be worse than the inferred tree. With approximate algorithms, the situation is not much different. Near-optimal inferred trees are rarely worse than the true tree, and the topological accuracy of these (fast) algorithms is close to that of the exact (time-consuming) algorithms. In light of these results, Nei, Kumar, and Takahashi (1998) suggest that more attention should be given to testing the statistical reliability of inferred trees, e.g., using the bootstrap method (Felsenstein 1985), than to finding optimal trees with excessive computational effort.

Similar results were previously described in Gascuel (1997a, 1997b) concerning the ME criterion (see also Kumar 1996, figs. 6 and 7). We first summarize these results and then show that, in the case of the ME criterion, considerations on the optimization principle cannot fully explain the observations.
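To give a sense of how quickly the number of possible trees grows, the following minimal sketch counts unrooted, fully bifurcating tree topologies with the standard formula $(2n - 5)!!$ for $n \geq 3$ labeled taxa. The formula is a well-known combinatorial result that the text above does not itself state, and the function name `num_unrooted_binary_trees` is purely illustrative.

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating topologies on n_taxa labeled leaves.

    Uses the standard double-factorial formula (2n - 5)!! = 3 * 5 * ... * (2n - 5),
    which is an assumption brought in for illustration (not taken from the paper).
    """
    if n_taxa < 3:
        raise ValueError("the formula applies to three or more taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n_taxa == 3).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 8, 12, 24):
        print(n, num_unrooted_binary_trees(n))
```

For the 12-taxon trees used in the simulations discussed in this paper, the count is 654,729,075 topologies, which illustrates why exhaustive search is generally avoided.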
We conclude that the ME criterion is not perfectly suited for evolutionary distance data obtained from sequences, and also that the global optimization principle is not the sole way to conceive, describe, and analyze phylogenetic reconstruction algorithms. We then present another approach, implemented in the BIONJ algorithm (Gascuel 1997a), which follows the agglomerative scheme, as NJ, and uses a model of distance data estimated from BIOlogical sequences.

Simulation Schemes and NJ Algorithm Results

In Gascuel (1997a), we simulated six 12-taxon model trees, some complying with the molecular-clock hypothesis and the others not. The Kimura two-parameter model was used with a transition/transversion ratio of 2. Sequence lengths were 300 or 600 sites, and four different evolution conditions were considered, corresponding to low, medium, high, and high/low per-site substitution rates. Evolutionary distances were computed using the standard Kimura (1980) estimate, except for the high/low per-site condition, for which we also used the two-parameter gamma estimate ($a = 1$) of Jin and Nei (1990). Five hundred replications were performed for each condition, involving 15,000 data sets per sequence length. The mean results are displayed in table 1.

[Table 1. Mean results. Rows: sequence data, i.i.d. normal data. Column headings: NJ + FTS, ME criterion, topological distance.]

With a sequence length of 300, we observed that NJ trees were better in the ME sense (i.e., shorter) than the true tree in 61% of the cases, and longer in 11%. With 600 sites, we found 26% and 8%, respectively. This indicates that the optimal ME tree (not computable with 12 taxa) missed the true tree in at least 61% of the cases with 300 sites, and in at least 26% with 600 sites. Moreover, this demonstrates that the ME tree cannot markedly improve the NJ tree (in terms of probability of finding the true tree): at most, 11% improvement with 300 sites and 8% improvement with 600 sites.
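As an illustration of the standard Kimura (1980) distance estimate mentioned above, the sketch below computes $d = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$ for two aligned DNA sequences, where $P$ and $Q$ are the observed proportions of transitions and transversions. The function name `k2p_distance` and the choice to skip sites containing gaps or ambiguity codes are assumptions made for this example; they are not taken from the simulation code used in the paper.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences.

    Sites where either sequence has a non-ACGT character are skipped
    (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in NUCLEOTIDES or y not in NUCLEOTIDES:
            continue
        compared += 1
        if x == y:
            continue
        pair = {x, y}
        # A change within purines or within pyrimidines is a transition.
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term_a = 1.0 - 2.0 * p - q
    term_b = 1.0 - 2.0 * q
    if term_a <= 0.0 or term_b <= 0.0:
        # Happens with highly diverged sequences; the estimate is undefined.
        raise ValueError("distance undefined: sequences too divergent")
    return -0.5 * math.log(term_a) - 0.25 * math.log(term_b)


if __name__ == "__main__":
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))  # one transition in ten sites
```

With short sequences, $P$ and $Q$ are noisy, so the resulting distance estimates vary noticeably from one replicate to another, which is the setting the simulations in this section explore.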
More generally, this indicates that NJ selects a tree in a region where all trees are more or less equivalent and not discernible from the true tree using the ME criterion (Kumar 1996, fig. 6). This largely explains the observation made by several authors (Saitou and Imanishi 1989; Kumar 1996) that the ME tree is not more accurate than the NJ tree.

In Gascuel (1997b), we simulated another type of data. Let $D = (d_{ij})$ be a tree distance (i.e., a distance represented by a tree with positive branch lengths), where $d_{ij}$ is the distance between taxa $i$ and $j$. A noise $\varepsilon_{ij}$ was added to every $d_{ij}$ to obtain the noisy distance $(\delta_{ij}) = (d_{ij} + \varepsilon_{ij})$. The noises $\varepsilon_{ij}$ were independently and identically distributed (i.i.d.) and normal with variance $\sigma^2$ and a null expectation. Such i.i.d. normal data are encountered when the distance estimates $\delta_{ij}$ are the result of real observations with measurement errors, which is close, for example, to DNA-DNA hybridization data (Felsenstein 1987). We considered high and low noise levels ($\sigma = 0.6$ or $\sigma = 0.1$) and simulated three 12-taxon and three 24-taxon model trees. Five hundred replications were performed for each condition, which involves 3,000 data sets per noise level. Then, we applied reconstruction algorithms to the noisy matrices in order to recover the true tree corresponding to the original tree distance. The mean results are displayed in table 1. The situation was basically the same as that for (...truncated)