On the Optimization Principle in Phylogenetic Analysis and the Minimum-Evolution Criterion
Olivier Gascuel
Département Informatique Fondamentale et Applications, LIRMM, Montpellier, France
This paper discusses the optimization principle in phylogenetic analysis, in the case of distance data. We argue that the use of this principle cannot be called into question, except for computing time reasons. We show that the minimum-evolution criterion is not perfectly suited for distance data estimated from sequences, and we present another approach, implemented in the BIONJ algorithm, which allows the data features to be taken into account, while being less demanding in computing time. Simulations show that BIONJ significantly outperforms NJ.
Recently, Nei, Kumar, and Takahashi (1998)
published a paper entitled "The optimization principle in
phylogenetic analysis tends to give incorrect topologies
when the number of nucleotides or amino acids used is
small." They simulated the evolution of sequences
along a model tree and compared, for each replicate data
set, the observed performance of this (true) tree to that
of trees inferred by three usual phylogenetic
reconstruction approaches, based, respectively, on
maximum-parsimony (MP), minimum-evolution (ME), and
maximum-likelihood (ML) criteria. The observed
performance of a tree refers to its value for a given criterion
(MP, ME, or ML) when considering the data set at hand.
The optimization of observed performance is the basis
of most phylogenetic reconstruction approaches. For
example, in parsimony, we search (among all possible
trees) for the tree that requires the minimum number of
mutational changes to explain the evolutionary change
of the studied sequences. Note that the observed
performances of trees, and hence the inferred tree, usually
vary from one data set to another, especially with short
sequences. Moreover, since the number of possible trees
is extremely large, even for a moderate number of taxa,
it is usually not possible to use an exact algorithm, and
we rely on approximate algorithms that find
near-optimal trees within a reasonable amount of computation
time, e.g., DNAPARS (Felsenstein 1993) in the case of
the MP criterion, or NJ (Saitou and Nei 1987) in the
case of the ME criterion.
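To give a sense of scale for the remark that the number of possible trees is extremely large, the following sketch counts unrooted, fully resolved (binary) tree topologies on n labeled taxa. It assumes the standard textbook formula (2n - 5)!! for n >= 3, which is not stated in this paper; the function name is ours.

```python
def count_unrooted_binary_trees(n_taxa: int) -> int:
    """Number of unrooted binary tree topologies on n_taxa labeled leaves.

    Uses the double factorial (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5),
    a standard result assumed here for illustration.
    """
    if n_taxa < 3:
        raise ValueError("the formula is defined for at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the factor 1 is implicit).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 12):
        print(n, count_unrooted_binary_trees(n))
```

With 12 taxa the count already exceeds 650 million, which is consistent with exhaustive search being impractical in that setting.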
The results presented in the paper of Nei, Kumar,
and Takahashi (1998) can be summarized as follows.
When using an exact algorithm, the observed
performance of the true tree is always worse than or identical
to the performance of the optimal tree. This is not
surprising, since the true tree is one tree among all possible
trees, and therefore it cannot be better than the best tree.
Moreover, with short sequences, reconstruction
algorithms often fail to discover the true tree, and
consequently the true tree appears to be worse than the
inferred tree. With approximate algorithms, the situation
is not much different. Near-optimal inferred trees are
rarely worse than the true tree, and the topological
accuracy of these (fast) algorithms is close to that of the
exact (time-consuming) algorithms. In light of these
results, Nei, Kumar, and Takahashi (1998) suggest that
more attention should be given to testing the statistical
reliability of inferred trees, e.g., using the bootstrap
method (Felsenstein 1985), than to finding optimal trees
with excessive computational effort.
Similar results were previously described in
Gascuel (1997a, 1997b) concerning the ME criterion (see
also Kumar 1996, figs. 6 and 7). We first summarize
these results and then show that, in the case of the ME
criterion, considerations on the optimization principle
cannot fully explain the observations. We conclude that
the ME criterion is not perfectly suited for evolutionary
distance data obtained from sequences, and also that the
global optimization principle is not the sole way to
conceive, describe, and analyze phylogenetic reconstruction
algorithms. We then present another approach,
implemented in the BIONJ algorithm (Gascuel 1997a), which,
like NJ, follows the agglomerative scheme and uses a
model of distance data estimated from BIOlogical
sequences.
Simulation Schemes and NJ Algorithm Results
In Gascuel (1997a), we simulated six 12-taxon
model trees, some complying with the molecular-clock
hypothesis and the others not. The Kimura
two-parameter model was used with a transition/transversion ratio
of 2. Sequence lengths were 300 or 600 sites, and four
different evolution conditions were considered,
corresponding to low, medium, high, and high/low per-site
substitution rates. Evolutionary distances were
computed using the standard Kimura (1980) estimate, except
for the high/low per-site condition, for which we also
used the two-parameter gamma estimate ($a = 1$) of Jin
and Nei (1990). Five hundred replications were
performed for each condition, involving 15,000 data sets
per sequence length. The mean results are displayed in
table 1.
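As an illustration of the distance computation step described above, here is a minimal sketch of the Kimura (1980) two-parameter estimate for a pair of aligned sequences. It assumes the usual closed form d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions, and it assumes sequences over A, C, G, T without gaps. The function name is ours, and this is not the code used for the simulations.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences.

    Assumption: both sequences contain only A, C, G, T (no gaps or
    ambiguity codes) and have the same nonzero length.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = 0
    transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y:
            continue
        same_class = (x in PURINES and y in PURINES) or (
            x in PYRIMIDINES and y in PYRIMIDINES
        )
        if same_class:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    p = transitions / len(seq_a)
    q = transversions / len(seq_a)
    arg1 = 1.0 - 2.0 * p - q
    arg2 = 1.0 - 2.0 * q
    if arg1 <= 0.0 or arg2 <= 0.0:
        # Saturated pair: the estimate is undefined.
        raise ValueError("distance undefined for these sequences")
    return -0.5 * math.log(arg1) - 0.25 * math.log(arg2)
```

With short sequences, P and Q are noisy estimates, which is one source of the variability between replicate data sets discussed in the text.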
[Table 1. Mean Results. Rows: sequence data; i.i.d. normal data. Headers: NJ + FTS; ME criterion; topological distance. Table values were not recoverable.]
With a sequence length of 300, we observed that
NJ trees were better in the ME sense (i.e., shorter) than
the true tree in 61% of the cases, and longer in 11%.
With 600 sites, we found 26% and 8%, respectively.
This indicates that the optimal ME tree (not computable
with 12 taxa) missed the true tree in at least 61% of the
cases with 300 sites, and in at least 26% with 600 sites.
Moreover, this demonstrates that the ME tree cannot
markedly improve the NJ tree (in terms of probability
of finding the true tree): at most, 11% improvement with
300 sites and 8% improvement with 600 sites. More
generally, this indicates that NJ selects a tree in a region
where all trees are more or less equivalent and not
discernible from the true tree using the ME criterion
(Kumar 1996, fig. 6). This largely explains the
observation made by several authors (Saitou and Imanishi
1989; Kumar 1996) that the ME tree is not more
accurate than the NJ tree.
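The bounds in this paragraph follow from simple bookkeeping, sketched below. If the NJ tree is shorter than the true tree, the true tree cannot be the optimal ME tree, so the ME tree misses it. The ME tree can do better than NJ only when the NJ tree is longer than the true tree. The sketch makes one assumption of ours that the text leaves implicit: when the NJ tree and the true tree have equal length, NJ has recovered the true tree. The function and key names are hypothetical.

```python
def me_versus_nj_bounds(nj_shorter_pct: float, nj_longer_pct: float) -> dict:
    """Bounds on ME-tree accuracy implied by NJ-versus-true-tree lengths.

    nj_shorter_pct: % of data sets where the NJ tree is shorter than the true tree.
    nj_longer_pct:  % of data sets where the NJ tree is longer than the true tree.
    Assumption: equal length means NJ recovered the true tree.
    """
    equal_pct = 100 - nj_shorter_pct - nj_longer_pct
    return {
        # A shorter tree exists, so the true tree is not ME-optimal.
        "me_misses_true_tree_at_least": nj_shorter_pct,
        # ME can be correct only in the remaining cases.
        "me_finds_true_tree_at_most": 100 - nj_shorter_pct,
        # Cases where NJ is already as short as the true tree.
        "nj_as_short_as_true_tree": equal_pct,
        # ME can beat NJ only where NJ is longer than the true tree.
        "me_gain_over_nj_at_most": nj_longer_pct,
    }


if __name__ == "__main__":
    print(me_versus_nj_bounds(61, 11))  # 300 sites, percentages from the text
    print(me_versus_nj_bounds(26, 8))   # 600 sites, percentages from the text
```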
In Gascuel (1997b), we simulated another type of
data. Let $D = (d_{ij})$ be a tree distance (i.e., a distance
represented by a tree with positive branch lengths),
where $d_{ij}$ is the distance between taxa $i$ and $j$. A noise
$\varepsilon_{ij}$ was added to every $d_{ij}$ to obtain the noisy distance
$(\delta_{ij}) = (d_{ij} + \varepsilon_{ij})$. The noises $\varepsilon_{ij}$ were independently and
identically distributed (i.i.d.) and normal with variance
$\sigma^2$ and a null expectation. Such i.i.d. normal data are
encountered when $\delta_{ij}$ estimates are the result of real
observations with measurement errors, which is close, for
example, to DNA-DNA hybridization data (Felsenstein
1987). We considered high and low noise levels ($\sigma = 0.6$ or
$\sigma = 0.1$) and simulated three 12-taxon and three
24-taxon model trees. Five hundred replications were
performed for each condition, which involves 3,000 data
sets per noise level. Then, we applied reconstruction
algorithms to the noisy matrices in order to recover the
true tree corresponding to the original tree distance. The
mean results are displayed in table 1.
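The noise model just described can be sketched as follows: each off-diagonal entry of a tree distance matrix receives an independent normal perturbation with zero mean and standard deviation sigma, and the result is kept symmetric. The four-taxon matrix, the function name, and the choice to leave negative noisy values untouched are assumptions made for illustration; the text does not say how such values were handled.

```python
import random


def add_iid_normal_noise(tree_dist, sigma, rng):
    """Return a noisy copy of a symmetric tree distance matrix.

    One independent N(0, sigma^2) draw is added per unordered pair (i, j),
    so the output stays symmetric with a zero diagonal.
    """
    n = len(tree_dist)
    noisy = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = tree_dist[i][j] + rng.gauss(0.0, sigma)
            noisy[i][j] = value
            noisy[j][i] = value
    return noisy


# Hypothetical tree distance for the tree ((A,B),(C,D)) with all
# branch lengths equal to 1: d(A,B) = d(C,D) = 2, other pairs = 3.
EXAMPLE_TREE_DISTANCE = [
    [0.0, 2.0, 3.0, 3.0],
    [2.0, 0.0, 3.0, 3.0],
    [3.0, 3.0, 0.0, 2.0],
    [3.0, 3.0, 2.0, 0.0],
]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run can be repeated
    for row in add_iid_normal_noise(EXAMPLE_TREE_DISTANCE, 0.1, rng):
        print([round(x, 3) for x in row])
```

A reconstruction algorithm such as NJ would then be applied to the noisy matrix, and the inferred tree compared with the tree underlying the original distance.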
The situation was basically the same as that for
(...truncated)