BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data.
Olivier Gascuel
GERAD, Ecole des HEC, Montreal, and Departement d’Informatique Fondamentale, LIRMM, Montpellier
We propose an improved version of the neighbor-joining (NJ) algorithm of Saitou and Nei. This new algorithm, BIONJ, follows the same agglomerative scheme as NJ, which consists of iteratively picking a pair of taxa, creating a new node which represents the cluster of these taxa, and reducing the distance matrix by replacing both taxa by this node. Moreover, BIONJ uses a simple first-order model of the variances and covariances of evolutionary distance estimates. This model is well adapted when these estimates are obtained from aligned sequences. At each step it permits the selection, from the class of admissible reductions, of the reduction which minimizes the variance of the new distance matrix. In this way, we obtain better estimates to choose the pair of taxa to be agglomerated during the next steps. Moreover, in comparison with NJ’s estimates, these estimates become better and better as the algorithm proceeds. BIONJ retains the good properties of NJ, especially its low run time. Computer simulations have been performed with 12-taxon model trees to determine BIONJ’s efficiency. When the substitution rates are low (maximum pairwise divergence ~0.1 substitutions per site) or when they are constant among lineages, BIONJ is only slightly better than NJ. When the substitution rates are higher and vary among lineages, BIONJ clearly has better topological accuracy. In the latter case, for the model trees and the conditions of evolution tested, the topological error reduction is on the average around 20%. With highly-varying-rate trees and with high substitution rates (maximum pairwise divergence = 1.0 substitutions per site), the error reduction may even rise above 50%, while the probability of finding the correct tree may be augmented by as much as 15%.

Key words: phylogeny, neighbor-joining, distance method, model of data, variances and covariances of distance estimates.
Address for correspondence and reprints before July 31, 1997: Olivier Gascuel, GERAD, Ecole des HEC, 3000 chemin de la Côte-Sainte-Catherine, Montreal, Quebec, Canada H3T 2A7. E-mail: .
Address as of July 31, 1997: Departement d’Informatique Fondamentale, LIRMM, 161 rue Ada, 34392, Montpellier, France. E-mail: .
Mol. Biol. Evol. 14(7):685-695. 1997
© 1997 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

Introduction
The neighbor-joining (NJ) algorithm of Saitou and Nei (1987) is one of the most popular methods for reconstructing phylogenetic trees from a matrix of pairwise evolutionary distances. This algorithm follows an agglomerative scheme which was first proposed in the context of mathematical psychology by Sattath and Tversky (1977). Agglomerative algorithms iteratively pick a pair of taxa, create a new node which represents the cluster of these taxa, and reduce the distance matrix by replacing both taxa by this node. This cycle is repeated until only three taxa remain. To agglomerate pairs of nodes, NJ follows the minimum-evolution (ME) principle, which was first suggested by Kidd and Sgaramella-Zonta (1971) and which consists of choosing the tree with the smallest sum of branch lengths. Rzhetsky and Nei (1993) have shown that this principle has a sound theoretical foundation when the lengths are obtained by the ordinary least-squares method and when an unbiased estimate of evolutionary distances is used. Under these assumptions, the true tree has the smallest expected length among all possible trees. However, this result describes an expected (or average) behavior, and it is not applicable to every particular data set. Moreover, due to its greedy, agglomerative approach, NJ does not usually find the ME tree, but only a short tree whose topology is generally similar to that of the ME tree (Saitou and Imanishi 1989). This does not preclude good performance, since for any particular data set, the true tree itself is usually close to but not identical with the ME tree, and numerous computer simulations (Saitou and Nei 1987; Nei 1991; Charleston, Hendy, and Penny 1994; Kuhner and Felsenstein 1994) have shown the high relative efficiency of the NJ method in recovering the true topology. Following these authors, NJ seems to be one of the very best distance methods. It is more reliable than the maximum-parsimony approaches, which are sometimes asymptotically inconsistent, and it is just slightly weaker than the maximum-likelihood methods, especially when the molecular clock hypothesis is clearly violated, probably because it does not take adequate account of the model of sequence evolution. Moreover, the NJ algorithm, as formulated by Studier and Keppler (1988), is efficient from a computational point of view and has an O(n^3) time complexity, where n is the number of taxa. Also, theoretical studies (Atteson 1996) have shown that NJ is in some sense as efficient as possible.
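The agglomerative cycle described above (pick a pair, create a node, reduce the matrix, stop at three taxa) can be illustrated with a minimal Python sketch of standard NJ. It assumes the commonly stated Studier and Keppler form of the selection criterion, Q_ij = (n - 2) d_ij - r_i - r_j, where r_i is the sum of distances from i to all other nodes, together with the usual NJ reduction d_uk = (d_ik + d_jk - d_ij) / 2. The function name `neighbor_joining`, the `node0`, `node1`, ... labels, and the edge-list output are hypothetical choices made for this illustration; the sketch implements plain NJ, not BIONJ, and is not the author's code.

```python
from itertools import combinations


def neighbor_joining(dist, names):
    """Plain NJ sketch: return a list of (node, parent, branch_length) edges.

    dist  -- symmetric matrix (list of lists) of pairwise distances
    names -- taxon labels, in the same order as the rows of dist (n >= 3)
    """
    # Distances stored as a dict of dicts so that new nodes can be added.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    next_id = 0

    # Agglomerate until only three nodes remain.
    while len(nodes) > 3:
        n = len(nodes)
        # r[a]: sum of distances from a to every other current node.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}

        # Assumed Studier-Keppler criterion: minimize (n-2)*d_ij - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])

        # New internal node u and the lengths of its two pendant branches.
        u = f"node{next_id}"
        next_id += 1
        d_iu = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        d_ju = d[i][j] - d_iu
        edges += [(i, u, d_iu), (j, u, d_ju)]

        # Reduce the matrix: replace i and j by u (standard NJ reduction).
        d[u] = {u: 0.0}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    # Connect the last three nodes to a central node.
    a, b, c = nodes
    center = f"node{next_id}"
    edges.append((a, center, 0.5 * (d[a][b] + d[a][c] - d[b][c])))
    edges.append((b, center, 0.5 * (d[a][b] + d[b][c] - d[a][c])))
    edges.append((c, center, 0.5 * (d[a][c] + d[b][c] - d[a][b])))
    return edges


# Example on an additive 4-taxon matrix generated from the tree
# ((A:1, B:2):3, C:4, D:5); the values are made up for illustration.
example_names = ["A", "B", "C", "D"]
example_dist = [
    [0, 3, 8, 9],
    [3, 0, 9, 10],
    [8, 9, 0, 9],
    [9, 10, 9, 0],
]
example_edges = neighbor_joining(example_dist, example_names)
```

Each pass of the loop costs O(n^2) for the row sums and the pair search, and there are n - 3 passes, which is consistent with the O(n^3) complexity quoted above.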
Several attempts have been made to improve the NJ algorithm by designing methods able to find trees very close to or identical with the ME tree. Saitou and Imanishi (1989) proposed an exhaustive search method which applies when the number of taxa is small (n < 10). Rzhetsky and Nei (1993) designed various strategies to search for the ME tree in the neighborhood of the NJ tree by conducting local rearrangements. These authors also suggested that alternative topologies could be generated using a bootstrap procedure (Rzhetsky and Nei 1994). Finally, Kumar (1996) designed efficient heuristics for searching the tree space in a more or less exhaustive manner. These methods have the ability to produce a set of short trees, which provide more information than the single NJ tree. Moreover, they usually find trees shorter than the NJ tree. But, unfortunately, computer simulations (Saitou and Imanishi 1989; Kumar 1996) indicate that the ability to recover the true topology is not increased and that NJ may hardly be outstripped in this way.
This paper proposes a different approach. Instead of trying to find trees shorter than NJ trees, we reconsider the basic principle of the NJ algorithm. We show that some mathematical formulae employed by NJ may be improved by taking into account the features of biological data. This new version, which we call BIONJ, is basically intended to deal with evolutionary distances obtained from aligned sequences. In what follows, we first describe this new algorithm, then provide computer simulations to demonstrate its efficiency.
The BIONJ Algorithm
Notation and Background
In what follows, we use the simplified expression of NJ from Studier and Keppler (1988), equivalent to the original (Gascuel 1994). NJ uses an agglomerative approach. At each step, it has a distance matrix $(\delta_{ij})$
where i and j are taxa, or clusters of original taxa agglomerated during the previous ste (...truncated)