BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data.


Olivier Gascuel
GERAD, École des HEC, Montreal, and Département d’Informatique Fondamentale, LIRMM, Montpellier

We propose an improved version of the neighbor-joining (NJ) algorithm of Saitou and Nei. This new algorithm, BIONJ, follows the same agglomerative scheme as NJ, which consists of iteratively picking a pair of taxa, creating a new node which represents the cluster of these taxa, and reducing the distance matrix by replacing both taxa by this node. Moreover, BIONJ uses a simple first-order model of the variances and covariances of evolutionary distance estimates. This model is well adapted when these estimates are obtained from aligned sequences. At each step it permits the selection, from the class of admissible reductions, of the reduction which minimizes the variance of the new distance matrix. In this way, we obtain better estimates to choose the pair of taxa to be agglomerated during the next steps. Moreover, in comparison with NJ’s estimates, these estimates become better and better as the algorithm proceeds. BIONJ retains the good properties of NJ, especially its low run time. Computer simulations have been performed with 12-taxon model trees to determine BIONJ’s efficiency. When the substitution rates are low (maximum pairwise divergence ≈ 0.1 substitutions per site) or when they are constant among lineages, BIONJ is only slightly better than NJ. When the substitution rates are higher and vary among lineages, BIONJ clearly has better topological accuracy. In the latter case, for the model trees and the conditions of evolution tested, the topological error reduction is on the average around 20%. With highly-varying-rate trees and with high substitution rates (maximum pairwise divergence ≈ 1.0 substitutions per site), the error reduction may even rise above 50%, while the probability of finding the correct tree may be augmented by as much as 15%.

Key words: phylogeny, neighbor-joining, distance method, model of data, variances and covariances of distance estimates.

Address for correspondence and reprints before July 31, 1997: Olivier Gascuel, GERAD, École des HEC, 3000, chemin de la Côte-Sainte-Catherine, Montreal, Quebec, Canada H3T 2A7. E-mail: .
Address as of July 31, 1997: Département d’Informatique Fondamentale, LIRMM, 161 rue Ada, 34392, Montpellier, France. E-mail: .

Mol. Biol. Evol. 14(7):685-695. 1997
© 1997 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

Introduction

The neighbor-joining (NJ) algorithm of Saitou and Nei (1987) is one of the most popular methods for reconstructing phylogenetic trees from a matrix of pairwise evolutionary distances. This algorithm follows an agglomerative scheme which was first proposed in the context of mathematical psychology by Sattath and Tversky (1977). Agglomerative algorithms iteratively pick a pair of taxa, create a new node which represents the cluster of these taxa, and reduce the distance matrix by replacing both taxa by this node. This cycle is repeated until only three taxa remain.

To agglomerate pairs of nodes, NJ follows the minimum-evolution (ME) principle, which was first suggested by Kidd and Sgaramella-Zonta (1971) and which consists of choosing the tree with the smallest sum of branch lengths. Rzhetsky and Nei (1993) have shown that this principle has a sound theoretical foundation when the lengths are obtained by the ordinary least-squares method and when an unbiased estimate of evolutionary distances is used. Under these assumptions, the true tree has the smallest expected length among all possible trees. However, this result describes an expected (or average) behavior, and it is not applicable to every particular data set. Moreover, due to its greedy, agglomerative approach, NJ does not usually find the ME tree, but only a short tree whose topology is generally similar to that of the ME tree (Saitou and Imanishi 1989). This does not preclude good performance, since for any particular data set, the true tree itself is usually close to but not identical with the ME tree, and numerous computer simulations (Saitou and Nei 1987; Nei 1991; Charleston, Hendy, and Penny 1994; Kuhner and Felsenstein 1994) have shown the high relative efficiency of the NJ method in recovering the true topology.

Following these authors, NJ seems to be one of the very best distance methods. It is more reliable than the maximum-parsimony approaches, which are sometimes asymptotically inconsistent, and it is just slightly weaker than the maximum-likelihood methods, especially when the molecular clock hypothesis is clearly violated, probably because it does not take adequate account of the model of sequence evolution. Moreover, the NJ algorithm, as formulated by Studier and Keppler (1988), is efficient from a computational point of view and has an O(n^3) time complexity, where n is the number of taxa. Also, theoretical studies (Atteson 1996) have shown that NJ is in some sense as efficient as possible.

Several attempts have been made to improve the NJ algorithm by designing methods able to find trees very close to or identical with the ME tree. Saitou and Imanishi (1989) proposed an exhaustive search method which applies when the number of taxa is small (n < 10). Rzhetsky and Nei (1993) designed various strategies to search for the ME tree in the neighborhood of the NJ tree by conducting local rearrangements. These authors also suggested that alternative topologies could be generated using a bootstrap procedure (Rzhetsky and Nei 1994). Finally, Kumar (1996) designed efficient heuristics for searching the tree space in a more or less exhaustive manner.
These methods have the ability to produce a set of short trees, which provide more information than the single NJ tree. Moreover, they usually find trees shorter than the NJ tree. But, unfortunately, computer simulations (Saitou and Imanishi 1989; Kumar 1996) indicate that the ability to recover the true topology is not increased and that NJ may hardly be outstripped in this way.

This paper proposes a different approach. Instead of trying to find trees shorter than NJ trees, we reconsider the basic principle of the NJ algorithm. We show that some mathematical formulae employed by NJ may be improved by taking into account the features of biological data. This new version, which we call BIONJ, is basically intended to deal with evolutionary distances obtained from aligned sequences. In what follows, we first describe this new algorithm, then provide computer simulations to demonstrate its efficiency.

The BIONJ Algorithm

Notation and Background

In what follows, we use the simplified expression of NJ from Studier and Keppler (1988), equivalent to the original (Gascuel 1994). NJ uses an agglomerative approach. At each step, it has a distance matrix (δ_ij) where i and j are taxa, or clusters of original taxa agglomerated during the previous ste (...truncated)
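
The agglomerative cycle described in the Introduction (pick a pair, create a node, reduce the matrix, stop at three taxa) can be illustrated with a minimal Python sketch of plain NJ. This is not BIONJ and not the author's code: it assumes the standard Studier and Keppler pair-selection criterion and the usual equal-weight (1/2, 1/2) reduction, which is the step BIONJ replaces with a variance-minimizing reduction. The function name `neighbor_joining`, the `frozenset`-keyed distance dictionary, and the `node0`, `node1`, ... labels are hypothetical choices made for illustration only.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the edges (node_a, node_b, length) of an unrooted NJ tree.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    nodes = list(names)
    d = dict(dist)  # working copy of the distance matrix
    edges = []
    next_id = 0

    # Repeat the agglomeration cycle until only three nodes remain.
    while len(nodes) > 3:
        r = len(nodes)
        # Row sums S_i = sum over k of d(i, k).
        s = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}
        # Pick the pair minimising Q_ij = (r - 2) * d_ij - S_i - S_j.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (r - 2) * d[frozenset(p)] - s[p[0]] - s[p[1]],
        )
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node u.
        li = 0.5 * dij + (s[i] - s[j]) / (2 * (r - 2))
        lj = dij - li
        u = f"node{next_id}"
        next_id += 1
        edges += [(i, u, li), (j, u, lj)]
        # Plain NJ reduction: i and j contribute with equal weights 1/2.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    # Join the last three nodes to a central node.
    a, b, c = nodes
    dab = d[frozenset((a, b))]
    dac = d[frozenset((a, c))]
    dbc = d[frozenset((b, c))]
    centre = f"node{next_id}"
    edges += [
        (a, centre, 0.5 * (dab + dac - dbc)),
        (b, centre, 0.5 * (dab + dbc - dac)),
        (c, centre, 0.5 * (dac + dbc - dab)),
    ]
    return edges
```

Each pass over the pairs costs O(n^2) and there are about n passes, which matches the O(n^3) complexity quoted above. On distances that fit a tree exactly, this sketch should recover the branch lengths of that tree; with estimated distances, the equal-weight reduction is the place where a BIONJ-style weighting would differ.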