New approaches for reconstructing phylogenies from gene order data

Bioinformatics, Jun 2001

We report on new techniques we have developed for reconstructing phylogenies on whole genomes. Our mathematical techniques include new polynomial-time methods for bounding the inversion length of a candidate tree and new polynomial-time methods for estimating genomic distances which greatly improve the accuracy of neighbor-joining analyses. We demonstrate the power of these techniques through an extensive performance study based on simulating genome evolution under a wide range of model conditions. Combining these new tools with standard approaches (fast reconstruction with neighbor-joining, exploration of all possible refinements of strict consensus trees, etc.) has allowed us to analyze datasets that were previously considered computationally impractical. In particular, we have conducted a complete phylogenetic analysis of a subset of the Campanulaceae family, confirming various conjectures about the relationships among members of the subset and about the principal mechanism of evolution for their chloroplast genome. We give representative results of the extensive experimentation we conducted on both real and simulated datasets in order to validate and characterize our approaches. We find that our techniques provide very accurate reconstructions of the true tree topology even when the data are generated by processes that include a significant fraction of transpositions and when the data are close to saturation. Contact: moret{at}cs.unm.edu or tandy{at}cs.utexas.edu

Bernard M.E. Moret, Li-San Wang, Tandy Warnow, Stacia K. Wyman

Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, USA and Department of Computer Sciences, University of Texas, Austin, TX 78712, USA

INTRODUCTION

Biologists can infer the ordering and strandedness of genes on a chromosome, and thus represent each chromosome by an ordering of signed genes (where the sign indicates the strand).
These gene orders can be rearranged by evolutionary events such as inversions and transpositions and, because they evolve slowly, give us an important new source of data for phylogeny reconstruction. Many biologists have already embraced this new source of data in their phylogenetic work (Downie and Palmer, 1992; Olmstead and Palmer, 1994; Palmer, 1992; Raubeson and Jansen, 1992). Appropriate tools for analyzing such data may help resolve some difficult phylogenetic reconstruction problems. Developing such tools is thus an important area of research; indeed, the recent DCAF symposium was devoted to this topic, as was a workshop at DIMACS.

A natural optimization problem for phylogeny reconstruction from gene order data is to reconstruct an evolutionary scenario with a minimum number of the permitted evolutionary events on the tree. This problem is NP-hard for most criteria; even the very simple problem of computing the median of three genomes under such models is NP-hard (Caprara, 1999; Pe'er and Shamir, 1998). All approaches to phylogeny reconstruction for such data must therefore find ways of handling the significant computational difficulties. Moreover, because suboptimal solutions can yield very different evolutionary reconstructions, exact solutions are strongly preferred over approximate solutions (see Swofford et al., 1996).

For some datasets (e.g., chloroplast genomes of land plants), biologists conjecture that the only rearrangement events that occur are inversions. In other datasets, transpositions and inverted transpositions are viewed as possible, but their relative preponderance with respect to inversions is unknown, so that it is difficult to define a suitable distance measure based on these three events.
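
To make the three rearrangement events concrete, the following minimal Python sketch applies an inversion, a transposition, and an inverted transposition to a signed gene order. It is an illustration only, not code from the paper: the function names and index conventions are hypothetical, and the genome is treated as a linear list for simplicity (a circular genome would also need segments that wrap around the end).

```python
from typing import List

# A genome is a list of signed integers: the absolute value names the gene,
# the sign gives its strand. (Illustrative representation, assumed here.)
Genome = List[int]


def inversion(genome: Genome, i: int, j: int) -> Genome:
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    flipped = [-g for g in reversed(genome[i:j])]
    return genome[:i] + flipped + genome[j:]


def transposition(genome: Genome, i: int, j: int, k: int) -> Genome:
    """Move the segment genome[i:j] to just after genome[j:k] (i < j <= k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome: Genome, i: int, j: int, k: int) -> Genome:
    """Like transposition, but the moved segment is also reversed and sign-flipped."""
    flipped = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + flipped + genome[k:]


if __name__ == "__main__":
    g = [1, 2, 3, 4, 5]
    print(inversion(g, 1, 4))                  # [1, -4, -3, -2, 5]
    print(transposition(g, 1, 3, 5))           # [1, 4, 5, 2, 3]
    print(inverted_transposition(g, 1, 3, 5))  # [1, 4, 5, -3, -2]
```

Each function returns a new list, so a simulated evolutionary history can be built by chaining calls along the edges of a tree.
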
Researchers have used breakpoint distance (the number of pairwise gene adjacencies present in one genome but absent in the other, not a count of evolutionary events) as an independent measure of distance between genomes; the breakpoint phylogeny, proposed by Blanchette et al. (1997), is the most parsimonious tree with respect to breakpoint distances.

PRIOR RESULTS

We build on several major prior results.

BPAnalysis. Blanchette et al. (1997) proposed the breakpoint phylogeny (finding the tree with the fewest breakpoints) and developed a reconstruction method, BPAnalysis (Sankoff and Blanchette, 1998), for that purpose. Their method examines every possible tree topology in turn and, for each topology, generates a set of ancestral genomes so as to minimize the total breakpoint distance in the tree. This method returns good results, but takes exponential time: the number of topologies is exponential, and generating a set of ancestral genomes is achieved through an unbounded iterative process that must solve an instance of the Travelling Salesperson Problem (TSP) for each internal node at each iteration. Hence the total running time is exponential in both the number of genes and the number of genomes.

MPBE. We developed an alternate method, based on a binary encoding of breakpoints, to take advantage of existing parsimony software (Cosner et al., 2000b,a). This method, called Maximum Parsimony on Binary Encodings (MPBE), is exponential only in the number of genomes (because the parsimony problem is NP-hard) and runs very fast in practice, but returns only candidate tree topologies and so must make use of the labeling phase of BPAnalysis in order to return ancestral genomes. (Similar approaches based on neighbor-joining suffer from the same problem.)

GRAPPA. We reimplemented BPAnalysis in order to analyze our larger datasets and also to experiment with alternate approaches.
Our program, called GRAPPA (Moret et al., 2001b), includes all of the features of BPAnalysis, but runs about three orders of magnitude faster. As part of the development of GRAPPA, we designed a new and very fast linear-time algorithm for computing inversion distances (Bader et al., 2000), which has enabled us to extend our work on breakpoint phylogeny to the inversion phylogeny.

IEBP. We developed a mathematical technique for estimating the maximum likelihood evolutionary distance between two genomes (Wang and Warnow, 2001). This technique, called IEBP for Inverting Expected Breakpoint Distances, has provable error bounds and performs well empirically. Furthermore, using IEBP distances for neighbor-joining analyses results in improved estimations of the true phylogenetic tree.

NEW RESULTS

We present several new results in this paper:

EDE, a new polynomial-time technique for estimating evolutionary distances between genomes. EDE is not as good an estimator as IEBP, but neighbor-joining trees based on EDE distance estimates are more accurate than neighbor-joining trees based on any other distance, including IEBP distances.

A simulation study examining the relationship between topological accuracy and two definitions of tree length: the number of breakpoints on the tree and the number of inversions on the tree. We find that both definitions for tree length are correlated with topological accuracy, with the correlation weakest for genomes of 37 genes (the mit (...truncated)
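
The breakpoint-based notion of tree length discussed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BPAnalysis or GRAPPA implementation: genomes are assumed circular with identical gene content, an adjacency read on one strand as (a, b) is treated as identical to (-b, -a) on the other, every tree node (including internal ones) is assumed to already carry a genome, and all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Genome = List[int]  # signed gene order; sign = strand (assumed representation)


def adjacencies(genome: Genome) -> Set[Tuple[int, int]]:
    """Return the adjacencies of a circular signed genome in canonical form."""
    n = len(genome)
    result = set()
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]  # wrap around: circular genome
        # (a, b) and (-b, -a) describe the same adjacency; keep the smaller tuple.
        result.add(min((a, b), (-b, -a)))
    return result


def breakpoint_distance(g1: Genome, g2: Genome) -> int:
    """Count adjacencies present in g1 but absent in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


def breakpoint_tree_length(edges: List[Tuple[str, str]],
                           genomes: Dict[str, Genome]) -> int:
    """Sum the breakpoint distances over all edges of a fully labeled tree."""
    return sum(breakpoint_distance(genomes[u], genomes[v]) for u, v in edges)


if __name__ == "__main__":
    genomes = {
        "A": [1, 2, 3, 4, 5],     # internal node
        "X": [1, -4, -3, -2, 5],  # one inversion away from A
        "Y": [1, 2, 3, 4, 5],     # identical to A
        "Z": [1, 4, 5, 2, 3],     # one transposition away from A
    }
    edges = [("A", "X"), ("A", "Y"), ("A", "Z")]
    print(breakpoint_distance(genomes["A"], genomes["X"]))  # 2
    print(breakpoint_tree_length(edges, genomes))           # 5
```

In this toy example a single inversion creates two breakpoints and a single transposition creates three, which illustrates why breakpoint distance is not a count of evolutionary events. An inversion-based tree length would replace `breakpoint_distance` with an inversion-distance routine, which is not shown here.
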


This is a preview of a remote PDF: https://bioinformatics.oxfordjournals.org/content/17/suppl_1/S165.full.pdf
Article home page: http://bioinformatics.oxfordjournals.org/content/17/suppl_1/S165.abstract

Bernard M.E. Moret, Li-San Wang, Tandy Warnow, Stacia K. Wyman. New approaches for reconstructing phylogenies from gene order data, Bioinformatics, 2001, pp. S165-S173, 17/suppl 1, DOI: 10.1093/bioinformatics/17.suppl_1.S165