New approaches for reconstructing phylogenies from gene order data
Bernard M.E. Moret, Li-San Wang, Tandy Warnow, Stacia K. Wyman
Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, USA and Department of Computer Sciences, University of Texas, Austin, TX 78712, USA
We report on new techniques we have developed for reconstructing phylogenies on whole genomes. Our mathematical techniques include new polynomial-time methods for bounding the inversion length of a candidate tree and new polynomial-time methods for estimating genomic distances which greatly improve the accuracy of neighbor-joining analyses. We demonstrate the power of these techniques through an extensive performance study based on simulating genome evolution under a wide range of model conditions. Combining these new tools with standard approaches (fast reconstruction with neighbor-joining, exploration of all possible refinements of strict consensus trees, etc.) has allowed us to analyze datasets that were previously considered computationally impractical. In particular, we have conducted a complete phylogenetic analysis of a subset of the Campanulaceae family, confirming various conjectures about the relationships among members of the subset and about the principal mechanism of evolution for their chloroplast genome. We give representative results of the extensive experimentation we conducted on both real and simulated datasets in order to validate and characterize our approaches. We find that our techniques provide very accurate reconstructions of the true tree topology even when the data are generated by processes that include a significant fraction of transpositions and when the data are close to saturation.
INTRODUCTION
Biologists can infer the ordering and strandedness of
genes on a chromosome, and thus represent each
chromosome by an ordering of signed genes (where the sign
indicates the strand). These gene orders can be rearranged
by evolutionary events such as inversions and
transpositions and, because they evolve slowly, give us an important
new source of data for phylogeny reconstruction. Many
biologists have already embraced this new source of data
in their phylogenetic work (Downie and Palmer, 1992;
Olmstead and Palmer, 1994; Palmer, 1992; Raubeson and
Jansen, 1992). Appropriate tools for analyzing such data
may help resolve some difficult phylogenetic
reconstruction problems. Developing such tools is thus an important
area of research; indeed, the recent DCAF symposium
was devoted to this topic, as was a workshop at DIMACS.
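The representation described above can be made concrete with a minimal sketch. It assumes a linear chromosome stored as a Python list of signed integers (the sign gives the strand); the function names `inversion` and `transposition` and the index conventions are illustrative choices, not taken from the paper.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i..j] (inclusive) and flip the sign of each gene in it."""
    segment = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + segment + genome[j + 1:]


def transposition(genome, i, j, k):
    """Move the segment genome[i..j] so that it sits just before original position k (k > j)."""
    return genome[:i] + genome[j + 1:k] + genome[i:j + 1] + genome[k:]


identity = [1, 2, 3, 4, 5]
inverted = inversion(identity, 1, 3)             # genes 2..4 reversed, strands flipped
transposed = transposition(identity, 1, 2, 4)    # genes 2,3 moved after gene 4
```

An inverted transposition, under the same assumptions, would be a transposition followed by an inversion of the moved segment.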
A natural optimization problem for phylogeny
reconstruction from gene order data is to reconstruct an
evolutionary scenario with a minimum number of the
permitted evolutionary events on the tree. This problem is
NP-hard for most criteria; even the very simple problem
of computing the median of three genomes under such
models is NP-hard (Caprara, 1999; Pe'er and Shamir,
1998). All approaches to phylogeny reconstruction for
such data must therefore find ways of handling the
significant computational difficulties. Moreover, because
suboptimal solutions can yield very different evolutionary
reconstructions, exact solutions are strongly preferred
over approximate solutions (see Swofford et al., 1996).
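To illustrate why the median problem is costly, the following sketch solves it by exhaustive search for very small genomes. It assumes linear signed genomes and scores a candidate by the sum of its breakpoint distances to the three inputs (counting adjacencies of one genome that are absent from the other); the function names are hypothetical. The search visits all 2^n n! signed orderings, so it is usable only for a handful of genes.

```python
from itertools import permutations, product


def adjacencies(genome):
    """Set of adjacencies of a linear signed genome; (a, b) and (-b, -a) are the same adjacency."""
    return {min((a, b), (-b, -a)) for a, b in zip(genome, genome[1:])}


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


def brute_force_median(g1, g2, g3):
    """Try every signed ordering of the genes; return (best score, one best median)."""
    genes = sorted(abs(g) for g in g1)
    best = None
    for order in permutations(genes):
        for signs in product((1, -1), repeat=len(genes)):
            candidate = [s * g for s, g in zip(signs, order)]
            score = sum(breakpoint_distance(candidate, g) for g in (g1, g2, g3))
            if best is None or score < best[0]:
                best = (score, candidate)
    return best


score, median = brute_force_median([1, 2, 3], [1, 2, 3], [1, -2, 3])
```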
For some datasets (e.g., chloroplast genomes of land
plants), biologists conjecture that the only
rearrangement events that occur are inversions. In other datasets,
transpositions and inverted transpositions are viewed as
possible, but their relative preponderance with respect to
inversions is unknown, so that it is difficult to define a
suitable distance measure based on these three events.
Researchers have used breakpoint distance (the number of
pairwise gene adjacencies present in one genome but
absent in the other, not a count of evolutionary events)
as an independent measure of distance between genomes;
the breakpoint phylogeny, proposed by Blanchette et
al. (1997), is the most parsimonious tree
with respect to breakpoint distances.
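The breakpoint distance defined above can be sketched in a few lines. This is a minimal illustration, assuming both genomes contain the same genes and are given as lists of signed integers; the `circular` flag and the function names are assumptions made for the example. An adjacency (a, b) is treated as identical to (-b, -a), since reading the other strand yields the same pair.

```python
def adjacencies(genome, circular=False):
    """Set of gene adjacencies of a signed genome, in a strand-independent canonical form."""
    pairs = list(zip(genome, genome[1:]))
    if circular and len(genome) > 1:
        pairs.append((genome[-1], genome[0]))  # close the circle
    return {min((a, b), (-b, -a)) for a, b in pairs}


def breakpoint_distance(g1, g2, circular=False):
    """Number of adjacencies present in g1 but absent in g2."""
    return len(adjacencies(g1, circular) - adjacencies(g2, circular))


g1 = [1, 2, 3, 4, 5]
g2 = [1, -4, -3, -2, 5]   # g1 after one inversion of genes 2..4
d = breakpoint_distance(g1, g2)
```

In this example a single inversion breaks the two adjacencies at the ends of the inverted segment, while the adjacencies inside the segment are preserved.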
PRIOR RESULTS
We build on several major prior results.
BPAnalysis. Blanchette et al. (1997)
proposed the breakpoint phylogeny (finding the tree with
the fewest breakpoints) and developed a reconstruction
method, BPAnalysis (Sankoff and Blanchette, 1998),
for that purpose. Their method examines every possible
tree topology in turn and for each topology, it generates
a set of ancestral genomes so as to minimize the total
breakpoint distance in the tree. This method returns
good results, but takes exponential time: the number of
topologies is exponential and generating a set of ancestral
genomes is achieved through an unbounded iterative
process that must solve an instance of the Travelling
Salesperson Problem (TSP) for each internal node at each
iteration. Hence the total running time is exponential
in both the number of genes and the number of genomes.
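To give a sense of how quickly the number of topologies grows, the sketch below uses the standard count of unrooted binary trees on n labeled leaves, (2n-5)!! = 3 x 5 x ... x (2n-5). The formula is a well-known combinatorial result rather than something stated in the text above, and the function name is illustrative.

```python
def num_unrooted_binary_trees(n):
    """Number of distinct unrooted binary tree topologies on n labeled leaves, n >= 3."""
    if n < 3:
        return 1
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 5, 10, 13):
    print(n, num_unrooted_binary_trees(n))
```

An exhaustive search must, for each of these topologies, also run the iterative labeling step with its TSP instances, which is why the approach does not scale beyond small datasets.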
MPBE. We developed an alternate method, based on
a binary encoding of breakpoints, to take advantage of
existing parsimony software (Cosner et al., 2000a,b).
This method, called Maximum Parsimony on Binary
Encodings (MPBE), is exponential only in the number of
genomes (because the parsimony problem is NP-hard),
runs very fast in practice, but returns only candidate tree
topologies and so must make use of the labeling phase
of BPAnalysis in order to return ancestral genomes.
(Similar approaches based on neighbor-joining suffer
from the same problem.)
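One way to picture a binary encoding of this kind is sketched below. It assumes one binary character per adjacency observed in at least one input genome, set to 1 when the genome contains that adjacency; the details of the encoding actually used by MPBE are in the cited papers, and the function names here are hypothetical. The resulting 0/1 rows could then be handed to standard parsimony software.

```python
def adjacencies(genome):
    """Adjacencies of a linear signed genome; (a, b) and (-b, -a) denote the same adjacency."""
    return {min((a, b), (-b, -a)) for a, b in zip(genome, genome[1:])}


def binary_encoding(genomes):
    """Map each named genome to a 0/1 row with one column per adjacency seen in any genome."""
    per_genome = {name: adjacencies(g) for name, g in genomes.items()}
    columns = sorted(set().union(*per_genome.values()))
    rows = {
        name: [1 if adj in adjs else 0 for adj in columns]
        for name, adjs in per_genome.items()
    }
    return columns, rows


columns, rows = binary_encoding({"A": [1, 2, 3, 4], "B": [1, -3, -2, 4]})
```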
GRAPPA. We reimplemented BPAnalysis in order to
analyze our larger datasets and also to experiment with
alternate approaches. Our program, called GRAPPA (Moret
et al., 2001b), includes all of the features of BPAnalysis,
but runs about three orders of magnitude faster. As part
of the development of GRAPPA, we designed a new and
very fast linear-time algorithm for computing inversion
distances (Bader et al., 2000), which has enabled us to
extend our work on breakpoint phylogeny to the inversion
phylogeny.
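For readers unfamiliar with inversion distance, the sketch below computes it by breadth-first search over all inversions. This is not the linear-time algorithm of Bader et al. (2000); it is an exponential-time illustration of the quantity being computed, assuming linear signed genomes over the same genes, and is practical only for a handful of genes. The function name is illustrative.

```python
from collections import deque


def inversion_distance_bfs(source, target):
    """Minimum number of inversions turning source into target (brute force, tiny inputs only)."""
    source, target = tuple(source), tuple(target)
    if source == target:
        return 0
    n = len(source)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        genome, dist = queue.popleft()
        for i in range(n):
            for j in range(i, n):
                # invert the segment genome[i..j]: reverse the order and flip the signs
                nxt = genome[:i] + tuple(-g for g in reversed(genome[i:j + 1])) + genome[j + 1:]
                if nxt == target:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # not reached when both genomes have the same gene content


d = inversion_distance_bfs([1, -2, 3], [1, 2, 3])
```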
IEBP. We developed a mathematical technique for
estimating the maximum likelihood evolutionary distance
between two genomes (Wang and Warnow, 2001). This
technique, called IEBP for Inverting Expected
Breakpoint Distances, has provable error bounds and performs
well empirically. Furthermore, using IEBP distances for
neighbor-joining analyses results in improved estimations
of the true phylogenetic tree.
NEW RESULTS
We present several new results in this paper:
• EDE, a new polynomial-time technique for estimating
evolutionary distances between genomes. EDE is not
as good an estimator as IEBP, but neighbor-joining
trees based on EDE distance estimates are more
accurate than neighbor-joining trees based on any
other distance, including IEBP distances.
• A simulation study examining the relationship
between topological accuracy and two definitions of
tree length: the number of breakpoints on the tree and
the number of inversions on the tree. We find that
both definitions for tree length are correlated with
topological accuracy, with the correlation weakest for
genomes of 37 genes (the mit (...truncated)