PhyloFinder: An intelligent search engine for phylogenetic tree databases
BMC Evolutionary Biology
BioMed Central
Software
Open Access
Duhong Chen*1, J Gordon Burleigh2, Mukul S Bansal1 and
David Fernández-Baca1
Address: 1Department of Computer Science, Iowa State University, Ames, IA 50011, USA and 2NESCent, Durham, NC 27705, USA
Email: Duhong Chen* - ; J Gordon Burleigh - ; Mukul S Bansal - ; David Fernández-Baca -
* Corresponding author
Published: 21 March 2008
BMC Evolutionary Biology 2008, 8:90
doi:10.1186/1471-2148-8-90
Received: 10 September 2007
Accepted: 21 March 2008
This article is available from: http://www.biomedcentral.com/1471-2148/8/90
© 2008 Chen et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: Bioinformatic tools are needed to store and access the rapidly growing phylogenetic
data. These tools should enable users to identify existing phylogenetic trees containing a specified
taxon or set of taxa and to compare a specified phylogenetic hypothesis to existing phylogenetic
trees.
Results: PhyloFinder is an intelligent search engine for phylogenetic databases that we have
implemented using trees from TreeBASE. It enables taxonomic queries, in which it identifies trees
in the database containing the exact name of the query taxon and/or any synonymous taxon names,
and it provides spelling suggestions for the query when there is no match. Additionally, PhyloFinder
can identify trees containing descendants or direct ancestors of the query taxon. PhyloFinder also
performs phylogenetic queries, in which it identifies trees that contain the query tree or topologies
that are similar to the query tree.
Conclusion: PhyloFinder can enhance the utility of any tree database by providing tools for both
taxonomic and phylogenetic queries as well as visualization tools that highlight the query results
and provide links to NCBI and TBMap. An implementation of PhyloFinder using trees from
TreeBASE is available from the web client application found in the availability and requirements
section.
Background
The rapidly expanding wealth of phylogenetic information from across the tree of life offers unprecedented
opportunities for large-scale evolutionary studies and for
examining an array of biological questions in a phylogenetic context [1]. However, much of the published phylogenetic data is not easily accessible. Therefore, the storage
and efficient retrieval of phylogenetic data are important
challenges for bioinformatics [1-5]. TreeBASE is the largest
relational database of published phylogenetic information. It stores more than 4,400 trees that contain over
75,000 taxa, the data matrices used to infer the trees, and
additional meta-data, such as bibliographic information
and details of the phylogenetic analyses [6,7]. Though
TreeBASE is a valuable repository for phylogenetic data, it
is often difficult to identify and access relevant phylogenetic data from within TreeBASE. In this paper, we present
PhyloFinder, a new phylogenetic tree search engine that
greatly expands upon the current search features in TreeBASE and thus can enhance the utility of TreeBASE, or any
phylogenetic database.
To utilize the existing phylogenetic data effectively, we
need tools that can quickly identify phylogenetic trees
containing a specified taxon or set of taxa and that can
compare a specified phylogenetic hypothesis to existing
phylogenetic trees. The complexity of taxonomy presents
a first major challenge for identifying and accessing phylogenetic data [3,4,6,7]. Taxonomic names used in stored
phylogenetic trees often are based on various inconsistent
taxonomies [6]. Furthermore, taxonomic classifications
and names frequently change, and these changes may not
be reflected in database trees. Consequently, repositories
such as TreeBASE contain many species that are represented by multiple equivalent names. Taxonomic queries
are further complicated by misspellings or unique subspecies designations in stored trees, both of which are common in TreeBASE [6]. Many of these taxonomic issues
have been addressed by TBMap, a database that maps
names of taxa found in TreeBASE to other taxonomic
databases and clusters equivalent taxonomic names [6].
However, TBMap is not incorporated in TreeBASE or in
any other phylogenetic search engine.
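The role of clustered names in a taxonomic query can be illustrated with a minimal Python sketch. It is not the TBMap schema or the PhyloFinder implementation: the cluster contents, the tree identifiers, and the function name `find_trees` are hypothetical, and the spelling suggestions rely on the standard-library `difflib` module rather than on any method described in this paper.

```python
from difflib import get_close_matches

# Hypothetical clusters of names treated as equivalent.
# Illustrative placeholders only, not records taken from TBMap.
CLUSTERS = [
    {"Pinus thunbergii", "Pinus thunbergiana"},
    {"Abies alba"},
]

# Map every known name to the index of its cluster.
NAME_TO_CLUSTER = {
    name: index for index, cluster in enumerate(CLUSTERS) for name in cluster
}

# Hypothetical tree database: tree id -> set of leaf labels.
TREES = {
    "T1": {"Pinus thunbergiana", "Abies alba"},
    "T2": {"Abies alba"},
}


def find_trees(query):
    """Return (matching tree ids, spelling suggestions) for a taxon name."""
    cluster_id = NAME_TO_CLUSTER.get(query)
    if cluster_id is None:
        # No exact or synonymous match: offer close spellings instead.
        suggestions = get_close_matches(
            query, list(NAME_TO_CLUSTER), n=3, cutoff=0.8
        )
        return [], suggestions
    # A tree matches if any of its leaves belongs to the query's cluster.
    names = CLUSTERS[cluster_id]
    hits = sorted(tree for tree, leaves in TREES.items() if leaves & names)
    return hits, []
```

In this sketch, a query on "Pinus thunbergii" also retrieves a tree whose leaf is labeled with the clustered equivalent name, while a misspelled query returns no trees but a list of candidate spellings.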
The hierarchical nature of taxonomic classifications
presents further challenges for accessing phylogenetic
data. The leaves in stored phylogenetic trees may represent
different taxonomic levels, such as families, genera, species, or subspecies. It should be possible for a tree database query to identify trees containing not only the
specific taxon name used in the query, but also trees containing descendants or ancestors of the query taxon [3,4].
For example, a query using the plant family name
"Pinaceae" ideally would identify not only trees that contain the exact name "Pinaceae" but also trees containing
Pinaceae genera such as "Pinus" or "Abies" or species such
as "Pinus thunbergii" or "Abies alba". It also would be useful
to identify trees containing direct ancestors (the internal
nodes on the path from the root of a taxonomy tree to the
query taxon) of the query taxon. Thus, a query on the species name "Pinus thunbergii" would identify trees that contain the genus name "Pinus" or the family name
"Pinaceae" as leaves. Currently, TreeBASE does not
directly utilize information from taxonomic classifications to allow the user to find trees containing ancestors
or descendants of the query taxon [3,4]. Instead, the user
can find all the taxa matching a partial taxon name query.
For example, querying "Pinus@" or even "Pinu@" in
TreeBASE will identify all trees containing "Pinus" in their
species name. However, querying using "Pinaceae@" will
not identify trees with "Pinus" or "Abies" species, because
they do not contain "Pinaceae" in the species name. Alternatively, the user can identify trees with related taxa through
"tree surfing", in which the user identifies neighboring
trees (trees with shared taxa) of a specified tree(s). Tree
surfing can be time-consuming, and it is difficult, if not
impossible, for the user to determine whether they have found all
the trees containing the relevant taxa.
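The difference between partial-name matching and a taxonomy-aware query can be made concrete with a short Python sketch. Everything in it is an assumption made for illustration: the child-to-parent table `PARENT` covers only the Pinaceae example from the text, the tree identifiers are invented, and `prefix_query` merely approximates a wildcard query such as "Pinus@" by prefix matching. It does not reproduce the TreeBASE or PhyloFinder implementations.

```python
# Hypothetical taxonomy fragment stored as child -> parent links.
PARENT = {
    "Pinus": "Pinaceae",
    "Abies": "Pinaceae",
    "Pinus thunbergii": "Pinus",
    "Abies alba": "Abies",
}

# Hypothetical tree database: tree id -> set of leaf labels.
TREES = {
    "T1": {"Pinus thunbergii", "Abies alba"},
    "T2": {"Pinus", "Abies"},
    "T3": {"Pinaceae"},
}


def ancestors(taxon):
    """Direct ancestors: the nodes on the path from the taxon up to the root."""
    found = []
    while taxon in PARENT:
        taxon = PARENT[taxon]
        found.append(taxon)
    return found


def descendants(taxon):
    """All taxa that lie below the given taxon in the taxonomy."""
    return {name for name in PARENT if taxon in ancestors(name)}


def trees_with(names):
    """Ids of trees that have at least one leaf among the given names."""
    wanted = set(names)
    return sorted(tree for tree, leaves in TREES.items() if leaves & wanted)


def prefix_query(prefix):
    """Partial-name matching, approximating a wildcard query by prefix."""
    return sorted(
        tree
        for tree, leaves in TREES.items()
        if any(leaf.startswith(prefix) for leaf in leaves)
    )
```

Here `prefix_query("Pinaceae")` finds only the tree with a leaf literally named "Pinaceae", whereas `trees_with(descendants("Pinaceae"))` also reaches the trees whose leaves are Pinaceae genera or species. Likewise, `trees_with(ancestors("Pinus thunbergii"))` retrieves the trees that use the genus or family name as a leaf.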
Another important feature of an effective phylogenetic
search engine is the ability to make phylogenetic queries,
in which the user can assess a specified tree by comparing
it to the trees in the database [3,5]. Tree mining queries
must first be able to identify all trees that contain or agree
with a query tree, or the trees in the database in which the
quer (...truncated)