phyloXML: XML for evolutionary biology and comparative genomics
Bioinformatics & Systems Biology, Burnham Institute for Medical Research, La Jolla, CA 92037, USA
School of Informatics, Indiana University, Bloomington, IN 47408, USA
Background: Evolutionary trees are central to a wide range of biological studies. In many of these studies, tree nodes and branches need to be associated (or annotated) with various attributes. For example, in studies concerned with organismal relationships, tree nodes are associated with taxonomic names, whereas tree branches have lengths and oftentimes support values. Gene trees used in comparative genomics or phylogenomics are usually annotated with taxonomic information, genome-related data, such as gene names and functional annotations, as well as events such as gene duplications, speciations, or exon shufflings, combined with information related to the evolutionary tree itself. The data standards currently used for evolutionary trees have limited capacities to incorporate such annotations of different data types.
Results: We developed an XML language, named phyloXML, for describing evolutionary trees, as well as various associated data items. PhyloXML provides elements for commonly used items, such as branch lengths, support values, taxonomic names, and gene names and identifiers. By using "property" elements, phyloXML can be adapted to novel and unforeseen use cases. We also developed various software tools for reading, writing, conversion, and visualization of phyloXML-formatted data.
Conclusion: PhyloXML is an XML language defined by a complete schema in XSD that allows storing and exchanging the structures of evolutionary trees as well as associated data. More information about phyloXML itself, the XSD schema, as well as tools implementing and supporting phyloXML, is available at http://www.phyloxml.org.
Background
Information that can be interpreted in a phylogenetic
context is growing rapidly in both types and quantities, due to
the advancement of large-scale studies such as
metagenomics and phylogenomics [1,2]. Current formats for
describing evolutionary trees are becoming increasingly
inadequate for such data. The main limitation of present formats is
the lack of standardized means to annotate tree nodes and
branches with distinct attributes. In the case of species
trees, these attributes are taxonomic names, branch
lengths, and often (possibly multiple) support values
(such as bootstrap values or posterior probabilities). Gene
trees used in comparative genomics and phylogenomics
applications additionally require fields for gene identifiers
and potentially gene duplication events [3], whereas trees
used in phylogeographic [4] applications require fields for
geographic data. While some existing formats such as
Nexus [5] or NHX (New Hampshire eXtended) [6,7] allow
describing additional information associated with
phylogenetic trees, these formats have proven problematic
as standards with respect to extensibility or interoperability.
The complexity of the Nexus format has led to
different parsers that only understand a subset of the
format, and different programs that produce poorly formed
outputs (although an XML-based replacement for the
Nexus format, named "NeXML", is being developed and is
expected to alleviate problems stemming from the
complexity of the Nexus format [8]). The NHX format, built as
an ad hoc extension to the Newick (New Hampshire)
standard [9], has limits in the types of information it can
incorporate, since it was developed with one primary
use case in mind: representing gene trees with inferred
gene duplication events [3]. Previous proposals for an XML
format for systematic data [10] never gained popularity,
possibly due to a lack of supporting software.
Here we describe phyloXML, a new standardized format
for phylogenetic documents that is based on the formal
language of XML [11] and which is inspired by the XML
tree representation described in [12] (this XML format is
used as output format by the "Retree" program from the
PHYLIP package [9]).
Implementation
Along with the complete schema in XSD that defines the
format of phyloXML, a number of tools have been
implemented to support the reading and writing of phyloXML.
The Java command-line tool "phyloxml_converter" can
convert existing formats (Nexus, Newick/New
Hampshire, and NHX) into phyloXML, and "decorator" helps
users insert various data types into a phyloXML tree.
There are multiple tree-viewing programs that support the
format, including Archaeopteryx [13] (the successor to the
tree display tool ATV [7]) and TreeViewJ [14].
Furthermore, Archaeopteryx allows the user to easily convert
phyloXML to Nexus, Newick/New Hampshire, and NHX and
vice versa. So far, phyloXML support has been developed
for three open source libraries for computational
molecular biology and bioinformatics, namely BioPerl [15]
(module Bio::TreeIO::phyloxml), BioRuby (module
Bio::PhyloXML) [16], and Biopython (module
Bio.Tree.PhyloXML) [17]. The XSD schema and links to
supporting applications, together with more complex
examples of phyloXML, can be found at http://www.phyloxml.org.
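To make the idea of format conversion concrete, the following is a minimal sketch of turning a simple Newick string into phyloXML using only the Python standard library. It is not the implementation of "phyloxml_converter"; the function names (`parse_newick`, `to_element`, `newick_to_phyloxml`) and the example tree are hypothetical, quoted labels and comments are not handled, and marking the phylogeny as rooted is an assumption, since Newick itself does not record this.

```python
import xml.etree.ElementTree as ET

PHYLOXML_NS = "http://www.phyloxml.org"


def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples.

    Only parentheses, unquoted labels, and branch lengths are supported.
    """
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def clade():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(clade())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(clade())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos].strip()
        length = None
        if text[pos] == ":":
            pos += 1  # consume ":"
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    tree = clade()
    if pos != len(text) - 1:
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def to_element(node):
    """Build a <clade> element; annotations precede nested child clades."""
    name, length, children = node
    elem = ET.Element("clade")
    if name:
        ET.SubElement(elem, "name").text = name
    if length is not None:
        ET.SubElement(elem, "branch_length").text = str(length)
    for child in children:
        elem.append(to_element(child))
    return elem


def newick_to_phyloxml(newick):
    """Return a phyloXML document (as a string) for a simple Newick tree."""
    root = ET.Element("phyloxml", {"xmlns": PHYLOXML_NS})
    # Assumption: the tree is treated as rooted.
    phylogeny = ET.SubElement(root, "phylogeny", {"rooted": "true"})
    phylogeny.append(to_element(parse_newick(newick.strip())))
    return ET.tostring(root, encoding="unicode")


# Hypothetical three-leaf tree, made up for this illustration.
xml_text = newick_to_phyloxml("((A:0.102,B:0.23):0.06,C:0.4);")
print(xml_text)
```

The sketch parses into plain tuples first and builds the XML afterwards, so that each clade's name and branch length can be written before its child clades even though Newick places the label after the closing parenthesis.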
Results and Discussion
PhyloXML is general, with over 20 different elements that
encompass an extensive range of information (such as
confidence values, sequence, and taxonomic data) that
could be added to phylogenies. PhyloXML is extensible:
the schema provides sanctioned constructs for user-defined content,
and the vocabulary of the schema is easy to expand
without disrupting existing usage. Because the
format is defined by an XML schema, phyloXML is also easy to
validate and process. The structure of the document is
readily parsed by any existing XML parser, while
interpreting the content needs to be implemented depending on
the use case. Because the XML schema is restrictive,
documents can be checked unambiguously for being "well-formed"
and "valid", which facilitates data exchange among users
and programs to a degree that was not feasible before.
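As an illustration of the split between generic parsing and use-case-specific interpretation, the sketch below reads a small phyloXML document with Python's standard XML parser and then walks the nested clade elements to collect the external nodes. The document content, the helper name `external_nodes`, and the choice of what to extract are assumptions made for this example; the sketch does not validate against the XSD schema.

```python
import xml.etree.ElementTree as ET

# Namespace URI declared by phyloXML documents.
NS = {"phy": "http://www.phyloxml.org"}

# Hypothetical three-leaf tree, made up for this illustration.
DOC = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>toy tree</name>
    <clade>
      <clade>
        <branch_length>0.06</branch_length>
        <clade>
          <name>A</name>
          <branch_length>0.102</branch_length>
        </clade>
        <clade>
          <name>B</name>
          <branch_length>0.23</branch_length>
        </clade>
      </clade>
      <clade>
        <name>C</name>
        <branch_length>0.4</branch_length>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>
"""


def external_nodes(clade):
    """Return (name, branch_length) for every clade without child clades."""
    children = clade.findall("phy:clade", NS)
    if not children:
        name = clade.findtext("phy:name", default="", namespaces=NS)
        length = clade.findtext("phy:branch_length", namespaces=NS)
        return [(name, float(length) if length is not None else None)]
    leaves = []
    for child in children:
        leaves.extend(external_nodes(child))
    return leaves


# Step 1: generic XML parsing (any XML parser can do this).
root = ET.fromstring(DOC)

# Step 2: interpretation specific to this use case.
root_clade = root.find("phy:phylogeny", NS).find("phy:clade", NS)
leaves = external_nodes(root_clade)
print(leaves)
```

Only the second step knows anything about phylogenies; a different application could reuse the same parsed tree to extract, for example, support values or taxonomic annotations instead.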
Similar to NHX, and unlike Nexus, the structure of
phyloXML is phylogeny oriented rather than character
oriented. The basic structure of a phyloXML document is a
hierarchical cluster of recursive clades. Each clade
corresponds to a node, and the set of clades that congregate at
the root compose a phylogeny. Each clade element can
also enclose nested elements that are annotations to the
containing clade. This kind of hierarchical representation
of the phylogeny and its corresponding annotations in
each level is not only intuitive, but also naturally suitable
for a description by XML. The following is an example of
a phyloXML document describing a simple gene tree with
three external nodes (for more examples, see Additional
file 1).
<name>Alcohol dehydrogenases</name>
<description>contains examples of commonly used elements</description>
<speciations>1</speciations>
<scientific_name>Bacillus
scientific_name>
<name>Alcohol dehydrogenase</name>
<scientific_name>Escherichia
scienti (...truncated)