phyloXML: XML for evolutionary biology and comparative genomics (pdf)

Article PDF cannot be displayed. You can download it here:

http://www.biomedcentral.com/content/pdf/1471-2105-10-356.pdf

phyloXML: XML for evolutionary biology and comparative genomics

0 Bioinformatics & Systems Biology, Burnham Institute for Medical Research , La Jolla, CA 92037 , USA 1 School of Informatics, Indiana University , Bloomington, IN 47408 , USA Background: Evolutionary trees are central to a wide range of biological studies. In many of these studies, tree nodes and branches need to be associated (or annotated) with various attributes. For example, in studies concerned with organismal relationships, tree nodes are associated with taxonomic names, whereas tree branches have lengths and oftentimes support values. Gene trees used in comparative genomics or phylogenomics are usually annotated with taxonomic information, genome-related data, such as gene names and functional annotations, as well as events such as gene duplications, speciations, or exon shufflings, combined with information related to the evolutionary tree itself. The data standards currently used for evolutionary trees have limited capacities to incorporate such annotations of different data types. Results: We developed a XML language, named phyloXML, for describing evolutionary trees, as well as various associated data items. PhyloXML provides elements for commonly used items, such as branch lengths, support values, taxonomic names, and gene names and identifiers. By using "property" elements, phyloXML can be adapted to novel and unforeseen use cases. We also developed various software tools for reading, writing, conversion, and visualization of phyloXML formatted data. Conclusion: PhyloXML is an XML language defined by a complete schema in XSD that allows storing and exchanging the structures of evolutionary trees as well as associated data. More information about phyloXML itself, the XSD schema, as well as tools implementing and supporting phyloXML, is available at http://www.phyloxml.org. - Background Information that can be interpreted in a phylogenetic context is growing rapidly in both types and quantities, due to the advancement of large-scale studies such as metagenomics and phylogenomics [1,2]. Current formats for describing evolutionary trees are becoming increasingly inappropriate. The main limitation of present formats is the lack of standardized means to annotate tree nodes and branches with distinct attributes. In the case of species trees, these attributes are taxonomic names, branch lengths, and often (possibly multiple) support values (such as bootstrap values or posterior probabilities). Gene trees used in comparative genomics and phylogenomics applications additionally require fields for gene identifiers and potentially gene duplication events [3], whereas trees used in phylogeographic [4] applications require fields for <scientific_name>Octopus scientific_name> geographic data. While some existing formats such as Nexus [5] or NHX (New Hampshire eXtended) [6,7] allow describing additional information associated with phylogenetic trees, these formats have been shown to be problematic in the extensibility or the interoperability as a standard. The complexity of the Nexus format has led to different parsers that only understand a subset of the format, and different programs that produce poorly formed outputs (although a XML based replacement for the Nexus format, named "NeXML", is being developed and is expected to alleviate problems stemming from the complexity of the Nexus format [8]). The NHX format, built as an adhoc extension to the Newick (New Hampshire) standard [9] has limits in the types of information it can incorporate, since it has been developed with one primary use case in mind - representing gene trees with inferred gene duplication events [3]. Previous proposals for a XML format for systematic data [10] never gained popularity, possibly due to a lack of supporting software. Here we describe phyloXML, a new standardized format for phylogenetic documents that is based on the formal language of XML [11] and which is inspired by the XML tree representation described in [12] (this XML format is used as output format by the "Retree" program from the PHYLIP package [9]). Implementation Along with the complete schema in XSD that defines the format of phyloXML, a number of tools have been implemented to support the reading and writing of phyloXML. The Java command-line tools "phyloxml_converter" can convert existing formats (Nexus, Newick/New Hampshire, and NHX) into phyloXML, and "decorator" helps the users insert various data types into a phyloXML tree. There are multiple tree-viewing programs that support the format, including Archaeopteryx [13] (the successor to the tree display tool ATV [7]) and TreeViewJ [14]. Furthermore, Archaeopteryx allows the user to easily convert phyloXML to Nexus, Newick/New Hampshire, and NHX and vice versa. So far, phyloXML support has been developed for three open source libraries for computational molecular biology and bioinformatics, namely BioPerl [15] (module Bio::TreeIO::phyloxml), BioRuby (module Bio::PhyloXML) [16], and Biopython (module Bio.Tree.PhyloXML) [17]. The XSD schema and links to supporting applications, together with more complex examples of phyloXML can be found at http://www.phy loxml.org. Results and Discussion PhyloXML is general, with over 20 different elements that encompass an extensive range of information (such as confidence values, sequence, and taxonomic data) that could be added to phylogenies. PhyloXML is extensible, containing legitimate grammar for user-defined contents, while it is also easy to expand the vocabulary of the schema without disrupting existing usage. Because the format is defined by a XML schema, phyloXML is also easy to validate and process. The structure of the document is readily parsed by any existing XML parser, while interpreting the content needs to be implemented depending on the use case. Because of the restrictive nature of the XML schema, unambiguous "well-formed" and "valid" documents will facilitate greater data exchange among users and programs that was not feasible before. Similar to NHX, and unlike Nexus, the structure of phyloXML is phylogeny oriented rather than character oriented. The basic structure of a phyloXML document is a hierarchical cluster of recursive clades. Each clade corresponds to a node, and the set of clades that congregate at the root compose a phylogeny. Each clade element can also enclose nested elements that are annotations to the containing clade. This kind of hierarchical representation of the phylogeny and its corresponding annotations in each level is not only intuitive, but also naturally suitable for a description by XML. The following is an example of a phyloXML document describing a simple gene tree with three external nodes (for more examples, [see Additional file 1]). <name>Alcohol dehydrogenases</name> <description>contains examples of commonly used elements</description> <speciations>1</speciations> <scientific_name>Bacillus scientific_name> <name>Alcohol dehydrogenase</name> <scientific_name>Escherichia scienti (...truncated)