MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information
BMC Bioinformatics
Kishori M Konwar 1
Niels W Hanson 0
Antoine P Pagé 1
Steven J Hallam 0 1
0 Graduate Program in Bioinformatics, University of British Columbia , Vancouver, BC , Canada
1 Department of Microbiology & Immunology, University of British Columbia , Vancouver, BC V6T1Z3 , Canada
Background: A central challenge to understanding the ecological and biogeochemical roles of microorganisms in natural and human engineered ecosystems is the reconstruction of metabolic interaction networks from environmental sequence information. The dominant paradigm in metabolic reconstruction is to assign functional annotations using BLAST. Functional annotations are then projected onto symbolic representations of metabolism in the form of KEGG pathways or SEED subsystems.
Results: Here we present MetaPathways, an open source pipeline for pathway inference that uses the PathoLogic algorithm to map functional annotations onto the MetaCyc collection of reactions and pathways, and construct environmental Pathway/Genome Databases (ePGDBs) compatible with the editing and navigation features of Pathway Tools. The pipeline accepts assembled or unassembled nucleotide sequences, performs quality assessment and control, predicts and annotates noncoding genes and open reading frames, and produces inputs to PathoLogic. In addition to constructing ePGDBs, MetaPathways uses MLTreeMap to build phylogenetic trees for selected taxonomic anchor and functional gene markers, converts General Feature Format (GFF) files into concatenated GenBank files for ePGDB construction based on third-party annotations, and generates useful file formats including Sequin files for direct GenBank submission and gene feature tables summarizing annotations, MLTreeMap trees, and ePGDB pathway coverage summaries for statistical comparisons.
Conclusions: MetaPathways provides users with a modular annotation and analysis pipeline for predicting metabolic interaction networks from environmental sequence information using an alternative to KEGG pathways and SEED subsystems mapping. It is extensible to genomic and transcriptomic datasets from a wide range of sequencing platforms, and generates useful data products for microbial community structure and function analysis.
The MetaPathways software package, installation instructions, and example data can be obtained from http://hallam.microbiology.ubc.ca/MetaPathways.
Keywords: Environmental Pathway/Genome Database (ePGDB); Metagenome; Pathway Tools; PathoLogic; MetaCyc; Microbial community; Metabolism; Metabolic interaction networks
Background
Metabolic interactions between microorganisms direct
matter and energy transformations integral to ecosystem
function [1-3]. Plurality sequencing methods enable
exploration of potential (metagenomic) and expressed
(metatranscriptomic) metabolic interactions with the aid
of computational methods that assemble or cluster
contiguous reads, search for patterns or motifs representing
genes, and reconstruct pathways from environmental
sequence information [4-6]. The prevailing paradigm in
pathway reconstruction is to assign functional
annotation based on sequence homology using BLAST [7].
Functional annotations are then projected onto
symbolic representations of metabolism such as KEGG
pathways [8-10] or SEED subsystems [11], revealing
network structure.
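The annotate-then-project paradigm described above can be illustrated with a minimal sketch. It assumes BLAST tabular output in the standard 12-column layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function names, the E-value cutoff, and the toy subject-to-pathway table are hypothetical and are not taken from any particular pipeline or database.

```python
from collections import defaultdict


def best_hits(blast_lines, max_evalue=1e-5):
    """Return {query_id: subject_id}, keeping the highest-bit-score hit per query.

    Assumes tab-separated BLAST tabular output with the 12 standard columns;
    the E-value cutoff is an illustrative default, not a recommended setting.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # discard weak hits
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def project_onto_pathways(hits, subject_to_pathways):
    """Group query ORFs under every pathway their best-hit subject maps to."""
    pathways = defaultdict(set)
    for query, subject in hits.items():
        for pathway in subject_to_pathways.get(subject, ()):
            pathways[pathway].add(query)
    return dict(pathways)


# Toy data (hypothetical identifiers, for illustration only).
EXAMPLE_BLAST = [
    "orf_1\tref_A\t92.0\t150\t12\t0\t1\t150\t1\t150\t1e-60\t210.0",
    "orf_1\tref_B\t60.0\t140\t56\t2\t1\t140\t5\t144\t1e-20\t95.0",
    "orf_2\tref_C\t75.0\t200\t50\t1\t1\t200\t1\t200\t1e-45\t160.0",
    "orf_3\tref_D\t35.0\t80\t52\t3\t1\t80\t1\t80\t0.5\t30.0",
]
EXAMPLE_MAP = {"ref_A": ["pathway_X"], "ref_C": ["pathway_X", "pathway_Y"]}

hits = best_hits(EXAMPLE_BLAST)
projection = project_onto_pathways(hits, EXAMPLE_MAP)
```

In this toy run, orf_3 is dropped by the E-value cutoff and orf_1 keeps only its stronger hit, so the projection groups orf_1 and orf_2 under pathway_X and orf_2 alone under pathway_Y. Real projections onto KEGG, SEED, or MetaCyc involve curated mapping resources and additional inference rules that this sketch does not attempt to model.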
With the expansion of next generation sequencing
technologies, increasingly complex datasets are being
generated for thousands of environmental samples
resulting in analytic bottlenecks with the potential to
stymie pathway reconstruction efforts. As a result,
on-line services for metabolic reconstruction have been
developed to externalize data processing burdens and
provide warehousing and visualization tools for
environmental sequence information. Popular on-line services
for metabolic reconstruction include Integrated
Microbial Genomes and Metagenomes (IMG/M), Community
Cyberinfrastructure for Advanced Microbial Ecology
Research and Analysis (CAMERA), and Metagenome Rapid
Annotation using Subsystem Technology (MG-RAST).
Both IMG/M [12,13] and CAMERA [14] warehouse
public datasets and provide management, exploration,
and visualization tools for environmental sequence
information. MG-RAST [15,16] warehouses public
datasets and provides gene prediction and annotation
services based on SEED subsystems mapping using
FIGfams [17] and BLAST. While on-line services
increase access to computational resources, idiosyncratic
data processing and management practices common
to each service insulate users from command-line
optimization and create formatting and data transfer
restrictions.
Pathway Tools [18,19] is a production-quality
software system that enables construction, management
and navigation of symbolic representations of
metabolism in the form of Pathway/Genome databases
(PGDBs). A PGDB encodes contemporary knowledge
about the network properties of a cellular organism.
Pathway Tools supports four modular operations
including metabolic pathway prediction using
PathoLogic [18,20], metabolic flux modeling using MetaFlux
[21], PGDB editing and navigation tools including
manual or automated search functions, and
comparative analysis and systems level visualizations. Further,
genes, reactions, and pathways can be exported via
the Systems Biology Markup Language (SBML)
framework, allowing interoperability and downstream
analysis with compatible systems biology tools [22]. The
PathoLogic module allows users to construct new
PGDBs from an annotated genome using MetaCyc
[23,24], a highly curated, non-redundant and
experimentally validated database of metabolic pathways
representing all domains of life. Unlike KEGG pathways
or SEED subsystems, MetaCyc emphasizes smaller,
evolutionarily conserved units of metabolism or pathway
variants that are regulated and transferred together.
MetaCyc is also extensively commented with pathway
descriptions, literature citations, and enzyme properties
including subunit composition, substrate specificity,
cofactors, activators, and inhibitors, each connected to
specific pathway variants. A web-server version of the
Pathway Tools editing and navigation tools supports
on-line browsing, manual curation, and web publishing
of PGDBs. Currently, PGDBs for 2037 cellular
organisms have been constructed and incorporated into the
BioCyc collection [25].
Here we extend the PGDB concept for cellular
organisms to microbial community structure and function
through the introduction of MetaPathways, a modular
pipeline for pathway inference that uses the PathoLogic
algorithm to build environmental PGDBs (ePGDBs)
compatible with the editing and navigation features
of Pathway Tools. The pipeline accepts assembled contig
or unassembled nucleotide sequences, performs quality
control and coverage estimates, predicts and annotates
noncoding genes and open reading frames, and
produce (...truncated)