MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information

BMC Bioinformatics, Jun 2013

Background: A central challenge to understanding the ecological and biogeochemical roles of microorganisms in natural and human engineered ecosystems is the reconstruction of metabolic interaction networks from environmental sequence information. The dominant paradigm in metabolic reconstruction is to assign functional annotations using BLAST. Functional annotations are then projected onto symbolic representations of metabolism in the form of KEGG pathways or SEED subsystems.

Results: Here we present MetaPathways, an open source pipeline for pathway inference that uses the PathoLogic algorithm to map functional annotations onto the MetaCyc collection of reactions and pathways, and construct environmental Pathway/Genome Databases (ePGDBs) compatible with the editing and navigation features of Pathway Tools. The pipeline accepts assembled or unassembled nucleotide sequences, performs quality assessment and control, predicts and annotates noncoding genes and open reading frames, and produces inputs to PathoLogic. In addition to constructing ePGDBs, MetaPathways uses MLTreeMap to build phylogenetic trees for selected taxonomic anchor and functional gene markers, converts General Feature Format (GFF) files into concatenated GenBank files for ePGDB construction based on third-party annotations, and generates useful file formats including Sequin files for direct GenBank submission and gene feature tables summarizing annotations, MLTreeMap trees, and ePGDB pathway coverage summaries for statistical comparisons.

Conclusions: MetaPathways provides users with a modular annotation and analysis pipeline for predicting metabolic interaction networks from environmental sequence information using an alternative to KEGG pathways and SEED subsystems mapping. It is extensible to genomic and transcriptomic datasets from a wide range of sequencing platforms, and generates useful data products for microbial community structure and function analysis.
The MetaPathways software package, installation instructions, and example data can be obtained from http://hallam.microbiology.ubc.ca/MetaPathways.

Kishori M Konwar (1), Niels W Hanson (0), Antoine P Pagé (1), Steven J Hallam (0,1)

(0) Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, Canada
(1) Department of Microbiology & Immunology, University of British Columbia, Vancouver, BC V6T1Z3, Canada

Keywords: Environmental Pathway/Genome Database (ePGDB); Metagenome; Pathway Tools; PathoLogic; MetaCyc; Microbial community; Metabolism; Metabolic interaction networks

Background

Metabolic interactions between microorganisms direct matter and energy transformations integral to ecosystem function [1-3]. Plurality sequencing methods enable exploration of potential (metagenomic) and expressed (metatranscriptomic) metabolic interactions with the aid of computational methods that assemble or cluster contiguous reads, search for patterns or motifs representing genes, and reconstruct pathways from environmental sequence information [4-6]. The prevailing paradigm in pathway reconstruction is to assign functional annotation based on sequence homology using BLAST [7]. Functional annotations are then projected onto symbolic representations of metabolism such as KEGG pathways [8-10] or SEED subsystems [11], revealing network structure.

With the expansion of next-generation sequencing technologies, increasingly complex datasets are being generated for thousands of environmental samples, resulting in analytic bottlenecks with the potential to stymie pathway reconstruction efforts. As a result, online services for metabolic reconstruction have been developed to externalize data processing burdens and provide warehousing and visualization tools for environmental sequence information.
Popular online services for metabolic reconstruction include Integrated Microbial Genomes and Metagenomes (IMG/M), Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA), and Metagenome Rapid Annotation using Subsystem Technology (MG-RAST). Both IMG/M [12,13] and CAMERA [14] warehouse public datasets and provide management, exploration, and visualization tools for environmental sequence information. MG-RAST [15,16] warehouses public datasets and provides gene prediction and annotation services based on SEED subsystems mapping using FIGfams [17] and BLAST. While online services increase access to computational resources, idiosyncratic data processing and management practices common to each service insulate users from command-line optimization and create formatting and data transfer restrictions.

Pathway Tools [18,19] is a production-quality software system that enables construction, management and navigation of symbolic representations of metabolism in the form of Pathway/Genome databases (PGDBs). A PGDB encodes contemporary knowledge about the network properties of a cellular organism. Pathway Tools supports four modular operations including metabolic pathway prediction using PathoLogic [18,20], metabolic flux modeling using MetaFlux [21], PGDB editing and navigation tools including manual or automated search functions, and comparative analysis and systems level visualizations. Further, genes, reactions, and pathways can be exported via the Systems Biology Markup Language (SBML) framework, allowing interoperability and downstream analysis with compatible systems biology tools [22]. The PathoLogic module allows users to construct new PGDBs from an annotated genome using MetaCyc [23,24], a highly curated, non-redundant and experimentally validated database of metabolic pathways representing all domains of life.
Unlike KEGG pathways or SEED subsystems, MetaCyc emphasizes smaller, evolutionarily conserved units of metabolism or pathway variants that are regulated and transferred together. MetaCyc is also extensively commented with pathway descriptions, literature citations, and enzyme properties including subunit composition, substrate specificity, cofactors, activators, and inhibitors, each connected to specific pathway variants. A web-server version of the Pathway Tools editing and navigation tools supports online browsing, manual curation and web publishing of PGDBs. Currently PGDBs for 2037 cellular organisms have been constructed and incorporated into the BioCyc collection [25].

Here we extend the PGDB concept for cellular organisms to microbial community structure and function through the introduction of MetaPathways, a modular pipeline for pathway inference that uses the PathoLogic algorithm to build environmental PGDBs (ePGDBs) compatible with the editing and navigation features of Pathway Tools. The pipeline accepts assembled contig or unassembled nucleotide sequences, performs quality control and coverage estimates, predicts and annotates noncoding genes and open reading frames, and produce (...truncated)
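
The modular design described above (quality control, then gene prediction, with later steps consuming earlier outputs) can be pictured as a chain of interchangeable stage functions. The Python sketch below is illustrative only and is not MetaPathways code: the `Sample` container, the stage names, the minimum-length filter, and the naive forward-strand ORF scan are all assumptions made for this example, and a real pipeline would rely on dedicated tools for each step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Sample:
    """Hypothetical container passed from stage to stage."""
    sequences: Dict[str, str]                       # sequence name -> nucleotides
    orfs: Dict[str, List[str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


# A stage is any function that takes a Sample and returns a Sample.
Stage = Callable[[Sample], Sample]

STOP_CODONS = {"TAA", "TAG", "TGA"}


def quality_control(min_length: int) -> Stage:
    """Build a stage that upper-cases sequences and drops short ones (assumed rule)."""
    def stage(sample: Sample) -> Sample:
        kept = {name: seq.upper()
                for name, seq in sample.sequences.items()
                if len(seq) >= min_length}
        sample.log.append(
            f"quality_control: kept {len(kept)} of {len(sample.sequences)} sequences")
        sample.sequences = kept
        return sample
    return stage


def naive_orfs(seq: str, min_codons: int) -> List[str]:
    """Toy ORF scan: ATG-to-stop on the forward strand, three frames.

    Length is counted in codons with the stop codon included.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append(seq[start:i + 3])
                start = None
    return found


def predict_orfs(min_codons: int) -> Stage:
    """Build a stage that records candidate ORFs for every sequence."""
    def stage(sample: Sample) -> Sample:
        sample.orfs = {name: naive_orfs(seq, min_codons)
                       for name, seq in sample.sequences.items()}
        total = sum(len(orfs) for orfs in sample.orfs.values())
        sample.log.append(f"predict_orfs: found {total} candidate ORFs")
        return sample
    return stage


def run_pipeline(sample: Sample, stages: List[Stage]) -> Sample:
    """Apply the stages in order; each stage can be swapped or omitted."""
    for stage in stages:
        sample = stage(sample)
    return sample


if __name__ == "__main__":
    reads = Sample(sequences={"read1": "ccatgaaattttaggg", "read2": "ACGT"})
    result = run_pipeline(reads, [quality_control(10), predict_orfs(3)])
    for entry in result.log:
        print(entry)
```

Because every stage shares one signature, a stage can be replaced (for example, by a wrapper around an external gene finder) without touching the rest of the chain, which is the property the word "modular" refers to here.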


This is a preview of a remote PDF: http://www.biomedcentral.com/content/pdf/1471-2105-14-202.pdf
Article home page: http://www.biomedcentral.com/1471-2105/14/202

Kishori M Konwar, Niels W Hanson, Antoine P Pagé, Steven J Hallam. MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC Bioinformatics, 2013, 14:202. DOI: 10.1186/1471-2105-14-202