The Bio-Community Perl toolkit for microbial ecology
Florent E. Angly (1), Christopher J. Fields (2) and Gene W. Tyson (1)
(1) Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, Level 5, Molecular Biosciences Building (76), The University of Queensland, Brisbane St Lucia, QLD 4072, Australia
(2) HPCBio, Carver Biotechnology Center, Institute for Genomic Biology, 1206 West Gregory Drive, MC-195, Urbana, IL 61801, USA
Associate Editor: John Hancock
Summary: The development of bioinformatic solutions for microbial ecology in Perl is limited by the lack of modules to represent and manipulate microbial community profiles from amplicon and meta-omics studies. Here we introduce Bio-Community, an open-source, collaborative toolkit that extends BioPerl. Bio-Community interfaces with commonly used programs using various file formats, including BIOM, and provides operations such as rarefaction and taxonomic summaries. Bio-Community will help bioinformaticians to quickly piece together custom analysis pipelines and develop novel software.
Availability and implementation: Bio-Community is cross-platform Perl code available from http://search.cpan.org/dist/Bio-Community under the Perl license. A readme file describes software installation and how to contribute.
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
1 INTRODUCTION
Sequencing is common in most fields of biological research, and
the throughput of modern platforms is orders of magnitude
higher than that of traditional Sanger sequencing (Metzker, 2010). The
BioPerl bioinformatic toolkit (Stajich et al., 2002) has attracted a
large community of users and developers and has become critical
in many sequencing projects by allowing quick code development
and interaction between programs using incompatible file
formats. In microbial ecology, sequencing is used routinely for 16S
rRNA gene amplicon surveys (Tringe and Hugenholtz, 2008),
metagenomics (Handelsman, 2004) and metatranscriptomics
(Frias-Lopez et al., 2008). Because most microorganisms remain
uncultivated (Rappé and Giovannoni, 2003), culture-independent
molecular surveys are essential for the characterization of
environmental microbial communities. However, they require large
computational resources, novel bioinformatic tools and elaborate
pipelines. Many tools have been developed to analyze the resulting
sequence data. For example, libraries written in Python (Knight
et al., 2007) and R (Dixon, 2003; Kembel et al., 2010) provide
building blocks for bioinformatic software. QIIME (Caporaso
et al., 2010) and mothur (Schloss et al., 2009) are dedicated
packages with scripts to build complete analysis pipelines, but they use
incompatible file formats. Here, we introduce Bio-Community, a
set of format-agnostic modules and scripts to parse and
manipulate taxonomic or functional microbial community profiles.
*To whom correspondence should be addressed.
Object model
Bio-Community is a Perl object-oriented toolkit that extends
BioPerl. It is centered around the Community object, which
contains a group of entities from the same geographic area (Fig. 1).
These entities are Member objects, representing individual
genomes, genes, taxa or operational taxonomic units from amplicon
and meta-omic surveys. Member objects store attributes such as an
identifier, a taxon or a sequence and can be given weights to
account for the fact that there is no one-to-one relationship between a
sequencing read and a microbial cell. The relative abundance or
abundance rank of a Member can be calculated based on this
Member's count, weight and the total count in the Community
(Fig. 2). Similarly, absolute abundance is based on total microbial
abundance in the community, quantifiable by epifluorescence
microscopy, qPCR or flow cytometry (Rinsoz et al., 2008).
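The relation between counts, weights and the three abundance types can be sketched in a few lines. The following Python snippet is an illustrative sketch only and does not use the Bio-Community Perl API: the `Member` class and the function names are hypothetical, and it assumes that a weight acts as a divisor on the raw count (for example, a 16S rRNA gene copy number) and that relative abundance is expressed as a fraction.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical stand-in for a community member (not the Bio-Community API)."""
    ident: str
    count: float          # e.g. number of reads assigned to this member
    weight: float = 1.0   # assumed divisor, e.g. 16S rRNA gene copy number


def relative_abundances(members):
    """Weighted count of each member divided by the weighted total (as fractions)."""
    weighted = {m.ident: m.count / m.weight for m in members}
    total = sum(weighted.values())
    return {ident: w / total for ident, w in weighted.items()}


def abundance_ranks(rel_ab):
    """Rank 1 is the most abundant member; ties are broken by identifier."""
    ordered = sorted(rel_ab, key=lambda ident: (-rel_ab[ident], ident))
    return {ident: rank for rank, ident in enumerate(ordered, start=1)}


def absolute_abundances(rel_ab, total_abundance):
    """Scale fractions by a measured total (e.g. cells per ml from flow cytometry)."""
    return {ident: frac * total_abundance for ident, frac in rel_ab.items()}


members = [
    Member("otu_a", count=60, weight=4.0),
    Member("otu_b", count=30, weight=1.0),
    Member("otu_c", count=10, weight=2.0),
]
rel_ab = relative_abundances(members)
ranks = abundance_ranks(rel_ab)
abs_ab = absolute_abundances(rel_ab, total_abundance=1e6)
```

In this toy example, otu_a has the most reads but otu_b ranks first once the weights are applied, which is the kind of correction the weights are meant to provide.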
Diversity metrics
Bio-Community quantifies community α, β and γ diversity
(Whittaker, 1972) using a range of metrics [reviewed by
Magurran (2004)]. The diversity of a single Community
object, α diversity, is represented by metrics of richness, evenness,
dominance and indices (Supplementary Table S1). Several
Community objects can be grouped into a Meta object,
representing a metacommunity (Leibold et al., 2004). This object
provides methods to measure γ diversity, i.e. the collective diversity
of its communities, and β diversity, i.e. their dissimilarity. The
γ metrics are the same as those available for α diversity, whereas
those for β diversity include qualitative and quantitative forms
(Supplementary Table S1).
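To make the three levels of diversity concrete, the Python sketch below computes a few textbook metrics on plain count dictionaries. It is not the Bio-Community implementation. The choice of metrics (observed richness, Shannon index, Jaccard and Bray-Curtis dissimilarities) is an assumption made for illustration, as is the decision to measure the diversity of the whole metacommunity on pooled raw counts.

```python
import math


def richness(counts):
    """Observed richness: number of members with a non-zero count."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Shannon index (natural logarithm) of a single community."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def pooled(communities):
    """Merge several communities by summing counts (assumed pooling strategy)."""
    merged = {}
    for counts in communities:
        for ident, c in counts.items():
            merged[ident] = merged.get(ident, 0) + c
    return merged


def jaccard_dissimilarity(a, b):
    """Qualitative dissimilarity: based on presence/absence only."""
    present_a = {i for i, c in a.items() if c > 0}
    present_b = {i for i, c in b.items() if c > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


def bray_curtis(a, b):
    """Quantitative dissimilarity: based on the counts themselves."""
    idents = set(a) | set(b)
    shared = sum(min(a.get(i, 0), b.get(i, 0)) for i in idents)
    return 1 - 2 * shared / (sum(a.values()) + sum(b.values()))


site1 = {"otu_a": 50, "otu_b": 50}
site2 = {"otu_b": 50, "otu_c": 50}

single_richness = richness(site1)               # within one community
single_shannon = shannon(site1)
meta_richness = richness(pooled([site1, site2]))  # across the metacommunity
qualitative = jaccard_dissimilarity(site1, site2)
quantitative = bray_curtis(site1, site2)
```

The two dissimilarities differ for the same pair of sites (2/3 versus 0.5 here), which illustrates why qualitative and quantitative forms are both useful.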
Data input and output
Community profiles (e.g. a site-by-species table) describe the
distribution of members in biological samples. Operations to read
and write these files are handled by the IO module and are
important for exchanging data between programs using different
formats. We have implemented parsers for five common file
types (Supplementary Table S2), including the BIOM standard
(McDonald et al., 2012). Examples of these file types are given in
the t/data folder of the Bio-Community package. The parsers
automatically detect the file format from its content using the
FormatGuesser module, and iteratively record member
identifier, taxonomy and abundance.
Fig. 1. Main objects, their attributes and operation modules
Fig. 2. Relation between abundance types. Relative abundance depends
on member counts and weights, whereas absolute abundance is further
derived from a total abundance measure
Fig. 3. Vignette illustrating the use of Bio-Community to read a BIOM
community profile and report member information
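The idea of content-based format detection followed by iterative parsing can be sketched as follows. This Python example does not reproduce the FormatGuesser module or the vignette of Fig. 3. It assumes a sparse BIOM 1.0 (JSON) document, uses a deliberately crude guessing heuristic (JSON versus tab-delimited text), and all function names are hypothetical.

```python
import json


def guess_format(text):
    """Crude content-based guess: BIOM 1.0 is JSON; other profiles are assumed tab-delimited."""
    stripped = text.lstrip()
    if stripped.startswith("{"):
        return "biom"
    first_line = stripped.split("\n", 1)[0]
    return "table" if "\t" in first_line else "unknown"


def iter_biom_members(text):
    """Yield (sample id, member id, taxonomy, count) from a sparse BIOM 1.0 document."""
    doc = json.loads(text)
    if doc.get("matrix_type") != "sparse":
        raise ValueError("this sketch only handles sparse BIOM matrices")
    rows, cols = doc["rows"], doc["columns"]
    # Sparse data are stored as [row index, column index, value] triplets
    for row_idx, col_idx, value in doc["data"]:
        row = rows[row_idx]
        metadata = row.get("metadata") or {}
        taxonomy = metadata.get("taxonomy") or ""
        if isinstance(taxonomy, list):
            taxonomy = "; ".join(taxonomy)
        yield cols[col_idx]["id"], row["id"], taxonomy, value


# A tiny, made-up BIOM 1.0 profile with two members and two samples
BIOM_TEXT = json.dumps({
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "type": "OTU table",
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [2, 2],
    "rows": [
        {"id": "otu_a", "metadata": {"taxonomy": ["k__Bacteria", "p__Firmicutes"]}},
        {"id": "otu_b", "metadata": None},
    ],
    "columns": [{"id": "soil", "metadata": None}, {"id": "lake", "metadata": None}],
    "data": [[0, 0, 12], [0, 1, 3], [1, 1, 7]],
})

detected = guess_format(BIOM_TEXT)
records = list(iter_biom_members(BIOM_TEXT))
for sample, member, taxonomy, count in records:
    print(sample, member, taxonomy or "(no taxonomy)", count, sep="\t")
```

A real guesser would need to distinguish between several tab-delimited formats as well, which this sketch does not attempt.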
Tool modules can perform operations such as community
transformation, rarefaction and taxonomic summaries (Fig. 1). Utility
scripts using these modules are available in Bio-Community
(Supplementary Table S3). They allow biologists to perform
specific operations on community profiles, but they do not form an
entire microbial analysis pipeline. These scripts can also be
regarded as examples of integration of Bio-Community into
bioinformatic scripts (Fig. 3). This integration can also leverage
external modules to rapidly develop powerful custom scripts,
e.g. Getopt::Euclid for handling command-line arguments,
BioPerl modules for reading sequences or running external
programs (e.g. BLAST) (Camacho et al., 2009) and Statistics::R for
using R libraries or visualization capabilities.
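Two of the operations mentioned above, rarefaction and taxonomic summaries, are simple enough to illustrate on plain data structures. The Python sketch below is not the algorithm used by the Bio-Community tool modules. It assumes integer counts, performs a single random subsampling without replacement to a fixed depth, and summarizes by truncating each lineage to a given number of ranks. All names are hypothetical.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=None):
    """Randomly subsample `depth` reads without replacement from integer counts."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"cannot draw {depth} reads from a community of {total}")
    # Expand the table into one entry per read, then draw without replacement
    pool = [ident for ident, c in counts.items() for _ in range(c)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


def summarize_taxonomy(counts, taxonomy, level):
    """Collapse member counts onto their lineage truncated to `level` ranks."""
    summary = Counter()
    for ident, c in counts.items():
        lineage = taxonomy.get(ident, [])
        key = "; ".join(lineage[:level]) if lineage else "Unclassified"
        summary[key] += c
    return dict(summary)


counts = {"otu_a": 60, "otu_b": 30, "otu_c": 10}
taxonomy = {
    "otu_a": ["Bacteria", "Firmicutes"],
    "otu_b": ["Bacteria", "Proteobacteria"],
    "otu_c": ["Bacteria", "Firmicutes"],
}

rarefied = rarefy(counts, depth=50, seed=1)
phyla = summarize_taxonomy(counts, taxonomy, level=2)
```

Expanding the count table into a pool of reads keeps the sketch short, but it uses memory proportional to the sequencing depth. Large communities would call for a sampling scheme that works on the counts directly.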
CONCLUSIONS
Bio-Community supports several file formats to interface with
popular programs and will help bioinformaticians quickly
construct custom analysis pipelines or novel software for
microbial ecology. The integration of relative and absolute abundance
with diversity metrics permits holistic microbial studies (Dinsdale
et al., 2008; Dove e (...truncated)