Alignment behaviors of short peptides provide a roadmap for functional profiling of metagenomic data
Sinha et al. BMC Genomics (2015) 16:1080
DOI 10.1186/s12864-015-2272-z
RESEARCH ARTICLE
Open Access
Rohita Sinha1, Jennifer Clarke1,2,3 and Andrew K. Benson1*
Abstract
Background: Functional assignments for short-read metagenomic data pose a significant computational challenge
due to the perceived unpredictability of alignment behavior and the inability to infer useful functional information from
translated protein fragments (peptides). To address this problem, we have examined the predictability of short
peptide alignments by systematically studying alignment behavior of large sets of short peptides generated from
well-characterized proteins as well as hypothetical proteins in the KEGG database.
Results: Using test sets of peptides modeling the length and phylogenetic distributions of short-read metagenomic
data, we observed that peptides from well-characterized proteins had indistinguishable alignments to proteins from the
same orthologous family and proteins from different families. Nonetheless, the patterns contained remarkable
phylogenetic and structural signals, with alignments of even very short peptides naturally restricted to their orthologous
family and/or proteins having similar structural folds. In stark contrast, peptides from “hypothetical proteins” had only
sparse hit patterns with low frequencies and much lower identities. By weighting the structure-driven alignments and
filtering peptides with behaviors similar to those derived from “hypothetical proteins”, we demonstrate that the accuracy
of abundance predictions of protein families is dramatically improved.
Conclusions: Evolutionary processes have dispersed protein folds across multiple protein families, precluding accurate
functional assignment to short peptides, whose alignment behavior is non-random and driven by structure. Algorithms
that filter sparse peptides and weight hit patterns of peptides from “known space” dramatically improve quantification of
functions from diverse mixtures of peptides and should substantially improve applications of metagenomic analyses
requiring accurate quantitative measures of functional families.
Background
Faster and more economical next-generation DNA sequencing
(NGS) technologies have enabled studies of complex microbial communities whose true microbial diversity was experimentally intractable only a
decade ago [1–6]. Economy of scale and the availability
of streamlined data processing pipelines have led the
majority of studies to estimate taxonomic and phylogenetic content from 16S ribosomal RNA sequencing
and to infer functional content from reference genomes of corresponding or related taxa. On the other
hand, whole shotgun sequencing of metagenomic DNA
arguably provides a more robust and unbiased measurement of the taxonomic and functional content of a
microbiome [7, 8], but its use has been limited by
the need for greater sequencing depth (and thus higher cost)
and by significant computational challenges. The latter are
particularly acute in non-human systems,
where genomic catalogues and reference genomes of
representative species are not readily available. As sequencing costs continue to decline, the primary barrier
to broad application of whole shotgun metagenome sequencing is largely computational.
* Correspondence:
1 Department of Food Science and Technology, University of Nebraska, 256
Food Innovation Complex, Lincoln, NE 68588-6205, USA
Full list of author information is available at the end of the article
© 2015 Sinha et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
In silico functional annotation of proteins exploits
their evolutionary relationships with experimentally
characterized proteins and uses empirically-defined
thresholds of global sequence identity (e.g. > 40 %) to assign proteins to the same Enzyme Commission number
(function) [9]. In the absence of such relationships,
methods like I-TASSER [10] and COFACTOR [11] collectively annotate some protein sequences by predicting
and comparing their structures with global and local
structural features of well-characterized reference proteins. These powerful techniques, however, have been
developed exclusively for full-length molecules, and use
of similar approaches for peptides predicted from short-read metagenomic data has generally been avoided due
to the belief that such peptides lack enough evolutionary
or structural information to accurately identify the
orthologous genes from which they originate. These
concerns are underscored by the fact that protein domains are redundantly used to perform diverse biochemical activities [12, 13], leading to the expectation that
short peptides will simply align to all the proteins carrying their “domains of origin”, resulting in a confounded
pattern of functional predictions based on a variety of
reference proteins carrying that domain [14, 15].
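The identity-threshold rule mentioned at the start of this paragraph can be illustrated with a minimal Python sketch. It assumes the two sequences have already been globally aligned by some external tool and are supplied as equal-length rows with '-' marking gaps. The function names, the choice of denominator (all alignment columns), and the example EC label are illustrative assumptions rather than anything specified in the text; only the 40 % cutoff is taken from the example above.

```python
from typing import Optional


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns of a pairwise global alignment.

    Both inputs are rows of the same alignment ('-' marks a gap).
    Assumption: identity = identical residue columns / all alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    # Count columns where both rows carry the same residue (gaps never match).
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


def transfer_annotation(
    aligned_query: str,
    aligned_ref: str,
    ref_label: str,
    min_identity: float = 40.0,  # the "> 40 %" example threshold from the text
) -> Optional[str]:
    """Return the reference label if identity exceeds the cutoff, else None."""
    if percent_identity(aligned_query, aligned_ref) > min_identity:
        return ref_label
    return None
```

A rule of this kind presupposes a full-length global alignment; a short peptide covers only a small part of the reference, which is one reason such thresholds do not carry over directly to short-read data.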
The three prominent resources for metagenomic data
processing (MEGAN [16], MG-RAST [17] and HUMAnN
[18]) all work similarly, aligning translated peptides from
the short reads of NGS platforms to databases of well-annotated reference proteins and using single sets of
sequence similarity measures (SSMs) for functional
prediction. The effectiveness of individual sets of SSMs
used by these protocols was recently questioned by the
findings of the PAUDA study [19], in which high variances in
the identity profiles of alignment hits were observed even
within the same KEGG-orthology group (KO) [20]. These
observations resulted in concerns of significant sensitivity
losses in assigning KO-families to short NGS reads on the
basis of individual sets of SSMs. Moreover, recent publications using these metagenomic data processing methods
also demonstrate the absence of any consensus among the
community of users regarding the individual significance
thresholds or sets of SSM elements that can accurately
discriminate between true and false-positive function assignments [21–26].
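To make the idea of a single set of SSMs concrete, the following Python sketch applies one fixed combination of thresholds to a list of alignment hits and counts KO assignments per read. It is a simplified illustration, not the procedure of MEGAN, MG-RAST or HUMAnN: the `Hit` fields, the threshold values, and the best-hit-by-bit-score rule are all assumptions chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    """One peptide-to-reference alignment (hypothetical record layout)."""
    read_id: str
    ko: str           # KO family of the reference protein
    identity: float   # percent identity of the alignment
    bit_score: float
    evalue: float


def passes(hit: Hit, min_identity: float = 50.0,
           min_bit_score: float = 40.0, max_evalue: float = 1e-5) -> bool:
    """Apply a single, fixed set of similarity thresholds (illustrative values)."""
    return (hit.identity >= min_identity
            and hit.bit_score >= min_bit_score
            and hit.evalue <= max_evalue)


def assign_kos(hits: Iterable[Hit], **thresholds: float) -> Counter:
    """Keep the best-scoring passing hit per read and count reads per KO."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if not passes(hit, **thresholds):
            continue
        current = best.get(hit.read_id)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.read_id] = hit
    return Counter(h.ko for h in best.values())


# Made-up hits for illustration only.
example_hits = [
    Hit("r1", "K00001", 92.0, 60.0, 1e-10),
    Hit("r1", "K00002", 55.0, 45.0, 1e-6),
    Hit("r2", "K00001", 45.0, 38.0, 1e-3),
    Hit("r3", "K00003", 70.0, 52.0, 1e-8),
]
counts = assign_kos(example_hits)
```

In this toy input, read `r2` is discarded because its only hit falls below every cutoff, even though it aligns to a KO family that another read supports. When identities vary widely within a KO family, a single cutoff of this kind can drop reads in exactly this way, which is the sensitivity concern raised above.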
Given the dearth of empirically-derived data on the
alignment behavior of peptides that could even be used
to model thresholds for SSMs, we were motivated to systematically study the actual alignment behavior of short
protein fragments. Using random peptides extracted
from KO-family members (the “known” protein universe) and hypothetical uncharacterized prot (...truncated)