Alignment behaviors of short peptides provide a roadmap for functional profiling of metagenomic data

Sinha et al. BMC Genomics (2015) 16:1080
DOI 10.1186/s12864-015-2272-z

RESEARCH ARTICLE (Open Access)

Rohita Sinha1, Jennifer Clarke1,2,3 and Andrew K. Benson1*

Abstract

Background: Functional assignments for short-read metagenomic data pose a significant computational challenge due to perceived unpredictability of alignment behavior and the inability to infer useful functional information from translated protein-fragments/peptides. To address this problem, we have examined the predictability of short peptide alignments by systematically studying alignment behavior of large sets of short peptides generated from well-characterized proteins as well as hypothetical proteins in the KEGG database.

Results: Using test sets of peptides modeling the length and phylogenetic distributions of short-read metagenomic data, we observed that peptides from well-characterized proteins had indistinguishable alignments to proteins from the same orthologous family and proteins from different families. Nonetheless, the patterns contained remarkable phylogenetic and structural signals, with alignments of even very short peptides naturally restricted to their orthologous family and/or proteins having similar structural folds. In stark contrast, peptides from “hypothetical proteins” had only sparse hit patterns with low frequencies and much lower identities. By weighting the structure-driven alignments and filtering peptides with behaviors similar to those derived from “hypothetical proteins”, we demonstrate that the accuracy of abundance predictions of protein families is dramatically improved.

Conclusions: Evolutionary processes have dispersed protein folds across multiple protein families, precluding accurate functional assignment to short peptides, whose alignment behavior is non-random and driven by structure.
Algorithms that filter sparse peptides and weight hit patterns of peptides from “known space” dramatically improve quantification of functions from diverse mixtures of peptides and should substantially improve applications of metagenomic analyses requiring accurate quantitative measures of functional families.

* Correspondence: 1 Department of Food Science and Technology, University of Nebraska, 256 Food Innovation Complex, Lincoln, NE 68588-6205, USA. Full list of author information is available at the end of the article.

© 2015 Sinha et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Faster and more economical next-generation DNA sequencing (NGS) technologies have enabled studies of complex microbial communities which were experimentally intractable in terms of their true microbial diversities only a decade ago [1–6]. Economy of scale and the availability of streamlined data processing pipelines have led the majority of studies to estimate taxonomic and phylogenetic content from 16S ribosomal RNA sequencing and to infer functional content from reference genomes of corresponding or related taxa. On the other hand, whole shotgun sequencing of metagenomic DNA arguably provides a more robust and unbiased measurement of the taxonomic and functional content of a microbiome [7, 8], but its use has been limited due to the necessity of greater sequencing depth (higher cost) and significant computational challenges. The latter are particularly acute in non-human systems, where genomic catalogues and reference genomes of representative species are not readily available. As sequencing costs continue to decline, the primary barrier to broad application of whole shotgun metagenome sequencing is largely computational.

In silico functional annotation of proteins exploits their evolutionary relationships with experimentally characterized proteins and uses empirically-defined thresholds of global sequence identity (e.g. > 40 %) to assign proteins to the same Enzyme Commission number (function) [9]. In the absence of such relationships, methods like I-TASSER [10] and COFACTOR [11] collectively annotate some protein sequences by predicting and comparing their structures with global and local structural features of well-characterized reference proteins. These powerful techniques, however, have been developed exclusively for full-length molecules, and use of similar approaches for peptides predicted from short-read metagenomic data has generally been avoided due to the belief that such peptides lack enough evolutionary or structural information to accurately identify the orthologous genes from which they originate. These concerns are underscored by the fact that protein domains are redundantly used to perform diverse biochemical activities [12, 13], leading to the expectation that short peptides will simply align to all the proteins carrying their “domains of origin”, resulting in a confounded pattern of functional predictions based on a variety of reference proteins carrying that domain [14, 15].

The three prominent resources for metagenomic data processing (MEGAN [16], MG-RAST [17] and HUMAnN [18]) all work similarly, aligning translated peptides from the short reads of NGS platforms to databases of well-annotated reference proteins and using single sets of sequence similarity measures (SSMs) for functional prediction.
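
To make the idea of a "single set of SSMs" concrete, the following is a minimal sketch of how one fixed threshold set can be applied to peptide alignment hits before assigning each peptide to a protein family. It is not the implementation used by MEGAN, MG-RAST or HUMAnN. The `Hit` record, the function names, the family labels and the default thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    """One peptide-to-reference alignment (illustrative fields, not a tool's format)."""
    peptide_id: str
    ko_family: str           # family label of the reference protein (hypothetical)
    percent_identity: float
    evalue: float
    bit_score: float


def assign_by_single_ssm_set(
    hits: Iterable[Hit],
    min_identity: float = 40.0,   # assumed threshold, for illustration only
    max_evalue: float = 1e-5,     # assumed threshold, for illustration only
) -> Dict[str, str]:
    """Keep hits that pass one fixed threshold set, then give each peptide
    the family of its highest-scoring surviving hit."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.percent_identity < min_identity or hit.evalue > max_evalue:
            continue  # hit fails the single SSM set and is discarded
        current = best.get(hit.peptide_id)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.peptide_id] = hit
    return {pid: h.ko_family for pid, h in best.items()}


def family_counts(assignments: Dict[str, str]) -> Dict[str, int]:
    """Count how many peptides were assigned to each family."""
    counts: Dict[str, int] = {}
    for family in assignments.values():
        counts[family] = counts.get(family, 0) + 1
    return counts


if __name__ == "__main__":
    example_hits = [
        Hit("p1", "KO_A", 85.0, 1e-10, 60.0),
        Hit("p1", "KO_B", 90.0, 1e-8, 55.0),
        Hit("p2", "KO_B", 35.0, 1e-6, 40.0),
        Hit("p3", "KO_C", 50.0, 1e-3, 30.0),
    ]
    assignments = assign_by_single_ssm_set(example_hits)
    print(assignments)
    print(family_counts(assignments))
```

In this sketch a peptide whose hits all fall below the fixed thresholds is silently dropped, and a peptide with hits in several families is forced into one of them, which is the kind of behavior a single threshold set produces.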
The effectiveness of individual sets of SSMs used by these protocols was recently questioned by the findings of the PAUDA study [19], where high variances in the identity profiles of alignment hits were observed even within the same KEGG-orthology group (KO) [20]. These observations raised concerns about significant sensitivity losses in assigning KO-families to short NGS reads on the basis of individual sets of SSMs. Moreover, recent publications using these metagenomic data processing methods also demonstrate the absence of any consensus among the community of users regarding individual significance thresholds or sets of SSM elements that can accurately discriminate between true and false-positive function assignments [21–26].

Given the dearth of empirically-derived data on the alignment behavior of peptides that could even be used to model thresholds for SSMs, we were motivated to systematically study the actual alignment behavior of short protein fragments. Using random peptides extracted from KO-family members (the “known” protein universe) and hypothetical uncharacterized prot (...truncated)