A comprehensive software suite for protein family construction and functional site prediction

PLOS ONE, Feb 2017

In functionally diverse protein families, conservation in short signature regions may outperform full-length sequence comparisons for identifying proteins that belong to a subgroup within which one specific aspect of their function is conserved. The SIMBAL workflow (Sites Inferred by Metabolic Background Assertion Labeling) is a data-mining procedure for finding such signature regions. It begins by using clues from genomic context, such as co-occurrence or conserved gene neighborhoods, to build a useful training set from a large number of uncharacterized but mutually homologous proteins. When training set construction is successful, the YES partition is enriched in proteins that share function with the user’s query sequence, while the NO partition is depleted. A selected query sequence is then mined for short signature regions whose closest matches overwhelmingly favor proteins from the YES partition. High-scoring signature regions typically contain key residues critical to functional specificity, so proteins with the highest sequence similarity across these regions tend to share the same function. The SIMBAL algorithm was described previously, but significant manual effort, expertise, and a supporting software infrastructure were required to prepare the requisite training sets. Here, we describe a new, distributable software suite that speeds up and simplifies the process for using SIMBAL, most notably by providing tools that automate training set construction. These tools have broad utility for comparative genomics, allowing for flexible collection of proteins or protein domains based on genomic context as well as homology, a capability that can greatly assist in protein family construction. Armed with this new software suite, SIMBAL can serve as a fast and powerful in silico alternative to direct experimentation for characterizing proteins and their functional interactions.

RESEARCH ARTICLE

David Renfrew Haft1, Daniel H. Haft2*

1 J. Craig Venter Institute, Rockville, Maryland, United States of America
2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America

OPEN ACCESS

Citation: Haft DR, Haft DH (2017) A comprehensive software suite for protein family construction and functional site prediction. PLoS ONE 12(2): e0171758. doi:10.1371/journal.pone.0171758

Editor: Olivier Lespinet, Université Paris-Sud, France

Received: August 29, 2016
Accepted: January 25, 2017
Published: February 9, 2017

Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data availability statement: We describe the use of a collection of prokaryotic genomes to be downloaded from NCBI, and provide a utility to aid in the download. We provide all software through GitHub. We provide demonstration files that represent output from each step of the analysis demonstrated in the manuscript.

Funding: This work was supported by the National Science Foundation under Grant No. 1458808 to the J. Craig Venter Institute and by the Intramural Research Program of the NIH, National Library of Medicine. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Data-mining methods can very efficiently generate hypotheses that certain protein families work together to carry out some biological process. Historically, analysis then often gets “stuck” waiting for experimental testing that may not be forthcoming. SIMBAL [1] allows for follow-up investigation in silico once correlations have been noted between pairs of protein families. By showing which features in a protein appear to matter most, SIMBAL can deepen our understanding of how molecular function links one family of proteins to another. Given a training set in which homologs from a functionally diverse protein superfamily have been labeled either YES or NO, according to features in the genomes from which they were taken, SIMBAL can detect short signature regions for which the best BLAST matches skew overwhelmingly toward the YES set. If a solved crystal structure exists for the query protein or one of its homologs, mapping the signature region identified by SIMBAL onto the crystal structure can shine a light on the underlying biology.

In any newly sequenced genome, the functions of many proteins are unknown. Characterized or not, most proteins belong to some subsystem [2, 3]. In a subsystem, several components work together to carry out a biological process, such as biosynthesis of a cofactor, or import and utilization of a carbon source. HMM or BLAST searches readily find related proteins in different genomes, including homologs related closely enough to share a specific function. If such a functionally conserved protein is found in numerous species, other components of the subsystem(s) to which it belongs may be found in those species as well.
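To make the YES/NO labeling by genomic context concrete, here is a minimal sketch; it is not the suite's actual tooling. It assumes a hypothetical mapping from each homolog to its source genome and a hypothetical set of genomes already known to carry a marker family from the subsystem of interest. The function name, protein identifiers, and genome identifiers are all illustrative.

```python
from typing import Dict, Set


def label_homologs(homolog_genome: Dict[str, str],
                   genomes_with_marker: Set[str]) -> Dict[str, str]:
    """Label a homolog YES if its source genome also encodes the marker
    family (simple co-occurrence), otherwise NO."""
    return {
        protein: ("YES" if genome in genomes_with_marker else "NO")
        for protein, genome in homolog_genome.items()
    }


# Hypothetical toy input: three homologs from three genomes.
homolog_genome = {"protA": "genome1", "protB": "genome2", "protC": "genome3"}
# Hypothetical genomes in which the marker family was detected.
genomes_with_marker = {"genome1", "genome3"}

labels = label_homologs(homolog_genome, genomes_with_marker)
print(labels)
```

A real training set would likely use richer evidence than a single presence/absence test, for example conserved gene neighborhoods, but the partition it produces has the same shape: one label per homolog.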
This type of co-occurrence makes it possible for data-mining techniques such as Phylogenetic Profiling [4, 5], gene neighborhood analysis, operon detection, “Rosetta stone” gene fusion analysis, text mining, or several methods together, as in the STRING database [6], to identify sets of proteins that constitute previously undescribed subsystems, and that may carry out an undocumented biological process [7]. Unfortunately, there is a mismatch in speed, effort, and cost between generating the hypothesis that two families of proteins are connected through their roles in a subsystem, which takes just seconds using bioinformatics methods, and the obvious follow-up laboratory work, which might take years to set up and then complete. For one hypothesis at a time, the SIMBAL workflow lets an investigator use contextual clues from thousands of genomes to build a training set that will support further inquiry, and then use SIMBAL itself to search selected proteins for those short regions where highly similar amino acid sequences best reflect highly consistent genomic contexts. This purely in silico approach often yields confirmatory evidence for a functional connection between two proteins, plus new insights into functions and mechanisms, and may provide an attractive alternative to direct experimental assay. This paper presents numerous software components that transform SIMBAL from an algorithm whose setup and execution require specialized knowledge and significant expenditures of effort into (...truncated)
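The search for short regions whose closest matches favor the YES partition can be illustrated with a toy sliding-window scan. This is a sketch under stated assumptions, not the published SIMBAL implementation: ungapped residue identity stands in for BLAST, a hypergeometric tail probability stands in for whatever scoring statistic SIMBAL uses, and the window length, the number of top hits, and all sequences are invented for illustration.

```python
from math import comb
from typing import List, Tuple


def best_identity(window: str, target: str) -> int:
    """Most identical residues between `window` and any ungapped stretch of
    `target` with the same length (a crude stand-in for a BLAST score)."""
    w = len(window)
    if len(target) < w:
        return 0
    return max(
        sum(a == b for a, b in zip(window, target[i:i + w]))
        for i in range(len(target) - w + 1)
    )


def hypergeom_tail(yes_hits: int, draws: int, total_yes: int, total: int) -> float:
    """P(X >= yes_hits) when `draws` proteins are sampled without replacement
    from `total` proteins, `total_yes` of which are labeled YES."""
    upper = min(draws, total_yes)
    ways = sum(comb(total_yes, i) * comb(total - total_yes, draws - i)
               for i in range(yes_hits, upper + 1))
    return ways / comb(total, draws)


def scan_windows(query: str, training: List[Tuple[str, str]],
                 window: int, top_n: int) -> List[Tuple[int, int, float]]:
    """For every window of the query, rank training sequences by similarity
    and report (start, YES count among the top hits, tail probability)."""
    total_yes = sum(1 for _, label in training if label == "YES")
    results = []
    for start in range(len(query) - window + 1):
        segment = query[start:start + window]
        ranked = sorted(training,
                        key=lambda item: best_identity(segment, item[0]),
                        reverse=True)
        top = ranked[:top_n]
        yes_hits = sum(1 for _, label in top if label == "YES")
        p_value = hypergeom_tail(yes_hits, len(top), total_yes, len(training))
        results.append((start, yes_hits, p_value))
    return results


# Hypothetical toy data: YES sequences share a CWCH motif with the query,
# NO sequences share only the flanking residues.
query = "MKTLCWCHGAVD"
training = [
    ("MRSLCWCHGTVE", "YES"),
    ("QKTICWCHAAID", "YES"),
    ("MKTLSYSNGAVD", "NO"),
    ("MKSLTFTQGAVE", "NO"),
]

results = scan_windows(query, training, window=4, top_n=2)
for start, yes_hits, p_value in results:
    print(start, query[start:start + 4], yes_hits, round(p_value, 3))
```

In this toy data, the window starting at position 4 (CWCH) draws both of its top hits from the YES set, while the N-terminal window (MKTL) draws none. Windows with the smallest tail probability are the candidate signature regions; with a realistic training set of hundreds or thousands of proteins, the statistic has far more room to separate windows than it does here.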


This is a preview of a remote PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0171758&type=printable
Article home page: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0171758

David Renfrew Haft, Daniel H. Haft. A comprehensive software suite for protein family construction and functional site prediction, PLOS ONE, 2017, Volume 12, Issue 2, DOI: 10.1371/journal.pone.0171758