Protein aggregation, structural disorder and RNA-binding ability: a new approach for physico-chemical and gene ontology classification of multiple datasets


Klus et al. BMC Genomics (2015) 16:1071
DOI 10.1186/s12864-015-2280-z

SOFTWARE, Open Access

Petr Klus (1,2), Riccardo Delli Ponti (1,2), Carmen Maria Livi (1,2) and Gian Gaetano Tartaglia (1,2,3)*

Abstract

Background: Comparison between multiple protein datasets requires the choice of an appropriate reference system and a number of variables to describe their differences. Here we introduce an innovative approach to discriminate multiple protein datasets (multiCM) and to measure enrichments in gene ontology terms (cleverGO) using semantic similarities.

Results: We illustrate the power of our approach by investigating the links between RNA-binding ability and other protein features, such as structural disorder and aggregation, in S. cerevisiae, C. elegans, M. musculus and H. sapiens. Our results are in striking agreement with available experimental evidence and reveal features that are key to understanding the mechanisms regulating cellular homeostasis.

Conclusions: In an intuitive way, multiCM and cleverGO provide accurate classifications of physico-chemical features and annotations of biological processes, molecular functions and cellular components, which is useful for the discovery and characterization of new trends in protein datasets. The multiCM and cleverGO can be freely accessed on the Web at http://www.tartaglialab.com/cs_multi/submission and http://www.tartaglialab.com/GO_analyser/universal. Each of the pages contains links to the corresponding documentation and tutorial.

Keywords: Protein classification, Physico-chemical properties, Gene ontology, Solubility, RNA-binding ability

Background

There is a growing gap between the amount of proteomic data and the availability of tools for their analysis [1].
While several application programming interfaces are available to analyse computational and experimental results [2], a simple and intuitive interface is currently lacking. Our goal is to start bridging this gap by providing algorithms for the analysis of protein sets and the discovery of mechanisms that regulate protein function and interactions.

The first method presented here, the multiCleverMachine (multiCM), is an extension of the cleverMachine approach (CM [3]) to classify multiple protein datasets using physico-chemical properties. The second algorithm, the cleverGO, is inspired by the need to simplify Gene Ontology (GO) annotation output. While GO statistics are important to characterize the functional role of proteins, their interpretation is difficult without further downstream processing [2, 4]. Current tools do not provide a single interface that combines GO term analysis with intuitive interpretation and visualization. For instance, GOrilla [5] calculates GO term enrichments, but other tools are needed to summarize the results (e.g. REVIGO [6]). cleverGO integrates multiple analyses in one platform and facilitates GO processing through an interactive analysis accessible via a web browser.

We demonstrate the usefulness of our methods by investigating the RNA-binding abilities of S. cerevisiae chaperones and their substrates, the physico-chemical determinants of protein insolubility in S. cerevisiae, M. musculus and H. sapiens, and the relationship between aggregation and longevity in C. elegans. The purpose of our analysis is twofold: to provide examples that can be used as a reference in other studies and to shed light on the link between nucleic-acid binding abilities and protein features, such as structural disorder and aggregation, that are increasingly recognized as key factors for cellular function and homeostasis [7–9].

* Correspondence: Gian Gaetano Tartaglia
1 Gene Function and Evolution, Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003 Barcelona, Spain
2 Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain
Full list of author information is available at the end of the article

© 2015 Klus et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Implementation

The multiCM accepts multiple protein sets in FASTA format. Individual sets are assigned to a positive or a negative class for binary comparison (the assignment is only needed to create two groups and does not influence the calculations). In each list, the CM screens physico-chemical properties encoded by protein sequences [3] to identify those that best discriminate the positive and negative classes. The currently supported physico-chemical properties are nucleic acid binding propensity, membrane propensity, alpha-helix propensity, aggregation propensity, beta-sheet propensity, burial propensity and hydrophobicity; custom properties can also be included, as explained in the online Tutorial. For a detailed description of CM performance, we refer to our previous publication [3].
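As a rough illustration of the input handling and binary comparison described above, the following Python sketch parses two FASTA sets, scores each protein with a single property scale, and measures how well that score separates the two sets. It is not the CM implementation: the function names (parse_fasta, mean_property, separation), the toy sequences, the choice of the Kyte-Doolittle hydropathy scale and the plain per-sequence average are all assumptions made here for illustration. The predictors actually used by the CM are described in ref. [3].

```python
from statistics import mean

# Kyte-Doolittle hydropathy values, used here only as an example of a
# hydrophobicity scale (assumption: the CM's own scales are given in ref. [3]).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Fall back to a positional name if the header is empty.
            current = fields[0] if fields else f"seq{len(records) + 1}"
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def mean_property(sequence, scale=KYTE_DOOLITTLE):
    """Average scale value over the residues present in the scale."""
    values = [scale[aa] for aa in sequence if aa in scale]
    return mean(values) if values else 0.0


def separation(positive, negative, scale=KYTE_DOOLITTLE):
    """Fraction of (positive, negative) pairs in which the positive protein
    scores higher (ties count one half). This equals the area under the
    ROC curve for the score; 0.5 means no discrimination."""
    pos = [mean_property(seq, scale) for seq in positive.values()]
    neg = [mean_property(seq, scale) for seq in negative.values()]
    if not pos or not neg:
        raise ValueError("both sets must contain at least one sequence")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy input: two tiny sets, not real proteins.
POSITIVE_FASTA = ">p1 toy hydrophobic\nAILV\n>p2\nLLVV\n"
NEGATIVE_FASTA = ">n1 toy charged\nDEKR\n>n2\nNQDE\n"

positive_set = parse_fasta(POSITIVE_FASTA)
negative_set = parse_fasta(NEGATIVE_FASTA)
print(separation(positive_set, negative_set))  # prints 1.0
```

Swapping the two sets gives 1 minus the original value, which mirrors the statement that the positive/negative assignment only defines the two groups: the strength of the discrimination is unchanged, only its direction flips.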
In each multiCM run, the information from the individual models is compiled into a high-level overview: the user can see which trends are detected in the data using different physico-chemical features. The indicators collate 10 predictors for each selected feature and represent their consensus with a colour, akin to a microarray slide (Fig. 1a). The colour of each array spot represents the differential state of enrichment for a dataset pair and allows easy interpretation of increase, decrease or insufficient signal. The analysis is not restricted to the consensus information: a link to a full CM view is provided in the main panel (with details on p-values, cross-validation performance, ROC curves and other statistics). The detail view contains the ID number of the CM run, which can be used to create a cleverClassifier to study new datasets [3], as well as a link to perform Gene Ontology analysis using the second part of our toolkit, the cleverGO.

The cleverGO webserver provides two ways to explore data:

Fig. 1 RNA-binding abilities of S. cerevisiae chaperone substrates. (a) RNA-binding ability of yeast chaperone substrates is visualized in a microarray-like table. Hsp90 and Hsp40 are predicted to have the largest number of nucleic-acid binding partners (Positive set: vertical axis; Negative set: horizontal axis; Green: positive set is enriched with respect to negative set; Red: negative set is enriched with respect to positive set [3]; Yellow: non significant enrichmen (...truncated)