Protein aggregation, structural disorder and RNA-binding ability: a new approach for physico-chemical and gene ontology classification of multiple datasets
Klus et al. BMC Genomics (2015) 16:1071
DOI 10.1186/s12864-015-2280-z
SOFTWARE
Open Access
Petr Klus1,2, Riccardo Delli Ponti1,2, Carmen Maria Livi1,2 and Gian Gaetano Tartaglia1,2,3*
Abstract
Background: Comparison between multiple protein datasets requires the choice of an appropriate reference
system and a number of variables to describe their differences. Here we introduce an innovative approach to
discriminate multiple protein datasets (multiCM) and to measure enrichments in gene ontology terms (cleverGO)
using semantic similarities.
Results: We illustrate the power of our approach by investigating the links between RNA-binding ability and
other protein features, such as structural disorder and aggregation, in S. cerevisiae, C. elegans, M. musculus and
H. sapiens. Our results are in striking agreement with available experimental evidence and unravel features that are key
to understanding the mechanisms regulating cellular homeostasis.
Conclusions: In an intuitive way, multiCM and cleverGO provide accurate classifications of physico-chemical
features and annotations of biological processes, molecular functions and cellular components, which is extremely
useful for the discovery and characterization of new trends in protein datasets. The multiCM and cleverGO can be
freely accessed on the Web at http://www.tartaglialab.com/cs_multi/submission and
http://www.tartaglialab.com/GO_analyser/universal. Each of the pages contains links to the corresponding documentation and tutorial.
Keywords: Protein classification, Physico-chemical properties, Gene ontology, Solubility, RNA-binding ability
Background
There is a growing gap between the amount of proteomic
data and the availability of tools for their analysis [1]. While
several application programming interfaces are available
to analyse computational and experimental results [2], a
simple and intuitive interface is currently lacking.
Our goal is to start bridging this gap by providing algorithms for the analysis of protein sets and the discovery
of mechanisms that regulate protein function and
interactions.
The first method presented here, the multiCleverMachine (multiCM), is an extension of the cleverMachine
approach (CM [3]) to classify multiple protein datasets using physico-chemical properties. The second
algorithm, the cleverGO, is inspired by the need to simplify Gene Ontology (GO) annotation output. While GO
statistics are important to characterize the functional role of proteins, their interpretation is difficult
without further downstream processing [2, 4]. Current tools do not provide a single interface that combines
GO term analysis with intuitive interpretation and visualization. For instance, GOrilla [5] calculates GO term
enrichments, but other tools are needed to summarize the results (e.g. REVIGO [6]). cleverGO integrates
multiple analyses in one platform and facilitates GO processing through an interactive analysis accessible
via a web browser.
We demonstrate the usefulness of our methods by investigating the RNA-binding abilities of S. cerevisiae
chaperones and their substrates, the physico-chemical determinants of protein insolubility in S. cerevisiae,
M. musculus and H. sapiens, and the relationship between aggregation and longevity in C. elegans. The purpose
of our analysis is twofold: to provide examples that can be used as a reference in other studies and to shed
light on the link between nucleic-acid binding abilities and protein features, such as structural disorder
and aggregation, that are increasingly recognized as key factors for cellular function and homeostasis [7–9].
* Correspondence:
1 Gene Function and Evolution, Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003 Barcelona, Spain
2 Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain
Full list of author information is available at the end of the article
© 2015 Klus et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Implementation
The multiCM accepts multiple protein sets in FASTA
format. For each binary comparison, individual sets are
labelled as positive or negative (the assignment is only
needed to create two groups and does not influence the
calculations). In each comparison, the CM screens physico-chemical
properties encoded by protein sequences [3]
to identify those that best discriminate the positive and
negative classes. The currently supported physico-chemical
properties are nucleic acid binding propensity, membrane
propensity, alpha-helix propensity, aggregation
propensity, beta-sheet propensity, burial propensity and
hydrophobicity; custom properties can also be included,
as explained in the online Tutorial. For a detailed description
of CM performance, we refer to our previous
publication [3].
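As an illustration of the input handling described above, the following minimal sketch reads two FASTA-formatted sets and averages a per-residue scale over each sequence. It is not the CM implementation: the function names (`parse_fasta`, `mean_property`, `score_set`) and the toy sequences are hypothetical, and the Kyte-Doolittle hydropathy scale is used only as a stand-in for one hydrophobicity predictor.

```python
from statistics import mean

# Kyte-Doolittle hydropathy scale, used here only as an illustrative
# stand-in for one physico-chemical property (assumption, not the CM's scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line.upper()
    return records


def mean_property(sequence, scale):
    """Average a per-residue scale over a sequence, skipping unknown residues."""
    values = [scale[aa] for aa in sequence if aa in scale]
    return mean(values) if values else 0.0


def score_set(fasta_text, scale=KYTE_DOOLITTLE):
    """Map each protein in a FASTA set to its mean property value."""
    return {name: mean_property(seq, scale)
            for name, seq in parse_fasta(fasta_text).items()}


# Toy positive and negative sets (hypothetical sequences).
positive = ">p1 toy\nILVLLAAV\n>p2\nVVIAALFL\n"
negative = ">n1\nKKDEERRN\n>n2\nDEKRNQSD\n"

pos_scores = score_set(positive)
neg_scores = score_set(negative)
print("positive set mean:", mean(pos_scores.values()))
print("negative set mean:", mean(neg_scores.values()))
```

Swapping the labels of the two sets only flips the sign of the difference between the set means, which is consistent with the statement that the positive/negative assignment does not influence the calculations.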
In each multiCM run, the information is compiled together from individual models into a high-level
overview:
- The user can glean what trend is detected in the data using different physico-chemical features.
- The indicators collate 10 predictors for each selected feature and represent their consensus
  with a colour, akin to a micro-array slide (Fig. 1a).
- The colour of each array-spot represents differential states of enrichment for the dataset pairs
  and allows easy interpretation of increase, decrease or insufficient signal.
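The consensus step in the overview above can be sketched as follows. The colour coding (green, red, yellow) follows the legend of Fig. 1; the voting rule, the `min_agreement` threshold and the function names are assumptions made for illustration, since the exact consensus criterion is not stated here.

```python
from collections import Counter


def consensus_colour(calls, min_agreement=6):
    """Collapse per-predictor calls into one array-spot colour.

    calls: one value per predictor, +1 (positive set enriched),
           -1 (negative set enriched) or 0 (no significant difference).
    min_agreement: hypothetical number of agreeing predictors required
                   to report an enrichment (not taken from the paper).
    """
    counts = Counter(calls)
    if counts[1] >= min_agreement:
        return "green"   # positive set enriched
    if counts[-1] >= min_agreement:
        return "red"     # negative set enriched
    return "yellow"      # insufficient signal


def consensus_table(calls_by_pair, min_agreement=6):
    """Build a {(positive_set, negative_set): colour} table for all pairs."""
    return {pair: consensus_colour(calls, min_agreement)
            for pair, calls in calls_by_pair.items()}


# Toy calls from 10 predictors for three dataset pairs (hypothetical data).
toy_calls = {
    ("setA", "setB"): [1, 1, 1, 1, 1, 1, 1, 0, -1, 1],
    ("setA", "setC"): [-1, -1, -1, -1, -1, -1, 0, 0, 1, -1],
    ("setB", "setC"): [1, -1, 0, 0, 1, -1, 0, 0, 1, -1],
}

table = consensus_table(toy_calls)
for pair, colour in table.items():
    print(pair, colour)
```

A majority threshold above half of the predictors guarantees that a spot cannot qualify as both green and red at the same time.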
The analysis is not restricted to the consensus information: a link to a full CM view is provided in the
main panel (with details on p-value, cross-validation performances, ROC curves and other statistics). The detail
view contains the ID number of the CM run, which can be
used to create a cleverClassifier to study
new datasets [3], as well as a link to perform Gene
Ontology analysis using the second part of our toolkit,
the cleverGO.
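To make the ROC statistic mentioned above concrete, the sketch below computes the area under the ROC curve for a single property from the scores of a positive and a negative set. It uses the standard pairwise (rank-based) definition of the AUC; the function name `roc_auc` and the toy scores are hypothetical, and how the CM itself derives its ROC curves is not described in this section.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve by pairwise comparison.

    Equals the probability that a randomly chosen positive protein scores
    higher than a randomly chosen negative one; ties count as 0.5.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy property scores for two small sets (hypothetical values).
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(f"AUC = {auc:.3f}")
```

An AUC close to 0.5 indicates that the property does not separate the two sets, whereas values near 1 or 0 indicate enrichment in the positive or the negative set, respectively.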
The cleverGO webserver provides two ways to explore
data:
Fig. 1 RNA-binding abilities of S. cerevisiae chaperone substrates. a RNA-binding ability of yeast chaperone substrates is visualized in a microarray-like
table. Hsp90 and Hsp40 are predicted to have the largest number of nucleic-acid binding partners (Positive set: vertical axis; Negative set: horizontal
axis; Green: positive set is enriched with respect to negative set; Red: negative set is enriched with respect to positive set [3]; Yellow: non-significant
enrichmen (...truncated)