catRAPID omics: a web server for large-scale prediction of protein–RNA interactions
Federico Agostini [0,1], Andreas Zanzoni [0,1], Petr Klus [0,1], Domenica Marchese [0,1], Davide Cirillo [0,1], Gian Gaetano Tartaglia [0,1]
[0] Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain
[1] Gene Function and Evolution, Bioinformatics and Genomics, Centre for Genomic Regulation (CRG)
Associate Editor: Ivo Hofacker
Summary: Here we introduce catRAPID omics, a server for large-scale calculations of protein–RNA interactions. Our web server allows (i) predictions at the proteomic and transcriptomic level; (ii) use of protein and RNA sequences without size restriction; (iii) analysis of nucleic acid binding regions in proteins; and (iv) detection of RNA motifs involved in protein recognition.
Results: We developed a web server to allow fast calculation of ribonucleoprotein associations in Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Homo sapiens, Mus musculus, Rattus norvegicus, Saccharomyces cerevisiae and Xenopus tropicalis (custom libraries can also be generated). catRAPID omics was benchmarked on the recently published RNA interactomes of serine/arginine-rich splicing factor 1 (SRSF1), histone-lysine N-methyltransferase EZH2 (EZH2), TAR DNA-binding protein 43 (TDP43) and RNA-binding protein FUS (FUS), as well as on the protein interactomes of U1/U2 small nuclear RNAs, X inactive specific transcript (Xist) repeat A region (RepA) and Crumbs homolog 3 (CRB3) 3′-untranslated region RNAs. Our predictions are highly significant (P < 0.05) and will help experimentalists identify candidates for further validation.
Availability: catRAPID omics can be freely accessed on the Web at http://s.tartaglialab.com/catrapid/omics. Documentation, tutorial and FAQs are available at http://s.tartaglialab.com/page/catrapid_group.
© The Author 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
1 INTRODUCTION
Increasing evidence indicates that ribonucleoprotein interactions
are fundamental for cellular regulation (Khalil and Rinn, 2011).
Moreover, several studies highlighted the involvement of RNA
molecules in the onset and progression of human diseases
including neurological disorders (Johnson et al., 2012). To our
knowledge, there are two sequence-based methods for prediction of
protein–RNA interactions: catRAPID (Bellucci et al., 2011) and
RPISeq (Muppirala et al., 2011). The catRAPID algorithm
exploits predictions of secondary structure, hydrogen bonding
and van der Waals contributions to estimate the binding
propensity of protein and RNA molecules. RPISeq is based on
support vector machine (SVM) and random forest (RF)
models predicting protein–RNA interactions from primary
structure alone (Muppirala et al., 2011). Both methods
perform well, but catRAPID discriminates
positive and negative cases with higher accuracy (Cirillo et al.,
2013b) and has been tested on long non-coding RNAs
(Agostini et al., 2013).
Here we introduce catRAPID omics to perform
high-throughput predictions of protein–RNA interactions using the
information on protein and RNA domains involved in macromolecular
recognition.
2 WORKFLOW AND IMPLEMENTATION
The catRAPID omics server provides two main services to
explore the interaction potential of (i) a protein of interest with
respect to a target transcriptome or (ii) a given RNA with respect
to the nucleic acid binding proteome. Several options are
available to refine the type of analysis in eight model organisms or
custom libraries (see online documentation):
• In the case of a protein query, catRAPID omics takes as
  input the protein sequence (FASTA format): full-length or,
  alternatively, nucleic acid binding regions.
• For a transcript query (FASTA format), the server uses the
  full-length sequence if it is shorter than 1200 nt or,
  alternatively, fragments with predicted stable secondary
  structure (Agostini et al., 2013). Full-length proteins and
  nucleic acid binding regions can be searched.
• The server automatically detects disordered proteins lacking
  canonical RNA-binding domains. Indeed, it has been
  observed that disordered regions are enriched in RNA-binding
  proteins (Castello et al., 2012).
• As RNA motifs are important for protein recognition
  (Kazan et al., 2010), a search for these elements is carried
  out. The motifs were taken from the RNA-Binding Protein
  DataBase (RBPDB) (Cook et al., 2011), SpliceAid-F
  (Giulietti et al., 2013) and a recent motif compendium
  (Ray et al., 2013).
• Using the distribution of interaction propensities, catRAPID
  omics predicts the RNA-binding ability of the input protein
  (86% accuracy) and ranks the RNA interactions (the ranked
  list can be downloaded by the user).
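The input handling and ranking steps listed above can be illustrated with a minimal Python sketch. It is not the server's implementation: the function names, the label strings and the example scores are hypothetical, and only the FASTA input format and the 1200 nt threshold come from the text. The actual selection of fragments relies on predicted secondary structure and is not reproduced here.

```python
from typing import Dict, List, Tuple

# Threshold stated in the text: transcripts shorter than this are used full-length.
MAX_FULL_LENGTH_NT = 1200


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {identifier: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Fall back to a positional name if the header is empty.
            name = header[0] if header else f"seq{len(chunks) + 1}"
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def plan_transcript_query(records: Dict[str, str]) -> Dict[str, str]:
    """Label each transcript as 'full-length' or 'fragments' (hypothetical labels).

    Only the length rule is applied here; choosing the fragments themselves
    would require a secondary-structure predictor.
    """
    return {
        name: "full-length" if len(seq) < MAX_FULL_LENGTH_NT else "fragments"
        for name, seq in records.items()
    }


def rank_interactions(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort partners by decreasing score (scores are assumed to be precomputed)."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In this sketch a higher score is assumed to mean a stronger predicted interaction, so the first element of the ranked list is the top candidate.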
In the output page (Fig. 1A), we report all the variables used
to estimate protein–RNA associations: interaction propensity
(Bellucci et al., 2011), discriminative power (Bellucci et al.,
2011), interaction strength (Agostini et al., 2013) and presence
of protein RNA-binding domains as well as RNA motifs. A star
rating system ranks the binding propensities
(http://service.tartaglialab.com/static_files/shared/faqs.html). As for the reference
sets, ENSEMBL (version 68) is used for retrieval and
classification of coding and non-coding RNAs, whereas protein sequences
are gathered from the UniProtKB database (release 2012_11).
Finally, catRAPID omics uses hmmscan, a Hidden Markov
Model-based algorithm from the HMMER3 package (Finn
et al., 2011), to identify known Pfam-A domains (Finn et al.,
2009) and recognize protein regions involved in binding nucleic
acid molecules. The significance of each hit is determined
according to the Pfam-A gathering thresholds.
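As an illustration of the domain search described above, the following Python sketch builds an hmmscan command line and parses its per-domain table. It is only a sketch under stated assumptions: the file names are placeholders, the helper names are hypothetical, and the exact options used by the server are not given in the text. The `--cut_ga` option tells hmmscan to apply the gathering thresholds stored in the profiles, and `--domtblout` writes a whitespace-delimited per-domain table.

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    domain: str    # name of the matched profile (target name column)
    query: str     # identifier of the query protein
    ali_from: int  # first aligned residue on the query (1-based)
    ali_to: int    # last aligned residue on the query


def hmmscan_command(hmm_db: str, fasta: str, out_table: str) -> List[str]:
    """Build an hmmscan call; hmm_db must have been prepared with hmmpress."""
    return ["hmmscan", "--cut_ga", "--domtblout", out_table, hmm_db, fasta]


def parse_domtblout(text: str) -> List[DomainHit]:
    """Extract domain name, query name and alignment coordinates from a domtblout table."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # Columns (0-based): 0 target name, 3 query name, 17 ali from, 18 ali to.
        hits.append(DomainHit(fields[0], fields[3], int(fields[17]), int(fields[18])))
    return hits
```

The command list can be passed to `subprocess.run(..., check=True)` on a machine where HMMER3 is installed, and the resulting table file read and handed to `parse_domtblout`.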
3 PERFORMANCES
The catRAPID algorithm has been previously validated on a
number of protein–RNA associations (Agostini et al., 2013;
Bellucci et al., 2011; Cirillo et al., 2013a; Johnson et al., 2012).
To evaluate large-scale performances of catRAPID omics, we
used data from recent large-scale experiments. To compare
predicted and experimental interactions, we used Fisher's exact test.
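The comparison can be sketched in Python as a test on a 2×2 contingency table of predicted versus experimentally detected interactions. The text does not state whether a one-sided or two-sided test was used; the one-sided (enrichment) form below is an assumption, and the function name and argument names are hypothetical.

```python
from math import comb


def fisher_exact_enrichment(both: int, predicted_only: int,
                            experimental_only: int, neither: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    Returns the probability, under the hypergeometric null, of observing at
    least `both` interactions that are simultaneously predicted and
    experimentally detected, with the table margins held fixed.
    """
    total = both + predicted_only + experimental_only + neither
    predicted = both + predicted_only        # row margin
    experimental = both + experimental_only  # column margin
    denominator = comb(total, experimental)
    upper = min(predicted, experimental)
    # comb() returns 0 when the lower index exceeds the upper one, so
    # impossible tables contribute nothing to the sum.
    numerator = sum(
        comb(predicted, k) * comb(total - predicted, experimental - k)
        for k in range(both, upper + 1)
    )
    return numerator / denominator
```

For example, a table with 3 interactions both predicted and detected, 1 predicted only, 1 detected only and 3 in neither set gives a P-value of 17/70, that is, about 0.24.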
As shown in Figure 1B, performances on the human splicing
factor serine/arginine-rich splicing factor 1 (SRSF1) (Sanford
et al., 2009) and murine nucleic acid binding protein
Histone-lysine N-methyltransferase EZH2 (EZH2) (Zhao et al., 2010) are
highly significant (P-values: 0.01 and 0.01, respectively). Good
performances are found for low-throughput experiments on
murine non-coding X inactive specific transcript (Xist) repeat
A region (RepA) (Maenner et al., 2010; Royce-Tolland et al.,
2010) and yeast small nuclear RNA U1 (Cvitkovic and Jurica,
2012) (P-values: 0.03 and 0.015) (Fig. 1B). To illustrate the
ability of catRAPID omics to predict interactions with nucleic acid
binding domains (Fig. 1C), we used murine FUS (Han et al.,
2012 (...truncated)