RNAct: Protein–RNA interaction predictions for model organisms with supporting experimental data
Published online 16 November 2018
Nucleic Acids Research, 2019, Vol. 47, Database issue D601–D606
doi: 10.1093/nar/gky967
Benjamin Lang1, Alexandros Armaos1 and Gian G. Tartaglia1,2,3,4,*
1 Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain,
2 Institució Catalana de Recerca i Estudis Avançats (ICREA), 23 Passeig Lluís Companys, Barcelona 08010, Spain,
3 Universitat Pompeu Fabra (UPF), Department of Experimental and Health Sciences, Barcelona 08003, Spain and
4 Department of Biology ‘Charles Darwin’, Sapienza University of Rome, P.le A. Moro 5, Rome 00185, Italy
ABSTRACT
Protein–RNA interactions are implicated in a number of physiological roles as well as diseases, with
molecular mechanisms ranging from defects in RNA
splicing, localization and translation to the formation of aggregates. Currently, ∼1400 human proteins
have experimental evidence of RNA-binding activity. However, only ∼250 of these proteins currently
have experimental data on their target RNAs from
various sequencing-based methods such as eCLIP.
To bridge this gap, we used an established, computationally expensive protein–RNA interaction prediction method, catRAPID, to populate a large database,
RNAct. RNAct allows easy lookup of known and predicted interactions and enables global views of the
human, mouse and yeast protein–RNA interactomes,
expanding them in a genome-wide manner far beyond experimental data (http://rnact.crg.eu).
INTRODUCTION
RNA-binding proteins (RBPs) are key in RNA splicing,
processing, export, localization and regulation of translation and are implicated in a number of pathologies in humans. Examples include heterogeneous and life-threatening
genetic disorders, such as amyotrophic lateral sclerosis (1),
spinocerebellar ataxia and retinitis pigmentosa, among others (2,3). Human proteins encoded by 1393 genes currently
have experimental evidence of RNA-binding activity (4–
6). These proteins contain one or more RNA-binding regions, either in the form of canonical globular domains or of
more recently discovered, intrinsically disordered RNA interaction regions (7,8). Additionally, protein–protein interaction interfaces and enzymatic active sites are sometimes
employed for RNA binding (4,9). Protein–RNA interactions form an intricate network, and RNAs play structural
roles in many types of phase-separated biological condensates, such as stress granules (10).
However, the number of RBPs for which the identity
of their interaction partners is known is much lower. Two
hundred fifty Homo sapiens RBPs currently have high-throughput experimental data on the identity of their target RNAs (11,12), obtained mostly by various sequencing-based methods such as eCLIP, iCLIP, HITS-CLIP, PAR-CLIP and RIP-seq. Much smaller datasets are available for
Mus musculus (38 RBPs (12)), Drosophila melanogaster (29
RBPs from RIP-seq (13)) and Saccharomyces cerevisiae (69
RBPs from RIP-Chip (14)). A comprehensive collection of
CLIP data is available in the recently expanded POSTAR
database (12), previously called CLIPdb, which also includes motif-based target predictions for a set of human and
mouse RBPs (88 and 82, respectively).
To bridge the gap between the 1393 known RBPs and
the 250 for which we have experimental knowledge of
interaction partners, we used an established, experimentally validated (15,16) protein–RNA interaction prediction method, catRAPID (17–19), to generate proteome- and transcriptome-wide sets of interaction predictions. Our
database now covers the H. sapiens, M. musculus and S.
cerevisiae genomes and contains a total of 5.87 billion predicted pairwise interactions. This reflects nearly 120 years of computation time on the Centre for Genomic Regulation’s high-performance computing cluster, and for the first time provides predictions for all possible protein–RNA pairs in these species.
RNAct makes available our genome-wide protein–RNA
interaction predictions and combines them with powerful
and intuitive search functionality, including pairwise search
for sets of proteins and RNAs. The display is enriched with
useful annotation, including transcript support level (TSL)
and APPRIS classification for isoforms and RNA subcellular localization from the RNALocate database. Known
RBPs as well as interactions confirmed by large-scale experiments from the ENCODE project are clearly highlighted.
* To whom correspondence should be addressed. Tel: +34 933 160 116; Fax: +34 93 316 00 99; Email:
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Received August 15, 2018; Revised September 24, 2018; Editorial Decision September 28, 2018; Accepted October 11, 2018
MATERIALS AND METHODS
Proteomes
Transcriptomes
Transcriptomes were obtained from GENCODE (for human and mouse) (21) and Ensembl (for yeast) (22). GENCODE ‘basic’ RNAs are a representative subset prioritizing full-length protein-coding transcripts over partial or
non-coding transcripts for a given gene. The GENCODE
release used for human is Release 27 (genome assembly
GRCh38.p10), and both the ‘basic’ (98 608 transcripts with
successful interaction predictions) and ‘non-basic’ (100 722
transcripts) subsets were obtained for full coverage of the
human GENCODE transcriptome. These sets are kept separate for performance reasons, and the protein view currently does not show non-basic human RNAs (except in
the pairwise search). For mouse, GENCODE release M16
(genome assembly GRCm38.p5) was used, retaining only
the ‘basic’ subset (76 532 transcripts, ∼58% of the mouse
GENCODE transcriptome) due to resource and computation time constraints. For yeast, all coding and non-coding
transcripts from the Ensembl 92 release (April 2018) were
included (7029 transcripts with successful interaction predictions).
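The transcript counts above can be cross-checked with a few lines of arithmetic. The following is only a back-of-the-envelope sketch: the variable names are illustrative, the human total simply adds the two quoted subset sizes, and the mouse total is an estimate derived from the rounded "∼58%" figure, so it is approximate.

```python
# Transcript counts quoted in the text.
human_basic = 98_608        # 'basic' subset, with successful predictions
human_non_basic = 100_722   # 'non-basic' subset
mouse_basic = 76_532        # 'basic' subset only
mouse_basic_fraction = 0.58 # "~58%" in the text; rounded, so the estimate is rough
yeast_all = 7_029           # all coding and non-coding transcripts

# Human coverage is the sum of the two separately stored subsets.
human_total = human_basic + human_non_basic

# Implied size of the full mouse GENCODE transcriptome (approximate).
mouse_total_estimate = round(mouse_basic / mouse_basic_fraction)

print(f"human transcripts covered: {human_total:,}")
print(f"implied mouse transcriptome size: ~{mouse_total_estimate:,}")
print(f"yeast transcripts covered: {yeast_all:,}")
```

Under these assumptions, the human sets together cover just under 200 000 transcripts, and the mouse 'basic' subset leaves roughly 55 000 mouse transcripts without predictions.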
All FASTA sequence files used are available for download
in the RNAct Download section. A small number of these
sequences were excluded from RNAct due to limitations of
the catRAPID algorithm: short or extreme length (proteins
≤50 aa or >14 507 aa, RNAs ≤50 nt or >28 227 nt), or
unsuccessful RNA secondary structure prediction using the
ViennaRNA package which catRAPID relies on (23).
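The length-based exclusion criteria can be expressed as a simple filter. This is a minimal sketch, not the pipeline actually used for RNAct: the function and constant names are hypothetical, sequences are assumed to be held in a plain dictionary, and the second exclusion criterion (failed secondary structure prediction) is not modelled.

```python
from typing import Dict, Tuple

# Length limits quoted in the text: sequences are excluded if
# length <= lower bound or length > upper bound.
PROTEIN_LIMITS: Tuple[int, int] = (50, 14_507)  # amino acids
RNA_LIMITS: Tuple[int, int] = (50, 28_227)      # nucleotides


def within_limits(sequence: str, limits: Tuple[int, int]) -> bool:
    """Return True if the sequence length is > lower and <= upper."""
    lower, upper = limits
    return lower < len(sequence) <= upper


def filter_by_length(records: Dict[str, str],
                     limits: Tuple[int, int]) -> Dict[str, str]:
    """Keep only the records whose sequence length is within the limits."""
    return {name: seq for name, seq in records.items()
            if within_limits(seq, limits)}


# Toy sequences (hypothetical names) at and around the protein boundaries.
proteins = {
    "too_short": "M" * 50,      # excluded: <= 50 aa
    "shortest_ok": "M" * 51,    # kept
    "longest_ok": "M" * 14_507, # kept
    "too_long": "M" * 14_508,   # excluded: > 14 507 aa
}
kept = filter_by_length(proteins, PROTEIN_LIMITS)
print(sorted(kept))
```

The same function applies to transcripts by passing `RNA_LIMITS` instead of `PROTEIN_LIMITS`.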
Interaction predictions (catRAPID maximum fragment score)
To compute the interaction propensity scores, we used the
catRAPID approach (17) with the fragmentation procedure
(18,19) and normalized for sequence lengths (19). For each
protein–RNA pair, the fragments with the maximum interaction propensity score are used to assess overall binding
ability (...truncated)