RNAct: Protein–RNA interaction predictions for model organisms with supporting experimental data



Published online 16 November 2018
Nucleic Acids Research, 2019, Vol. 47, Database issue D601–D606
doi: 10.1093/nar/gky967

Benjamin Lang1, Alexandros Armaos1 and Gian G. Tartaglia1,2,3,4,*

1 Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain, 2 Institució Catalana de Recerca i Estudis Avançats (ICREA), 23 Passeig Lluís Companys, Barcelona 08010, Spain, 3 Universitat Pompeu Fabra (UPF), Department of Experimental and Health Sciences, Barcelona 08003, Spain and 4 Department of Biology ‘Charles Darwin’, Sapienza University of Rome, P.le A. Moro 5, Rome 00185, Italy

ABSTRACT

Protein–RNA interactions are implicated in a number of physiological roles as well as diseases, with molecular mechanisms ranging from defects in RNA splicing, localization and translation to the formation of aggregates. Currently, ∼1400 human proteins have experimental evidence of RNA-binding activity. However, only ∼250 of these proteins currently have experimental data on their target RNAs from various sequencing-based methods such as eCLIP. To bridge this gap, we used an established, computationally expensive protein–RNA interaction prediction method, catRAPID, to populate a large database, RNAct. RNAct allows easy lookup of known and predicted interactions and enables global views of the human, mouse and yeast protein–RNA interactomes, expanding them in a genome-wide manner far beyond experimental data (http://rnact.crg.eu).

INTRODUCTION

RNA-binding proteins (RBPs) are key in RNA splicing, processing, export, localization and regulation of translation and are implicated in a number of pathologies in humans. Examples include heterogeneous and life-threatening genetic disorders, such as amyotrophic lateral sclerosis (1), spinocerebellar ataxia and retinitis pigmentosa, among others (2,3).
Human proteins encoded by 1393 genes currently have experimental evidence of RNA-binding activity (4–6). These proteins contain one or more RNA-binding regions, either in the form of canonical globular domains or of more recently discovered, intrinsically disordered RNA interaction regions (7,8). Additionally, protein–protein interaction interfaces and enzymatic active sites are sometimes employed for RNA binding (4,9). Protein–RNA interactions form an intricate network, and RNAs play structural roles in many types of phase-separated biological condensates, such as stress granules (10).

However, the number of RBPs for which the identity of their interaction partners is known is much lower. Two hundred fifty Homo sapiens RBPs currently have high-throughput experimental data on the identity of their target RNAs (11,12), obtained mostly by various sequencing-based methods such as eCLIP, iCLIP, HITS-CLIP, PAR-CLIP and RIP-seq. Much smaller datasets are available for Mus musculus (38 RBPs (12)), Drosophila melanogaster (29 RBPs from RIP-seq (13)) and Saccharomyces cerevisiae (69 RBPs from RIP-Chip (14)). A comprehensive collection of CLIP data is available in the recently expanded POSTAR database (12), previously called CLIPdb, which also includes motif-based target predictions for a set of human and mouse RBPs (88 and 82, respectively).

To bridge the gap between the 1393 known RBPs and the 250 for which we have experimental knowledge of interaction partners, we used an established, experimentally validated (15,16) protein–RNA interaction prediction method, catRAPID (17–19), to generate proteome- and transcriptome-wide sets of interaction predictions. Our database now covers the H. sapiens, M. musculus and S. cerevisiae genomes and contains a total of 5.87 billion pairwise interactions.
This reflects nearly 120 years of computation time on the Centre for Genomic Regulation’s high-performance computing cluster, and for the first time provides all possible protein–RNA interactions in these species.

RNAct makes available our genome-wide protein–RNA interaction predictions and combines them with powerful and intuitive search functionality, including pairwise search for sets of proteins and RNAs. The display is enriched with useful annotation, including transcript support level (TSL) and APPRIS classification for isoforms and RNA subcellular localization from the RNALocate database. Known RBPs as well as interactions confirmed by large-scale experiments from the ENCODE project are clearly highlighted.

* To whom correspondence should be addressed. Tel: +34 933 160 116; Fax: +34 93 316 00 99; Email:

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Received August 15, 2018; Revised September 24, 2018; Editorial Decision September 28, 2018; Accepted October 11, 2018

MATERIALS AND METHODS

Proteomes

Transcriptomes

Transcriptomes were obtained from GENCODE (for human and mouse) (21) and Ensembl (for yeast) (22). GENCODE ‘basic’ RNAs are a representative subset prioritizing full-length protein-coding transcripts over partial or non-coding transcripts for a given gene. The GENCODE release used for human is Release 27 (genome assembly GRCh38.p10), and both the ‘basic’ (98 608 transcripts with successful interaction predictions) and ‘non-basic’ (100 722 transcripts) subsets were obtained for full coverage of the human GENCODE transcriptome.
These sets are kept separate for performance reasons, and the protein view currently does not show non-basic human RNAs (except in the pairwise search). For mouse, GENCODE release M16 (genome assembly GRCm38.p5) was used, retaining only the ‘basic’ subset (76 532 transcripts, ∼58% of the mouse GENCODE transcriptome) due to resource and computation time constraints. For yeast, all coding and non-coding transcripts from the Ensembl 92 release (April 2018) were included (7029 transcripts with successful interaction predictions). All FASTA sequence files used are available for download in the RNAct Download section.

A small number of these sequences were excluded from RNAct due to limitations of the catRAPID algorithm: short or extreme length (proteins ≤50 aa or >14 507 aa, RNAs ≤50 nt or >28 227 nt), or unsuccessful RNA secondary structure prediction using the ViennaRNA package which catRAPID relies on (23).

Interaction predictions (catRAPID maximum fragment score)

To compute the interaction propensity scores, we used the catRAPID approach (17) with the fragmentation procedure (18,19) and normalized for sequence lengths (19). For each protein–RNA pair, the fragments with the maximum interaction propensity score are used to assess overall binding ability (...truncated)
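The length-based exclusion criteria stated under Transcriptomes above (proteins ≤50 aa or >14 507 aa, RNAs ≤50 nt or >28 227 nt) can be illustrated with a minimal Python sketch. This is not the RNAct or catRAPID code: the function and constant names are hypothetical, the input is assumed to be a simple mapping from sequence identifier to sequence string, and the second exclusion criterion (failed ViennaRNA secondary structure prediction) is not modelled here.

```python
from typing import Callable, Dict

# Length limits as quoted in the text. All names here are illustrative
# assumptions, not identifiers from RNAct or catRAPID.
PROTEIN_MIN_EXCLUDED = 50      # proteins of <= 50 aa are excluded
PROTEIN_MAX_ALLOWED = 14_507   # proteins of > 14 507 aa are excluded
RNA_MIN_EXCLUDED = 50          # RNAs of <= 50 nt are excluded
RNA_MAX_ALLOWED = 28_227       # RNAs of > 28 227 nt are excluded


def protein_length_ok(seq: str) -> bool:
    """True if a protein sequence falls inside the stated length window."""
    return PROTEIN_MIN_EXCLUDED < len(seq) <= PROTEIN_MAX_ALLOWED


def rna_length_ok(seq: str) -> bool:
    """True if an RNA sequence falls inside the stated length window."""
    return RNA_MIN_EXCLUDED < len(seq) <= RNA_MAX_ALLOWED


def filter_by_length(
    records: Dict[str, str], length_ok: Callable[[str], bool]
) -> Dict[str, str]:
    """Keep only the records whose sequence passes the given length check."""
    return {name: seq for name, seq in records.items() if length_ok(seq)}


if __name__ == "__main__":
    # Toy sequences for illustration only; not real transcripts.
    rnas = {"too_short": "A" * 50, "kept": "ACGU" * 100, "too_long": "A" * 28_228}
    print(sorted(filter_by_length(rnas, rna_length_ok)))
```

Passing the length check as a function keeps the protein and RNA limits separate while reusing the same filtering step for both sequence types.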