A survey of best practices for RNA-seq data analysis (pdf)

Article PDF cannot be displayed. You can download it here:

http://genomebiology.com/content/pdf/s13059-016-0881-8.pdf

A survey of best practices for RNA-seq data analysis

Conesa et al. Genome Biology (2016) 17:13 DOI 10.1186/s13059-016-0881-8 REVIEW Open Access A survey of best practices for RNA-seq data analysis Ana Conesa1,2*, Pedro Madrigal3,4*, Sonia Tarazona2,5, David Gomez-Cabrero6,7,8,9, Alejandra Cervera10, Andrew McPherson11, Michał Wojciech Szcześniak12, Daniel J. Gaffney3, Laura L. Elo13, Xuegong Zhang14,15 and Ali Mortazavi16,17* Abstract RNA-sequencing (RNA-seq) has a wide variety of applications, but no single analysis pipeline can be used in all cases. We review all of the major steps in RNA-seq data analysis, including experimental design, quality control, read alignment, quantification of gene and transcript levels, visualization, differential gene expression, alternative splicing, functional analysis, gene fusion detection and eQTL mapping. We highlight the challenges associated with each step. We discuss the analysis of small RNAs and the integration of RNA-seq with other functional genomics techniques. Finally, we discuss the outlook for novel technologies that are changing the state of the art in transcriptomics. Background Transcript identification and the quantification of gene expression have been distinct core activities in molecular biology ever since the discovery of RNA’s role as the key intermediate between the genome and the proteome. The power of sequencing RNA lies in the fact that the twin aspects of discovery and quantification can be combined in a single high-throughput sequencing assay called RNA-sequencing (RNA-seq). The pervasive adoption of RNA-seq has spread well beyond the genomics community and has become a standard part of the toolkit used by the life sciences research community. Many variations of RNA-seq protocols and analyses have been * Correspondence: ; ; 1 Institute for Food and Agricultural Sciences, Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32603, USA 3 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK 16 Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697-2300, USA Full list of author information is available at the end of the article published, making it challenging for new users to appreciate all of the steps necessary to conduct an RNA-seq study properly. There is no optimal pipeline for the variety of different applications and analysis scenarios in which RNA-seq can be used. Scientists plan experiments and adopt different analysis strategies depending on the organism being studied and their research goals. For example, if a genome sequence is available for the studied organism, it should be possible to identify transcripts by mapping RNA-seq reads onto the genome. By contrast, for organisms without sequenced genomes, quantification would be achieved by first assembling reads de novo into contigs and then mapping these contigs onto the transcriptome. For well-annotated genomes such as the human genome, researchers may choose to base their RNA-seq analysis on the existing annotated reference transcriptome alone, or might try to identify new transcripts and their differential regulation. Furthermore, investigators might be interested only in messenger RNA isoform expression or microRNA (miRNA) levels or allele variant identification. Both the experimental design and the analysis procedures will vary greatly in each of these cases. RNA-seq can be used solo for transcriptome profiling or in combination with other functional genomics methods to enhance the analysis of gene expression. Finally, RNAseq can be coupled with different types of biochemical assay to analyze many other aspects of RNA biology, such as RNA–protein binding, RNA structure, or RNA–RNA interactions. These applications are, however, beyond the scope of this review as we focus on ‘typical’ RNA-seq. Every RNA-seq experimental scenario could potentially have different optimal methods for transcript quantification, normalization, and ultimately differential expression analysis. Moreover, quality control checks should be applied pertinently at different stages of the analysis to ensure both reproducibility and reliability of the results. Our focus is to outline current standards © 2016 Conesa et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Conesa et al. Genome Biology (2016) 17:13 and resources for the bioinformatics analysis of RNAseq data. We do not aim to provide an exhaustive compilation of resources or software tools nor to indicate one best analysis pipeline. Rather, we aim to provide a commented guideline for RNA-seq data analysis. Figure 1 depicts a generic roadmap for experimental design and analysis using standard Illumina sequencing. We also briefly list several data integration paradigms that have been proposed and comment on their potential and limitations. We finally discuss the opportunities as well as challenges provided by single-cell RNA-seq and longread technologies when compared to traditional shortread RNA-seq. Experimental design A crucial prerequisite for a successful RNA-seq study is that the data generated have the potential to answer the biological questions of interest. This is achieved by first defining a good experimental design, that is, by choosing the library type, sequencing depth and number of replicates appropriate for the biological system under study, Page 2 of 19 and second by planning an adequate execution of the sequencing experiment itself, ensuring that data acquisition does not become contaminated with unnecessary biases. In this section, we discuss both considerations. One important aspect of the experimental design is the RNA-extraction protocol used to remove the highly abundant ribosomal RNA (rRNA), which typically constitutes over 90 % of total RNA in the cell, leaving the 1–2 % comprising messenger RNA (mRNA) that we are normally interested in. For eukaryotes, this involves choosing whether to enrich for mRNA using poly(A) selection or to deplete rRNA. Poly(A) selection typically requires a relatively high proportion of mRNA with minimal degradation as measured by RNA integrity number (RIN), which normally yields a higher overall fraction of reads falling onto known exons. Many biologically relevant samples (such as tissue biopsies) cannot, however, be obtained in great enough quantity or good enough mRNA integrity to produce good poly(A) RNA-seq libraries and therefore require ribosomal depleti (...truncated)