A survey of best practices for RNA-seq data analysis
Conesa et al. Genome Biology (2016) 17:13
DOI 10.1186/s13059-016-0881-8
REVIEW
Open Access
Ana Conesa1,2*, Pedro Madrigal3,4*, Sonia Tarazona2,5, David Gomez-Cabrero6,7,8,9, Alejandra Cervera10,
Andrew McPherson11, Michał Wojciech Szcześniak12, Daniel J. Gaffney3, Laura L. Elo13, Xuegong Zhang14,15
and Ali Mortazavi16,17*
Abstract
RNA-sequencing (RNA-seq) has a wide variety of
applications, but no single analysis pipeline can be
used in all cases. We review all of the major steps in
RNA-seq data analysis, including experimental design,
quality control, read alignment, quantification of gene
and transcript levels, visualization, differential gene
expression, alternative splicing, functional analysis,
gene fusion detection and eQTL mapping. We
highlight the challenges associated with each step.
We discuss the analysis of small RNAs and the
integration of RNA-seq with other functional
genomics techniques. Finally, we discuss the outlook
for novel technologies that are changing the state of
the art in transcriptomics.
Background
Transcript identification and the quantification of gene
expression have been distinct core activities in molecular
biology ever since the discovery of RNA’s role as the key
intermediate between the genome and the proteome.
The power of sequencing RNA lies in the fact that the
twin aspects of discovery and quantification can be combined in a single high-throughput sequencing assay
called RNA-sequencing (RNA-seq). The pervasive adoption of RNA-seq has spread well beyond the genomics
community and has become a standard part of the toolkit
used by the life sciences research community. Many variations of RNA-seq protocols and analyses have been
published, making it challenging for new users to appreciate all of the steps necessary to conduct an RNA-seq study
properly.
* Correspondence: ; ;
1 Institute for Food and Agricultural Sciences, Department of Microbiology
and Cell Science, University of Florida, Gainesville, FL 32603, USA
3 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SA, UK
16 Department of Developmental and Cell Biology, University of California,
Irvine, Irvine, CA 92697-2300, USA
Full list of author information is available at the end of the article
There is no optimal pipeline for the variety of different
applications and analysis scenarios in which RNA-seq
can be used. Scientists plan experiments and adopt different analysis strategies depending on the organism being studied and their research goals. For example, if a
genome sequence is available for the studied organism,
it should be possible to identify transcripts by mapping
RNA-seq reads onto the genome. By contrast, for organisms without sequenced genomes, quantification would
be achieved by first assembling reads de novo into contigs and then mapping the reads back onto the assembled transcriptome. For well-annotated genomes such as the human
genome, researchers may choose to base their RNA-seq
analysis on the existing annotated reference transcriptome alone, or might try to identify new transcripts and
their differential regulation. Furthermore, investigators
might be interested only in messenger RNA isoform expression or microRNA (miRNA) levels or allele variant
identification. Both the experimental design and the analysis procedures will vary greatly in each of these cases.
RNA-seq can be used solo for transcriptome profiling or
in combination with other functional genomics methods
to enhance the analysis of gene expression. Finally, RNA-seq can be coupled with different types of biochemical
assay to analyze many other aspects of RNA biology, such
as RNA–protein binding, RNA structure, or RNA–RNA
interactions. These applications are, however, beyond the
scope of this review as we focus on ‘typical’ RNA-seq.
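The choice of analysis strategy described above (reference genome available or not, annotation-only analysis or discovery of new transcripts) can be sketched as a small decision function. This is only an illustrative sketch: the class `StudyContext`, its field names, and the strategy labels are hypothetical and are not taken from any tool or from this review.

```python
from dataclasses import dataclass


@dataclass
class StudyContext:
    """Hypothetical description of an RNA-seq study (field names are illustrative)."""
    has_reference_genome: bool
    has_good_annotation: bool = False
    wants_novel_transcripts: bool = False


def choose_strategy(ctx: StudyContext) -> str:
    """Return a coarse, assumed strategy label for the given study context."""
    if not ctx.has_reference_genome:
        # No sequenced genome: reads are first assembled de novo into contigs.
        return "de_novo_assembly"
    if ctx.has_good_annotation and not ctx.wants_novel_transcripts:
        # Well-annotated genome and interest in known transcripts only.
        return "annotated_transcriptome"
    # Genome available, and either the annotation is incomplete
    # or identifying new transcripts is an explicit goal.
    return "genome_mapping_with_discovery"


if __name__ == "__main__":
    human_known = StudyContext(has_reference_genome=True, has_good_annotation=True)
    non_model = StudyContext(has_reference_genome=False)
    print(choose_strategy(human_known))
    print(choose_strategy(non_model))
```

In practice the decision also depends on the research goal (for example mRNA isoforms, miRNA levels, or allele variants), which this sketch deliberately leaves out.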
Every RNA-seq experimental scenario could potentially have different optimal methods for transcript
quantification, normalization, and ultimately differential
expression analysis. Moreover, quality control checks
should be applied pertinently at different stages of the
analysis to ensure both reproducibility and reliability of
the results. Our focus is to outline current standards
and resources for the bioinformatics analysis of RNA-seq data. We do not aim to provide an exhaustive compilation of resources or software tools nor to indicate
one best analysis pipeline. Rather, we aim to provide a
commented guideline for RNA-seq data analysis. Figure 1
depicts a generic roadmap for experimental design and
analysis using standard Illumina sequencing. We also
briefly list several data integration paradigms that have
been proposed and comment on their potential and limitations. We finally discuss the opportunities as well as
challenges provided by single-cell RNA-seq and long-read technologies when compared to traditional short-read RNA-seq.
© 2016 Conesa et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Experimental design
A crucial prerequisite for a successful RNA-seq study is
that the data generated have the potential to answer the
biological questions of interest. This is achieved by first
defining a good experimental design, that is, by choosing
the library type, sequencing depth and number of replicates appropriate for the biological system under study,
and second by planning an adequate execution of the sequencing experiment itself, ensuring that data acquisition does not become contaminated with unnecessary
biases. In this section, we discuss both considerations.
One important aspect of the experimental design is
the RNA-extraction protocol used to remove the highly
abundant ribosomal RNA (rRNA), which typically constitutes over 90 % of total RNA in the cell, leaving the
1–2 % comprising messenger RNA (mRNA) that we are
normally interested in. For eukaryotes, this involves
choosing whether to enrich for mRNA using poly(A) selection or to deplete rRNA. Poly(A) selection typically
requires a relatively high proportion of mRNA with minimal degradation as measured by RNA integrity number
(RIN), which normally yields a higher overall fraction of
reads falling onto known exons. Many biologically relevant samples (such as tissue biopsies) cannot, however,
be obtained in great enough quantity or good enough
mRNA integrity to produce good poly(A) RNA-seq libraries and therefore require ribosomal depleti (...truncated)