Analysis of Whole Transcriptome Sequencing Data: Workflow and Software
Genomics & Informatics (G&I), Vol. 13, No. 4, 2015
eISSN 2234-0742
Genomics Inform 2015;13(4):119-125
http://dx.doi.org/10.5808/GI.2015.13.4.119
REVIEW ARTICLE
In Seok Yang, Sangwoo Kim*
Severance Biomedical Science Institute, Yonsei University College of Medicine, Seoul 03722, Korea
RNA is a polymeric molecule implicated in various biological processes, such as the coding, decoding, regulation, and
expression of genes. Numerous studies have examined RNA features using whole transcriptome sequencing (RNA-seq)
approaches. RNA-seq is a powerful technique for characterizing and quantifying the transcriptome, and it has accelerated the
development of bioinformatics software. In this review, we introduce the routine RNA-seq workflow together with related
software, focusing particularly on transcriptome reconstruction and expression quantification.
Keywords: bioinformatics tools, gene expression, high-throughput RNA sequencing, transcript
Introduction
The transcriptome is the entire set of RNA transcripts in
a given cell for a specific developmental stage or physiological condition [1]. Understanding the transcriptome is
necessary for interpreting the functional elements of the
genome as well as for understanding the underlying
mechanisms of development and disease. Microarray technologies have been used for high-throughput, large-scale
RNA-level studies, such as identifying differentially expressed genes between developmental stages or between
healthy and diseased groups [2]. However, their hybridization-based nature limits the ability to catalog and quantify
RNA molecules expressed under various conditions.
Advances in massively parallel DNA sequencing technologies
have enabled transcriptome sequencing (RNA-seq) by
sequencing of cDNA. RNA-seq has rapidly replaced
microarray technology because of its better resolution and
higher reproducibility; this method can be used to extend
our knowledge of alternative splicing events [3], novel genes
and transcripts [4], and fusion transcripts [5].
One major concern in applying RNA-seq is abundance
estimation and differential expression analysis at the gene and
transcript levels under distinct conditions. A routine
RNA-seq workflow may consist of the following five steps, as
shown in Fig. 1: (1) preprocessing of raw data, (2) read
alignment, (3) transcriptome reconstruction, (4) expression
quantification, and (5) differential expression analysis. As an
initial step, RNA-seq data may be subjected to quality
control (QC) of the raw data before data analysis. As in
whole genome or exome sequencing, read alignment can then be
performed to map the reads to the reference genome or
transcriptome. Clinical samples, including formalin-fixed
paraffin-embedded specimens and cancer tissue biopsies, are
often degraded or available only in limited amounts [6]. Thus,
an additional QC procedure can be performed after read
alignment to evaluate the performance of the RNA-seq
experiment itself. Next, transcriptome reconstruction is carried out
to identify all transcripts expressed in a specimen based on
read mapping data. If there is no available reference
sequence, this procedure can be conducted directly using a de
novo assembly approach. Once all transcripts are identified,
their abundances can be estimated using read mapping data.
Finally, differential expression analysis is conducted using
currently available programs. In this review, we discuss the
RNA-seq workflow and its related bioinformatics tools in
each step (Table 1), focusing on transcriptome reconstruction and abundance quantification.
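The ordering of the steps described above can be summarized in a short Python sketch. This is only an illustration of the workflow logic, not an implementation of any tool listed in Table 1: the names `Step` and `build_workflow` and the `has_reference` flag are hypothetical, and the sketch assumes that, when no reference sequence is available, read alignment is skipped and reconstruction proceeds by de novo assembly directly from the reads.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One stage of the workflow; both fields are illustrative labels."""
    name: str
    purpose: str


def build_workflow(has_reference: bool = True) -> list[Step]:
    """Return the workflow steps in the order they would be run.

    Assumption of this sketch: without a reference, read alignment is
    skipped and reconstruction is done by de novo assembly from the reads.
    """
    steps = [Step("preprocessing", "QC and optional trimming of raw reads")]
    if has_reference:
        steps.append(Step("read alignment",
                          "map reads to a reference genome or transcriptome"))
        steps.append(Step("transcriptome reconstruction",
                          "identify expressed transcripts from read mappings"))
    else:
        steps.append(Step("transcriptome reconstruction",
                          "de novo assembly directly from reads"))
    steps.append(Step("expression quantification",
                      "estimate gene and transcript abundances"))
    steps.append(Step("differential expression analysis",
                      "compare abundances between conditions"))
    return steps


if __name__ == "__main__":
    # Print the reference-based workflow as a numbered list.
    for i, step in enumerate(build_workflow(), start=1):
        print(f"({i}) {step.name}: {step.purpose}")
```

Calling `build_workflow(has_reference=False)` returns the shorter de novo variant, which makes explicit the single point at which the two routes diverge.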
Received October 13, 2015; Revised December 10, 2015; Accepted December 12, 2015
*Corresponding author: Tel: +82-2-2228-0913, Fax: +82-2-2227-8129, E-mail:
Copyright © 2015 by the Korea Genome Organization
It is identical to the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/).
www.genominfo.org
Fig. 1. Typical workflow for RNA sequencing (RNA-seq) data
analysis. This workflow shows an example of expression quantification and differential expression analysis at the gene and/or transcript
level using RNA-seq, which typically consists of the following five steps:
preprocessing, read alignment, transcriptome reconstruction, expression quantification, and differential expression analysis.
For each step, currently available programs are listed in Table 1.
QC, quality control.
Preprocessing of Raw Data
As in whole genome or exome sequencing, RNA-seq data are formatted in FASTQ (sequence and base quality).
Numerous erroneous sequence variants can be introduced
during the library preparation, sequencing, and imaging
steps [7], and these should be identified and filtered out in the
data analysis step. Thus, QC of raw data should be performed
as the initial step of the routine RNA-seq workflow. Tools such
as FastQC [8] and HTQC [9] can be applied in this step to
assess the overall and per-base quality of the raw data
for each read (i.e., reads 1 and 2
in the case of paired-end sequencing) in each sample. Depending
on the RNA-seq library construction strategy, some form of
read trimming may be advisable prior to aligning the
RNA-seq data. Two common trimming strategies are
“adapter trimming” and “quality trimming.” Adapter
trimming involves removal of the adapter sequence by masking specific sequences used during library construction.
Quality trimming generally removes the ends of reads where
base quality scores have decreased to a level such that
sequence errors and the resulting mismatches prevent reads
from aligning. The adapter trimming step is typically not
necessary, as most recent sequencers provide raw data in
which the adapters are already trimmed. In contrast, quality
trimming may be an essential step depending on the analysis
strategy used. The FASTX-Toolkit [10] and FLEXBAR [11]
are useful for this purpose.
Read Alignment
Two strategies exist for the read alignment step, using either a genome or a transcriptome as the reference
[12]. The transcriptome comprises all transcripts in a given
specimen, in which splicing has joined
the exons and removed the introns. If a
transcriptome is used as a reference, unspliced aligners that
do not allow large gaps may be the proper choice for accurate
read mapping. Stampy, Mapping and Assembly with Quality
(MAQ) [13], Burrows-Wheeler Aligner (BWA) [14], and
Bowtie [15] can be used in this case. This alignment is
limited to the identification of known exons and junctions
because it does not identify splicing events involving novel
exons. However, if the genome is used as a reference, spliced
aligners that allow a wide range of gaps should be employed
because reads aligned at exon-exon junctions will be split
into two fragments. This approach may increase the probability of (...truncated)