Analysis of Whole Transcriptome Sequencing Data: Workflow and Software (pdf)

Article PDF cannot be displayed. You can download it here:

http://genominfo.org/upload/pdf/gni-13-119.pdf

Analysis of Whole Transcriptome Sequencing Data: Workflow and Software

G&I eISSN 2234-0742 Genomics & Informatics 13, No. 4, 2015 Genomics InformVol. 2015;13(4):119-125 http://dx.doi.org/10.5808/GI.2015.13.4.119 Genomics & Informatics REVIEW ARTICLE Analysis of Whole Transcriptome Sequencing Data: Workflow and Software In Seok Yang, Sangwoo Kim* Severance Biomedical Science Institute, Yonsei University College of Medicine, Seoul 03722, Korea RNA is a polymeric molecule implicated in various biological processes, such as the coding, decoding, regulation, and expression of genes. Numerous studies have examined RNA features using whole transcriptome sequencing (RNA-seq) approaches. RNA-seq is a powerful technique for characterizing and quantifying the transcriptome and accelerates the development of bioinformatics software. In this review, we introduce routine RNA-seq workflow together with related software, focusing particularly on transcriptome reconstruction and expression quantification. Keywords: bioinformatics tools, gene expression, high-throughput RNA sequencing, transcript Introduction The transcriptome is the entire set of RNA transcripts in a given cell for a specific developmental stage or physiological condition [1]. Understanding the transcriptome is necessary for interpreting the functional elements of the genome as well as for understanding the underlying mechanisms of development and disease. Microarray technologies have been used for high-throughput large-scale RNA-level studies, such as to identify differentially expressed genes between developmental stages or between healthy and diseased groups [2]. However, its hybridization-based nature limits the ability to catalog and quantify RNA molecules expressed under various conditions. Advances in massive parallel DNA sequencing technologies have enabled transcriptome sequencing (RNA-seq) by sequencing of cDNA. RNA-seq has rapidly replaced microarray technology because of its better resolution and higher reproducibility; this method can be used to extend our knowledge of alternative splicing events [3], novel genes and transcripts [4], and fusion transcripts [5]. One concern regarding the application of RNA-seq is abundance estimation at the gene-level and transcript-level differential expression under distinct conditions. Routine RNA-seq workflow may consist of the following five steps as shown in Fig. 1: (1) preprocessing of raw data, (2) read alignment, (3) transcriptome reconstruction, (4) expression quantification, and (5) differential expression analysis. As an initial step, RNA-seq data may be subjected to quality control (QC) of the raw data before data analysis. Similar to whole genome or exome sequencing, read alignment can be performed to map the reads to the reference genome or transcriptome. Clinical samples including formalin-fixed paraffin-embedded specimen and cancer tissue biopsies are often degraded or exist in limited amount [6]. Thus additional QC procedure can be performed to evaluate the performance of the RNA-seq experiment itself after read alignment. Next, transcriptome reconstruction is carried out to identify all transcripts expressed in a specimen based on read mapping data. If there is no available reference sequence, this procedure can be conducted directly using a de novo assembly approach. Once all transcripts are identified, their abundances can be estimated using read mapping data. Finally, differential expression analysis is conducted using currently available programs. In this review, we discuss the RNA-seq workflow and its related bioinformatics tools in each step (Table 1), focusing on transcriptome reconstruction and abundance quantification. Received October 13, 2015; Revised December 10, 2015; Accepted December 12, 2015 *Corresponding author: Tel: +82-2-2228-0913, Fax: +82-2-2227-8129, E-mail: Copyright © 2015 by the Korea Genome Organization CC It is identical to the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/). www.genominfo.org 119 IS Yang and S Kim. RNA-Seq Analysis Workflow and Software from aligning. The adapter trimming step is typically not necessary, as most recent sequencers provide raw data in which the adapters are already trimmed. In contrast, quality trimming may be an essential step depending on the analysis strategy used. The FASTX-Toolkit [10] and FLEXBAR [11] are useful for this purpose. Read Alignment Fig. 1. Typical workflow for RNA sequencing (RNA-seq) data analysis. This workflow shows an example for expression quantification and differential expression analysis at gene and/or transcript level using RNA-seq, which is typically consisted of five steps as following: preprocessing, read alignment, transcriptome reconstruction, expression quantification and differential expression analysis. For each step, currently available programs are written in Table 1. QC, quality control. Preprocessing of Raw Data Similarly to whole genome or exome sequencing, RNAseq data is formatted in FASTQ (sequence and base quality). Numerous erroneous sequence variants can be introduced during the library preparation, sequencing, and imaging steps [7], which should be identified and filtered out in the data analysis step. Thus, QC of raw data should be performed as the initial step of routine RNA-seq workflow. Tools such as FastQC [8] and HTQC [9] can be applied in this step to assess the quality of raw data, enabling assessment of the overall and per-base quality for each read (i.e., read 1 and 2 in case of paired-end sequencing) in each sample. Depending on the RNA-seq library construction strategy, some form of read trimming may be advisable prior to aligning the RNA-seq data. Two common trimming strategies include “adapter trimming” and “quality trimming.” Adapter trimming involves removal of the adapter sequence by masking specific sequences used during library construction. Quality trimming generally removes the ends of reads where base quality scores have decreased to a level such that sequence errors and the resulting mismatches prevent reads 120 There are two strategies in which a genome or transcriptome is used as a reference for the read alignment step [12]. The transcriptome comprises all transcripts in a given specimen and in which splicing has been conducted by including the exons and excluding the introns. If a transcriptome is used as a reference, unspliced aligners that do not allow large gaps may be the proper choice for accurate read mapping. Stampy, Mapping and Assembly with Quality (MAQ) [13], Burrow-Wheeler Aligner (BWA) [14], and Bowtie [15] can be used in this case. This alignment is limited to the identification of known exons and junctions because it does not identify splicing events involving novel exons. However, if the genome is used as a reference, spliced aligners that allow a wide range of gaps should be employed because reads aligned at exon-exon junctions will be split into two fragments. This approach may increase the probability of (...truncated)