hppRNA—a Snakemake-based handy parameter-free pipeline for RNA-Seq analysis of numerous samples
Briefings in Bioinformatics, 19(4), 2018, 622–626
doi: 10.1093/bib/bbw143
Advance Access Publication Date: 17 January 2017
Paper
hppRNA—a Snakemake-based handy parameter-free
pipeline for RNA-Seq analysis of numerous samples
Corresponding author: Dapeng Wang, Department of Plant Sciences, University of Oxford, S Parks Rd, Oxford OX1 3RB, UK. E-mail:
Abstract
RNA-Seq technology has been gradually becoming a routine approach for characterizing the properties of transcriptome in terms of
organisms, cell types and conditions and consequently a big burden has been put on the facet of data analysis, which calls for an
easy-to-learn workflow to cope with the increased demands from a large number of laboratories across the world. We report a onein-all solution called hppRNA, composed of four scenarios such as pre-mapping, core-workflow, post-mapping and sequence variation detection, written by a series of individual Perl and R scripts, counting on well-established and preinstalled software, irrespective of single-end or paired-end, unstranded or stranded sequencing method. It features six independent core-workflows comprising
the state-of-the-art technology with dozens of popular cutting-edge tools such as Tophat–Cufflink–Cuffdiff, Subread–
featureCounts–DESeq2, STAR–RSEM–EBSeq, Bowtie–eXpress–edgeR, kallisto–sleuth, HISAT–StringTie–Ballgown, and embeds itself in
Snakemake, which is a modern pipeline management system. The core function of this pipeline is turning the raw fastq files into
gene/isoform expression matrix and differentially expressed genes or isoforms as well as the identification of fusion genes, single
nucleotide polymorphisms, long noncoding RNAs and circular RNAs. Last but not least, this pipeline is specifically designed for performing the systematic analysis on a huge set of samples in one go, ideally for the researchers who intend to deploy the pipeline on
their local servers. The scripts as well as the user manual are freely available at https://sourceforge.net/projects/hpprna/
Key words: RNA-Seq; pipeline; a large number of samples; gene expression profiling; sequence variation
Introduction
With the advancement of sequencing technology and big-data
analysis approach, RNA-Seq tends to be more prevalent and important in the biological laboratory in the current era and deems
one of the most dominant and efficient methodologies in the
measurement of gene expression [1, 2]. RNA-Seq has brought
about not only the novel chances of detection of low-expressed
genes and discrimination of isoforms belonging to the same
gene by means of unlimited in-depth sequencing, but also the
notable challenges in terms of both data structure and analysis
such as high-volume data, high-performance computing capacity and tricky processing [3, 4]. A typical RNA-Seq analysis involves multiple steps and hence the difficulty also arises from
the fact that it is imperative to enable consistency of the data
formats between the tools used in the adjacent steps. Apart
from this, researchers have to write scripts of their own to interpret and parse the intermediate outputs derived from a variety
of software.
To overcome the barrier, lots of pipeline programs for RNASeq analysis have been developed, including types of remotely
hosted and web-based servers and locally installed packages
based on a wide variety of programming or coding systems,
each of which has its particular strength and advantage. For instance, some tools accept complementary data formats as input, such as Chipster, wapRNA, PRADA, RseqFlow and RobiNA
[5–9]. Besides, RSEQtools establishes a model to protect the individual privacy by splitting the total information of alignments
into two parts and building a relationship between them for the
following in-depth analysis [10]. TRAPLINE constructs the networks of protein–protein and miRNA–target interactions [11].
TCW supports the cross-species transcriptome analysis through
the integration of a few evolutionary tools. ArrayExpressHTS
and easyRNASeq are chiefly implemented in R language and
both of them use R objects to store intermediate data and call
the other programs that are inside or outside R environment
[12, 13]. NGSUtils, ViennaNGS and S-MART provide a collection
of next generation sequencing (NGS) relevant tools that are
Dapeng Wang, Department of Plant Sciences, University of Oxford. He is working on the construction and application of a series of pipelines for nextgeneration-sequencing analysis.
Submitted: 19 September 2016; Received (in revised form): 12 December 2016
C The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email:
V
622
Dapeng Wang
A pipeline for RNA-Seq analysis
Methods
Genomic sequences in fasta format and gene annotations in gtf
format for two species such as human (hg19) and mouse
(mm10) were gleaned from iGenomes database provided by illumina corporation (ftp://ussd-ftp.illumina.com/). Only genes that
are defined as protein-coding and located in the complete
chromosomes could be retained. Another file composed of the
relationship between gene name and transcript name was
created for further use. To reduce the total amount of resource
data, index files for mapping will be produced only after respective workflow is evoked and will be stored within the workflow folder. Known long noncoding RNA (lncRNA) and repeat
annotation files were collected from GENCODE database [27]
and UCSC Table Browser [28], respectively.
Initially, the pipeline precedes with the raw fastq files generated from the sequencer or commercial repository such as
BaseSpace. The quality of reads is evaluated by FastQC (http://
www.bioinformatics.babraham.ac.uk/projects/fastqc/) for both
the raw fastq files and processed fastq files. Base quality filter is
done by means of PRINSEQ-lite [29] or FASTX-Toolkit (http://han
nonlab.cshl.edu/fastx_toolkit/) for paired-end or single-end
reads, respectively. Adaptor sequences, where applicable, could
be trimmed by Cutadapt [30] in light of library construction
methods. The number of unique reads is estimated by fastx_collapser (as part of FASTX-Toolkit) based on sequence identity.
In the main module of this pipeline, there are six categories
of typical workflows to work on the three steps including mapping, quantification and differentially expressed genes (DEGs)
detection: (1) Tophat–Cufflink–Cuffdiff [31–33]; (2) Subread–
featureCounts–DESeq2 [34–36]; (3) STAR–RSEM–EBSeq [37–39];
(4) Bowtie–eXpress–edgeR [40–42]; (5) kallisto–sleuth [43, 44]; (6)
HISAT–StringTie–Ballgown [45–47] (Figure 1). With the guidance
of the methodological designs from different workflows, genome sequences are fed to the mappers for workflow (1), (2), (3)
and (6), and transcript sequences are offered to those for workflow (4) and (5). In all cases, the BAM files, either in genomic coordinate or in transcript coordinate, would be generated by the
mappers, of which the former is preferred. Two matrix file (...truncated)