hppRNA—a Snakemake-based handy parameter-free pipeline for RNA-Seq analysis of numerous samples (pdf)

Article PDF cannot be displayed. You can download it here:

https://academic.oup.com/bib/article-pdf/19/4/622/25193091/bbw143.pdf

hppRNA—a Snakemake-based handy parameter-free pipeline for RNA-Seq analysis of numerous samples

Briefings in Bioinformatics, 19(4), 2018, 622–626 doi: 10.1093/bib/bbw143 Advance Access Publication Date: 17 January 2017 Paper hppRNA—a Snakemake-based handy parameter-free pipeline for RNA-Seq analysis of numerous samples Corresponding author: Dapeng Wang, Department of Plant Sciences, University of Oxford, S Parks Rd, Oxford OX1 3RB, UK. E-mail: Abstract RNA-Seq technology has been gradually becoming a routine approach for characterizing the properties of transcriptome in terms of organisms, cell types and conditions and consequently a big burden has been put on the facet of data analysis, which calls for an easy-to-learn workflow to cope with the increased demands from a large number of laboratories across the world. We report a onein-all solution called hppRNA, composed of four scenarios such as pre-mapping, core-workflow, post-mapping and sequence variation detection, written by a series of individual Perl and R scripts, counting on well-established and preinstalled software, irrespective of single-end or paired-end, unstranded or stranded sequencing method. It features six independent core-workflows comprising the state-of-the-art technology with dozens of popular cutting-edge tools such as Tophat–Cufflink–Cuffdiff, Subread– featureCounts–DESeq2, STAR–RSEM–EBSeq, Bowtie–eXpress–edgeR, kallisto–sleuth, HISAT–StringTie–Ballgown, and embeds itself in Snakemake, which is a modern pipeline management system. The core function of this pipeline is turning the raw fastq files into gene/isoform expression matrix and differentially expressed genes or isoforms as well as the identification of fusion genes, single nucleotide polymorphisms, long noncoding RNAs and circular RNAs. Last but not least, this pipeline is specifically designed for performing the systematic analysis on a huge set of samples in one go, ideally for the researchers who intend to deploy the pipeline on their local servers. The scripts as well as the user manual are freely available at https://sourceforge.net/projects/hpprna/ Key words: RNA-Seq; pipeline; a large number of samples; gene expression profiling; sequence variation Introduction With the advancement of sequencing technology and big-data analysis approach, RNA-Seq tends to be more prevalent and important in the biological laboratory in the current era and deems one of the most dominant and efficient methodologies in the measurement of gene expression [1, 2]. RNA-Seq has brought about not only the novel chances of detection of low-expressed genes and discrimination of isoforms belonging to the same gene by means of unlimited in-depth sequencing, but also the notable challenges in terms of both data structure and analysis such as high-volume data, high-performance computing capacity and tricky processing [3, 4]. A typical RNA-Seq analysis involves multiple steps and hence the difficulty also arises from the fact that it is imperative to enable consistency of the data formats between the tools used in the adjacent steps. Apart from this, researchers have to write scripts of their own to interpret and parse the intermediate outputs derived from a variety of software. To overcome the barrier, lots of pipeline programs for RNASeq analysis have been developed, including types of remotely hosted and web-based servers and locally installed packages based on a wide variety of programming or coding systems, each of which has its particular strength and advantage. For instance, some tools accept complementary data formats as input, such as Chipster, wapRNA, PRADA, RseqFlow and RobiNA [5–9]. Besides, RSEQtools establishes a model to protect the individual privacy by splitting the total information of alignments into two parts and building a relationship between them for the following in-depth analysis [10]. TRAPLINE constructs the networks of protein–protein and miRNA–target interactions [11]. TCW supports the cross-species transcriptome analysis through the integration of a few evolutionary tools. ArrayExpressHTS and easyRNASeq are chiefly implemented in R language and both of them use R objects to store intermediate data and call the other programs that are inside or outside R environment [12, 13]. NGSUtils, ViennaNGS and S-MART provide a collection of next generation sequencing (NGS) relevant tools that are Dapeng Wang, Department of Plant Sciences, University of Oxford. He is working on the construction and application of a series of pipelines for nextgeneration-sequencing analysis. Submitted: 19 September 2016; Received (in revised form): 12 December 2016 C The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: V 622 Dapeng Wang A pipeline for RNA-Seq analysis Methods Genomic sequences in fasta format and gene annotations in gtf format for two species such as human (hg19) and mouse (mm10) were gleaned from iGenomes database provided by illumina corporation (ftp://ussd-ftp.illumina.com/). Only genes that are defined as protein-coding and located in the complete chromosomes could be retained. Another file composed of the relationship between gene name and transcript name was created for further use. To reduce the total amount of resource data, index files for mapping will be produced only after respective workflow is evoked and will be stored within the workflow folder. Known long noncoding RNA (lncRNA) and repeat annotation files were collected from GENCODE database [27] and UCSC Table Browser [28], respectively. Initially, the pipeline precedes with the raw fastq files generated from the sequencer or commercial repository such as BaseSpace. The quality of reads is evaluated by FastQC (http:// www.bioinformatics.babraham.ac.uk/projects/fastqc/) for both the raw fastq files and processed fastq files. Base quality filter is done by means of PRINSEQ-lite [29] or FASTX-Toolkit (http://han nonlab.cshl.edu/fastx_toolkit/) for paired-end or single-end reads, respectively. Adaptor sequences, where applicable, could be trimmed by Cutadapt [30] in light of library construction methods. The number of unique reads is estimated by fastx_collapser (as part of FASTX-Toolkit) based on sequence identity. In the main module of this pipeline, there are six categories of typical workflows to work on the three steps including mapping, quantification and differentially expressed genes (DEGs) detection: (1) Tophat–Cufflink–Cuffdiff [31–33]; (2) Subread– featureCounts–DESeq2 [34–36]; (3) STAR–RSEM–EBSeq [37–39]; (4) Bowtie–eXpress–edgeR [40–42]; (5) kallisto–sleuth [43, 44]; (6) HISAT–StringTie–Ballgown [45–47] (Figure 1). With the guidance of the methodological designs from different workflows, genome sequences are fed to the mappers for workflow (1), (2), (3) and (6), and transcript sequences are offered to those for workflow (4) and (5). In all cases, the BAM files, either in genomic coordinate or in transcript coordinate, would be generated by the mappers, of which the former is preferred. Two matrix file (...truncated)