MAP-RSeq: Mayo Analysis Pipeline for RNA sequencing
Krishna R Kalari (0), Asha A Nair (0), Jaysheel D Bhavsar (0), Daniel R O'Brien (0), Jaime I Davila (0), Matthew A Bockol (0), Jinfu Nie (0), Xiaojia Tang (0), Saurabh Baheti (0), Jay B Doughty (0), Sumit Middha (0), Hugues Sicotte (0), Aubrey E Thompson (3), Yan W Asmann (2), Jean-Pierre A Kocher (0, 1)
0 Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA
1 Present Address: Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA
2 Department of Health Sciences Research, Mayo Clinic, 4500 San Pablo Road, Jacksonville, FL 32224, USA
3 Department of Cancer Biology, Mayo Clinic, 4500 San Pablo Road, Jacksonville, FL 32224, USA
Background: Although the cost of next generation sequencing has decreased in recent years, there is still a lack of simple-to-use applications for comprehensive analysis of RNA sequencing data. There is no one-stop shop for transcriptomic genomics. We have developed MAP-RSeq, a comprehensive computational workflow that can be used to obtain genomic features from transcriptomic sequencing data for any genome.
Results: MAP-RSeq was validated using both simulated and real datasets to optimize tools and parameters. The MAP-RSeq workflow consists of six major modules: alignment of reads, quality assessment of reads, gene expression assessment and exon read counting, identification of expressed single nucleotide variants (SNVs), detection of fusion transcripts, and summarization of transcriptomics data and the final report. This workflow is available for human transcriptome analysis and can be easily adapted for other genomes. Several clinical and research projects at the Mayo Clinic have applied the MAP-RSeq workflow for RNA-Seq studies. The results from MAP-RSeq have thus far enabled clinicians and researchers to understand the transcriptomic landscape of diseases for better diagnosis and treatment of patients.
Conclusions: Our software provides gene counts, exon counts, fusion candidates, expressed single nucleotide variants, mapping statistics, visualizations, and a detailed research data report for RNA-Seq. The workflow can be executed on a standalone virtual machine or on a parallel Sun Grid Engine cluster. The software can be downloaded from http://bioinformaticstools.mayo.edu/research/maprseq/.
Background
Next generation sequencing (NGS) technology
breakthroughs have allowed us to define the transcriptomic
landscape for cancers and other diseases [1].
RNA sequencing (RNA-Seq) is information-rich; it enables
researchers to investigate a variety of genomic features,
such as gene expression, characterization of novel
transcripts, alternative splice sites, single nucleotide variants
(SNVs), fusion transcripts, long non-coding RNAs, small
insertions, and small deletions. Multiple software
packages are available for read alignment, quality
control, and gene expression and transcript
quantification for RNA-Seq [2-5]. However, most
RNA-Seq bioinformatics methods focus on only
a few genomic features for downstream
analysis [6-9]. At present there is no comprehensive
RNA-Seq workflow that can simply be installed and
used for multiple genomic feature analysis. At the Mayo
Clinic, we have developed MAP-RSeq, a comprehensive
computational workflow to align, assess, and report
multiple genomic features from paired-end RNA-Seq
data efficiently with a quick turnaround time. We have
tested a variety of tools and methods to accurately
estimate genomic features from RNA-Seq data. The
best-performing publicly available bioinformatics tools,
with optimized parameters, were included in our
workflow. Where needed, we have integrated in-house methods
or tools to fill in the gaps. We have thoroughly investigated
and compared the available tools and have optimized
parameters to make the workflow run seamlessly for
both virtual machine and cluster environments. Our
software has been tested with paired-end sequencing reads
from all Illumina platforms. Thus far, we have processed
1,535 Mayo Clinic samples using the MAP-RSeq
workflow. The MAP-RSeq research reports for RNA-Seq data
have enabled Mayo Clinic researchers and clinicians to
exchange datasets and findings. Standardizing the workflow
has allowed us to build a system that supports
investigation across multiple studies within the Mayo Clinic.
MAP-RSeq is a production application that allows
researchers with minimal expertise in Linux or Windows
to install the workflow and to analyze and interpret RNA-Seq data.
Implementation
MAP-RSeq uses a variety of freely available bioinformatics
tools along with in-house developed methods using Perl,
Python, R, and Java. MAP-RSeq is available in two versions.
The first version is single threaded and runs on a virtual
machine (VM). The VM version is straightforward to
install. The second version is multi-threaded and is
designed to run in a cluster environment.
Virtual machine
A virtual machine version of MAP-RSeq is available for
download [10]. It includes a
sample dataset, references (limited to chromosome 22),
and the complete MAP-RSeq workflow pre-installed.
VirtualBox software (free for Windows, Mac, and Linux
at [11]) needs to be installed on the host system. The
host system also needs at least 4 GB of physical memory
and at least 10 GB of
available disk space. Although our sample data is only from
human chromosome 22, this virtual machine can be
extended to the entire human reference genome or to
other species. However, this requires allocating more
memory (~16 GB) than may be available on a typical
desktop system and building the index reference files
for the species of interest.
Table 1 MAP-RSeq installation and run time for QuickStart virtual machine
Time to import into VM
Run time with sample data (chr22 only)
~20 minutes to download on consumer grade internet
~10 minutes to download on consumer grade internet
~6 hours (mostly downloading and indexing references)
Depends on the sample data used
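The host requirements stated in this section (at least 4 GB of memory and 10 GB of disk for the chromosome 22 QuickStart VM, and roughly 16 GB of memory for a full genome) can be checked before importing the VM. The following is a minimal sketch and not part of MAP-RSeq: the function names are hypothetical, "GB" is assumed to mean 10^9 bytes, and the memory probe relies on POSIX `sysconf`, so it is expected to work on Linux hosts but not on Windows.

```python
import os
import shutil

GB = 10 ** 9  # assumption: decimal gigabytes; the text does not say GB vs GiB

# Thresholds taken from the text above.
MIN_MEMORY_GB = 4           # chr22 QuickStart VM
MIN_DISK_GB = 10            # available disk for the VM
FULL_GENOME_MEMORY_GB = 16  # approximate figure for a whole genome


def check_host(memory_bytes, free_disk_bytes, full_genome=False):
    """Return a list of problems; an empty list means the host qualifies."""
    needed_memory_gb = FULL_GENOME_MEMORY_GB if full_genome else MIN_MEMORY_GB
    problems = []
    if memory_bytes < needed_memory_gb * GB:
        problems.append(
            "memory: need at least %d GB, found %.1f GB"
            % (needed_memory_gb, memory_bytes / GB)
        )
    if free_disk_bytes < MIN_DISK_GB * GB:
        problems.append(
            "disk: need at least %d GB free, found %.1f GB"
            % (MIN_DISK_GB, free_disk_bytes / GB)
        )
    return problems


def physical_memory_bytes():
    """Total physical memory via POSIX sysconf (assumed Linux host)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def current_host_problems(path=".", full_genome=False):
    """Check the machine this script runs on, using free space under `path`."""
    free_disk = shutil.disk_usage(path).free
    return check_host(physical_memory_bytes(), free_disk, full_genome)
```

Calling `current_host_problems()` on the intended host returns an empty list when both thresholds are met. Passing `full_genome=True` applies the higher memory figure; the text gives no separate disk figure for that case, so the disk threshold is left unchanged.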
Tables 1 and 2 show the install and run time metrics
of MAP-RSeq in virtual machine and Linux environments,
respectively. For Table 2, we downloaded the breast cancer
cell line data from CGHub [12] and randomly chose 4
million reads to run through the QuickStart VM. It took
6 hours for the MAP-RSeq workflow to complete. The run did
not exceed the 4 GB memory limit but relied heavily on
the swap space provided, which made it slower than it
would have been with more physical memory. Job
profiling indicates that the system could have used 11 GB of
memory for such a sample.
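The text does not say how the 4 million reads were chosen. One common way to draw a uniform random subset from paired-end FASTQ files of unknown length is reservoir sampling, sketched below. This is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the input is assumed to be two uncompressed FASTQ files with mates in the same order, and each record is assumed to span exactly four lines.

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-tuples of lines from an iterator of lines."""
    while True:
        record = tuple(islice(lines, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def reservoir_sample(items, k, seed=0):
    """Return k items chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def subsample_paired_fastq(r1_path, r2_path, out1_path, out2_path, k, seed=0):
    """Write k randomly chosen read pairs; returns the number of pairs written."""
    with open(r1_path) as r1, open(r2_path) as r2:
        # zip keeps mates together; it stops at the shorter file.
        pairs = zip(read_fastq(r1), read_fastq(r2))
        sample = reservoir_sample(pairs, k, seed)
    with open(out1_path, "w") as out1, open(out2_path, "w") as out2:
        for rec1, rec2 in sample:
            out1.writelines(rec1)
            out2.writelines(rec2)
    return len(sample)
```

Sampling pairs rather than individual reads keeps both mates of each fragment, which paired-end aligners require. The sketch holds all `k` sampled pairs in memory, so for millions of reads a dedicated subsampling tool may be more practical on a memory-constrained host.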
Sun grid engine
MAP-RSeq requires four processing cores with a total of
16 GB RAM for optimal performance. It also requires
8 GB of storage space for tools and reference file
installation. For MAP-RSeq execution the following packages
such as JAVA version 1.6.0_17 or higher, Perl version
5.10.0 or higher, Python version (...truncated)