Halvade: scalable sequence analysis with MapReduce
Bioinformatics
Dries Decap 1 2
Joke Reumers 0 1
Charlotte Herzeel 1 4
Pascal Costanza 1 3
Jan Fostier 1 2
0 Janssen Research & Development, a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium
1 ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium
2 Department of Information Technology, Ghent University - iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium
3 Intel Corporation Belgium
4 Imec, Kapeldreef 75, 3001 Leuven, Belgium
Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.
Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.
Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR. Its source is available at http://bioinformatics.intec.ugent.be/halvade under GPL license.
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
The speed of DNA sequencing has increased considerably with the
introduction of next-generation sequencing platforms. For example,
modern Illumina systems can generate several hundred gigabases
per run (Zhang et al., 2011) with high accuracy. This, in turn,
gives rise to several hundreds of gigabytes of raw sequence data to
be processed.
Post-sequencing DNA analysis typically consists of two major
phases: (i) alignment of reads to a reference genome and (ii) variant
calling, i.e. the identification of differences between the reference
genome and the genome from which the reads were sequenced.
For both tasks, numerous tools have been described in the literature,
see e.g. Fonseca et al. (2012) and Nielsen et al. (2011) for an
overview. Especially for whole genome sequencing, applying such tools
is a computational bottleneck. To illustrate this, we consider the
recently proposed Best Practices pipeline for DNA sequencing analysis
(Van der Auwera et al., 2013) that consists of the Burrows-Wheeler
Aligner (BWA) (Li and Durbin, 2009) for the alignment step, Picard
(http://picard.sourceforge.net) for data preparation and the Genome
Analysis Toolkit (GATK) (DePristo et al., 2011; McKenna et al.,
2010) for variant calling. On a single node, the execution of this
pipeline consumes more time than the sequencing step itself: a
dataset consisting of 1.5 billion paired-end reads (Illumina Platinum
genomes, NA12878, 100 bp, 50-fold coverage, human genome)
requires over 12 days using a single CPU core of a 24-core machine
(dual socket Intel Xeon E5-2695 v2 @ 2.40 GHz): 172 h for the
alignment phase, 35 h for data preparation (Picard steps) and 80 h
for GATK, including local read realignment, base quality score
recalibration and variant calling. When the tools involved are allowed
to run multithreaded on the same machine, the runtime decreases only
by a factor of roughly 2.5, to about 5 days, which indicates poor
scaling behavior in some of the steps of the pipeline.
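The arithmetic behind these figures can be checked with a short sketch. The per-stage hours are the rounded values quoted above, and the 24-core count comes from the machine description. Treating "5 days" as exactly 120 h is an assumption made here for illustration, and parallel efficiency is taken in its usual sense of speedup divided by core count.

```python
# Single-core stage runtimes in hours, as quoted in the text (rounded values).
stage_hours = {
    "alignment (BWA)": 172,
    "data preparation (Picard)": 35,
    "variant calling (GATK)": 80,
}

CORES = 24              # cores in the benchmark machine
MULTITHREADED_DAYS = 5  # assumption: "5 days" taken as exactly 5 * 24 h

single_core_hours = sum(stage_hours.values())
multithreaded_hours = MULTITHREADED_DAYS * 24

# Speedup of the multithreaded run over the single-core run,
# and the fraction of the 24 cores that this speedup represents.
speedup = single_core_hours / multithreaded_hours
efficiency = speedup / CORES

for stage, hours in stage_hours.items():
    share = hours / single_core_hours
    print(f"{stage:28s} {hours:4d} h  ({share:.0%} of total)")

print(f"single core:   {single_core_hours} h = {single_core_hours / 24:.1f} days")
print(f"multithreaded: {multithreaded_hours} h")
print(f"speedup:       {speedup:.2f}x on {CORES} cores")
print(f"efficiency:    {efficiency:.0%}")
```

Under these assumptions the rounded stage times add up to 287 h, or about 12 days, and the multithreaded run uses the 24 cores at roughly 10% efficiency. Alignment accounts for about 60% of the single-core time.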
To overcome this bottleneck, we developed Halvade, a modular
framework that enables sequencing pipelines to be executed in
parallel on a multi-node and/or multi-core compute infrastructure. It is
based on the simple observation that read mapping is parallel by
read, i.e. the alignment of a certain read is independent of the
alignment of another read. Similarly, variant calling is conceptually
parallel by chromosomal region, i.e. variant calling in a certain
chromosomal region is independent of variant calling in a different
region. Therefore, multiple instances of a tool can be run in parallel
on a subset of the data. Halvade relies on the MapReduce
programming model (Dean and Ghemawat, 2008) to execute tasks
concurrently, both within and across compute nodes. The map phase
corresponds to the read mapping step while variant calling is
performed during the reduce phase. In between both phases, aligned
reads are sorted in parallel according to genomic position. By
making use of the aggregated compute power of multiple machines,
Halvade is able to strongly reduce the runtime for post-sequencing
analysis. A key feature of Halvade is th (...truncated)