Halvade: scalable sequence analysis with MapReduce

Bioinformatics, Jul 2015

Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.

Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.

Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR. Its source is available at http://bioinformatics.intec.ugent.be/halvade under the GPL license.

Contact: jan.fostier{at}intec.ugent.be

Supplementary information: Supplementary data are available at Bioinformatics online.

Full text (PDF): https://bioinformatics.oxfordjournals.org/content/31/15/2482.full.pdf

Dries Decap (1, 2), Joke Reumers (0, 1), Charlotte Herzeel (1, 4), Pascal Costanza (1, 3), Jan Fostier (1, 2)

0 Janssen Research & Development, a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium
1 ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium
2 Department of Information Technology, Ghent University - iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium
3 Intel Corporation Belgium
4 Imec, Kapeldreef 75, 3001 Leuven, Belgium

1 Introduction

The speed of DNA sequencing has increased considerably with the introduction of next-generation sequencing platforms.
For example, modern Illumina systems can generate several hundreds of gigabases per run (Zhang et al., 2011) with high accuracy. This, in turn, gives rise to several hundreds of gigabytes of raw sequence data to be processed. Post-sequencing DNA analysis typically consists of two major phases: (i) alignment of reads to a reference genome and (ii) variant calling, i.e. the identification of differences between the reference genome and the genome from which the reads were sequenced. For both tasks, numerous tools have been described in the literature; see e.g. Fonseca et al. (2012) and Nielsen et al. (2011) for an overview. Especially for whole genome sequencing, applying such tools is a computational bottleneck. To illustrate this, we consider the recently proposed Best Practices pipeline for DNA sequencing analysis (Van der Auwera et al., 2013) that consists of the Burrows-Wheeler Aligner (BWA) (Li and Durbin, 2009) for the alignment step, Picard (http://picard.sourceforge.net) for data preparation and the Genome Analysis Toolkit (GATK) (DePristo et al., 2011; McKenna et al., 2010) for variant calling. On a single node, the execution of this pipeline consumes more time than the sequencing step itself: a dataset consisting of 1.5 billion paired-end reads (Illumina Platinum genomes, NA12878, 100 bp, 50-fold coverage, human genome) requires over 12 days using a single CPU core of a 24-core machine (dual socket Intel Xeon E5-2695 v2 @ 2.40 GHz): 172 h for the alignment phase, 35 h for data preparation (Picard steps) and 80 h for GATK, including local read realignment, base quality score recalibration and variant calling. When the tools involved are allowed to run multithreaded on the same machine, the runtime decreases only by a factor of roughly 2.5, to about 5 days, which indicates poor scaling behavior in some of the steps of the pipeline.
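The scaling claim above can be checked with a few lines of arithmetic. The sketch below uses only the runtimes quoted in the text (172 h, 35 h, 80 h, and roughly 5 days multithreaded on 24 cores); the variable names are illustrative, and parallel efficiency is taken in its usual sense of speedup divided by the number of cores.

```python
# Single-core runtimes per phase, in hours, as quoted in the text.
single_core_hours = {
    "alignment (BWA)": 172,
    "data preparation (Picard)": 35,
    "variant calling (GATK)": 80,
}

total_hours = sum(single_core_hours.values())  # sum of the rounded phase times
total_days = total_hours / 24                  # close to the 12 days quoted

# Multithreaded run on the same 24-core machine: roughly 5 days in the text.
multithreaded_hours = 5 * 24
cores = 24

speedup = total_hours / multithreaded_hours    # how much faster than one core
efficiency = speedup / cores                   # fraction of ideal 24x speedup

print(f"single core: {total_hours} h ({total_days:.1f} days)")
print(f"speedup with multithreading: {speedup:.2f}x")
print(f"parallel efficiency on {cores} cores: {efficiency:.1%}")
```

With these rounded inputs the speedup comes out just under 2.5 and the parallel efficiency at about 10%, which is the poor scaling the text refers to.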
To overcome this bottleneck, we developed Halvade, a modular framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure. It is based on the simple observation that read mapping is parallel by read, i.e. the alignment of a certain read is independent of the alignment of another read. Similarly, variant calling is conceptually parallel by chromosomal region, i.e. variant calling in a certain chromosomal region is independent of variant calling in a different region. Therefore, multiple instances of a tool can be run in parallel, each on a subset of the data. Halvade relies on the MapReduce programming model (Dean and Ghemawat, 2008) to execute tasks concurrently, both within and across compute nodes. The map phase corresponds to the read mapping step, while variant calling is performed during the reduce phase. Between the two phases, aligned reads are sorted in parallel according to genomic position. By making use of the aggregated compute power of multiple machines, Halvade is able to strongly reduce the runtime for post-sequencing analysis. A key feature of Halvade is th (...truncated)
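The map / sort / reduce structure described in this paragraph can be illustrated with a toy, single-process sketch. This is not Halvade's code and does not use Hadoop: the aligner and variant caller are replaced by trivial stand-ins, and the fixed-width region key (REGION_SIZE) and all function names are assumptions made purely for illustration. In a real MapReduce job, the map and reduce calls would run concurrently across cores and nodes.

```python
from collections import defaultdict

REGION_SIZE = 1000  # hypothetical region width in base pairs


def toy_align(read):
    """Stand-in for a real aligner: the toy reads already carry a position."""
    name, chrom, pos, seq = read
    return chrom, pos, name, seq


def map_phase(read_chunk):
    """Map: handle each read independently, emit (region key, aligned read)."""
    for read in read_chunk:
        chrom, pos, name, seq = toy_align(read)
        key = (chrom, pos // REGION_SIZE)
        yield key, (pos, name, seq)


def shuffle_and_sort(mapped_pairs):
    """Group aligned reads by region and sort each group by position."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}


def reduce_phase(key, sorted_reads):
    """Reduce: stand-in for variant calling, here a per-region summary."""
    chrom, region_index = key
    return {
        "chrom": chrom,
        "start": region_index * REGION_SIZE,
        "n_reads": len(sorted_reads),
        "first_pos": sorted_reads[0][0],
    }


# Two input chunks, as if the reads had been split across two map tasks.
chunks = [
    [("r1", "chr1", 1500, "ACGT"), ("r2", "chr2", 10, "GGCA")],
    [("r3", "chr1", 1200, "TTGA"), ("r4", "chr1", 20, "CCAT")],
]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle_and_sort(mapped)
results = [reduce_phase(key, reads) for key, reads in grouped.items()]

for region in results:
    print(region)
```

Because no map call depends on another read and no reduce call depends on another region, each call could be handed to a separate task without changing the result, which is the property the paragraph relies on.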


Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier. Halvade: scalable sequence analysis with MapReduce, Bioinformatics, 2015, pp. 2482-2488, 31/15, DOI: 10.1093/bioinformatics/btv179