Benchmarking short sequence mapping tools
Ayat Hatem (0, 1), Doruk Bozdağ (0), Amanda E Toland (2), Ümit V Çatalyürek (0, 1)
0 Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
1 Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA
2 Department of Molecular Virology, Immunology and Medical Genetics, The Ohio State University, Columbus, OH, USA
Background: The development of next-generation sequencing instruments has led to the generation of millions of short sequences in a single run. Aligning these reads to a reference genome is time consuming and demands fast and accurate alignment tools. However, the currently proposed tools make different compromises between the accuracy and the speed of mapping. Moreover, many important aspects are overlooked when the performance of a newly developed tool is compared to the state of the art. Therefore, there is a need for an objective evaluation method that covers all of these aspects. In this work, we introduce a benchmarking suite to extensively analyze mapping tools with respect to various aspects and provide an objective comparison.
Results: We applied our benchmarking tests to nine well-known mapping tools, namely Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST (mrFAST), using synthetic data and real RNA-Seq data. MAQ and RMAP are based on building hash tables for the reads, whereas the remaining tools are based on indexing the reference genome. The benchmarking tests reveal the strengths and weaknesses of each tool. The results show that no single tool outperforms all others in all metrics. However, Bowtie maintained the best throughput for most of the tests, while BWA performed better for longer read lengths. The benchmarking tests are not restricted to the mentioned tools and can be applied to others.
Conclusion: The mapping process is still a hard problem that is affected by many factors. In this work, we provided a benchmarking suite that reveals and evaluates the different factors affecting the mapping process. Still, there is no tool that outperforms all of the others in all the tests. Therefore, end users should clearly specify their needs in order to choose the tool that provides the best results.
Introduction
Next-generation sequencing (NGS) technology has
evolved rapidly in the last five years, leading to the
generation of hundreds of millions of sequences (reads)
in a single run. The number of generated reads varies
between 1 million for long reads generated by Roche/454
sequencer (400 base pairs (bps)) and 2.4 billion for short
reads generated by Illumina/Solexa and ABI/SOLiD™
sequencers (75 bps). The invention of
high-throughput sequencers has led to a significant cost
reduction, e.g., a Megabase of DNA sequence costs only
$0.1 [1].
Nevertheless, the large amount of generated data tells
us almost nothing about the DNA, as stated by Flicek and
Birney [2]. This is due to the lack of proper analysis tools
and algorithms. Therefore, bioinformatics researchers
started to think about new ways to efficiently handle and
analyze this large amount of data.
One area that has attracted many researchers is the
alignment (mapping) of the generated sequences, i.e., the
alignment of reads generated by NGS machines to a
reference genome. Efficient and accurate alignment of this
large number of reads is a crucial part of many application
workflows, such as genome resequencing [2], DNA
methylation [3], RNA-Seq [4], ChIP sequencing, SNP
detection [5], genomic structural variant detection [6],
and metagenomics [7].
Therefore, numerous tools have been developed to
undertake this challenging task including MAQ [8], RMAP [9],
GSNAP [10], Bowtie [11], Bowtie2 [12], BWA [13], SOAP2
[14], Mosaik [15], FANGS [16], SHRiMP [17], BFAST [18],
MapReads, SOCS [19], PASS [20], mrFAST [6], mrsFAST
[21], ZOOM [22], Slider [23], SliderII [24], RazerS [25],
RazerS3 [26], and Novoalign [27]. Moreover, GPU-based
tools such as SARUMAN [28] and SOAP3 [29] have been
developed to optimally map more reads. However, due
to using different mapping techniques, each tool provides
different trade-offs between speed and quality of the
mapping. For instance, the quality is often compromised in the
following ways to reduce runtime:
Neglecting base quality score.
Limiting the number of allowed mismatches.
Disabling gapped alignment or limiting the gap length.
Ignoring SNP information.
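Two of these compromises can be illustrated with a toy example. The following is a minimal sketch, not the algorithm of any tool discussed here: it assumes a brute-force scan over a short made-up reference, and the names `hamming` and `map_ungapped` as well as the example sequences are hypothetical. It shows how an ungapped mapper with a mismatch limit handles a substitution but misses a read containing a single-base deletion.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_ungapped(read, reference, max_mismatches):
    """Return every reference offset where the read aligns without gaps
    and with at most max_mismatches substitutions (brute-force scan)."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        if hamming(read, window) <= max_mismatches:
            hits.append(pos)
    return hits


# Made-up reference and reads, for illustration only.
reference = "ACGTTGCATGCCGATAGCTA"
exact_read = "GCATGCCG"      # copied from offset 5 of the reference
snp_read = "GCATACCG"        # same locus with one substitution
deletion_read = "GCATCCGA"   # same locus with one reference base deleted

exact_hits = map_ungapped(exact_read, reference, max_mismatches=0)
snp_hits_strict = map_ungapped(snp_read, reference, max_mismatches=0)
snp_hits_relaxed = map_ungapped(snp_read, reference, max_mismatches=1)
deletion_hits = map_ungapped(deletion_read, reference, max_mismatches=2)
```

In this sketch the read with a substitution is recovered only when one mismatch is allowed, while the read with a deletion stays unmapped even with two mismatches allowed, because without gap support every base after the deletion is shifted against the reference.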
In most cases, it is unclear how such compromises affect
the performance of newly developed tools in
comparison to the state-of-the-art ones. Therefore, many studies
have been carried out to provide such comparisons. Some
of the available studies were mainly focused on
providing new tools (e.g., [10,13]). The remaining studies tried
to provide a thorough comparison while each covering a
different aspect (e.g., [30-34]).
For instance, Li and Homer [30] classified the tools into
groups according to the used indexing technique and the
features the tools support such as gapped alignment, long
read alignment, and bisulfite-treated reads alignment. In
other words, in that work, the main focus was
classifying the tools into groups rather than evaluating their
performance under various settings.
Similar to Li and Homer, Fonseca et al. [34] provided
another classification study. However, they included more
tools in the study, around 60 mappers, while being more
focused on providing a comprehensive overview of the
characteristics of the tools.
Ruffalo et al. [32] presented a comparison between
Bowtie, BWA, Novoalign, SHRiMP, mrFAST, mrsFAST,
and SOAP2. Unlike the above-mentioned studies,
Ruffalo et al. evaluated the accuracy of the tools in different
settings. They defined a read to be correctly mapped if
it maps to the correct location in the genome and has
a quality score higher than or equal to the threshold.
Accordingly, they evaluated the behavior of the tools while
varying the sequencing error rate, indel size, and indel
frequency. However, they used the default options of the
mapping tools in most of the experiments. In addition,
they considered small simulated data sets of 500,000 reads
of length 50 bps while using an artificial genome of length
500 Mbp and the Human genome of length 3 Gbp as the
reference genomes.
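The correctness criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code of Ruffalo et al.: the `Alignment` record, the function names, and the `tolerance` parameter (how far a reported position may deviate from the simulated one) are hypothetical; the text only states that a read must map to the correct location with a quality score at or above the threshold.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for one reported alignment."""
    chrom: str
    pos: int
    mapq: int  # mapping quality score reported by the tool


def is_correctly_mapped(reported, true_chrom, true_pos,
                        mapq_threshold, tolerance=0):
    """A read counts as correct if it is mapped, lies at the true location
    (within the assumed tolerance), and meets the quality threshold."""
    if reported is None:  # unmapped read
        return False
    same_location = (reported.chrom == true_chrom
                     and abs(reported.pos - true_pos) <= tolerance)
    return same_location and reported.mapq >= mapq_threshold


def accuracy(alignments, truth, mapq_threshold, tolerance=0):
    """Fraction of simulated reads that are correctly mapped.

    alignments: dict read_id -> Alignment (missing key means unmapped)
    truth:      dict read_id -> (chrom, pos) from the read simulator
    """
    if not truth:
        return 0.0
    correct = sum(
        is_correctly_mapped(alignments.get(read_id), chrom, pos,
                            mapq_threshold, tolerance)
        for read_id, (chrom, pos) in truth.items()
    )
    return correct / len(truth)


# Made-up example: four simulated reads, three of them reported as mapped.
truth = {"r1": ("chr1", 100), "r2": ("chr1", 200),
         "r3": ("chr2", 50), "r4": ("chr2", 75)}
alignments = {"r1": Alignment("chr1", 100, 30),   # correct, high quality
              "r2": Alignment("chr1", 200, 5),    # correct, low quality
              "r3": Alignment("chr1", 50, 40)}    # wrong chromosome
strict = accuracy(alignments, truth, mapq_threshold=10)
lenient = accuracy(alignments, truth, mapq_threshold=0)
```

Under this definition the measured accuracy depends on the chosen threshold: lowering it admits more correctly placed reads with low quality scores, which is one reason results obtained with default options alone can be hard to compare across tools.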
Another study was done by Holtgrewe et al. [31], where
the focus was the sensitivity of the tools. They enumerated
the possible matching intervals with a maximum distance
k for each read. Afterwards, they evaluated the
sensitivity of the mappers according to the number of intervals
they detected. Holtgrewe et al. used the suggested
sensitivity evaluation criteria to evaluate the performance of
SOAP2, Bowtie, BWA, and SHRiMP2 on both simulated
and real datasets. However, they used small reference
genomes (the S. cerevisiae genome of length 12 Mbp and
the D. melanogaster genome of length 169 Mbp). In
addition, the experiments were performed on sm (...truncated)