Merging of multi-string BWTs with applications
BIOINFORMATICS
Vol. 30 no. 24 2014, pages 3524–3531
doi:10.1093/bioinformatics/btu584
ORIGINAL PAPER
Sequence analysis
Advance Access publication August 28, 2014
Merging of multi-string BWTs with applications
James Holt* and Leonard McMillan
Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA
Associate Editor: Michael Brudno
ABSTRACT
Received on March 18, 2014; revised on June 26, 2014; accepted on
August 25, 2014
1
INTRODUCTION
The throughput of next-generation sequencing (NGS) technologies has increased at such a rate that it is now on the cusp of
outpacing downstream computational and analysis pipelines
(Kahn, 2011). The result is a bottleneck where huge datasets
are held on secondary storage (disk) while awaiting processing.
Raw sequence (e.g. FASTQ) files, composed of sequence and
quality strings, are the most common intermediate piling up at
this bottleneck. Moreover, the rapid development of new analysis tools has led to a culture of archiving raw sequence files for
reanalysis in the future. This hoarding tradition reflects an
entrenched notion that the costs of data generation far exceed
the costs of analysis. The storage overhead of this bottleneck
can be somewhat alleviated through the use of compression.
However, decompression generally requires additional computational stages to decompress datasets before their use, which further impacts the throughput of subsequent analyses. This, in
turn, has led to the need for algorithms that can operate directly
on compressed data (Loh et al., 2012).
*To whom correspondence should be addressed.
3524
Others have previously proposed representing raw sequencing
data in a form that is more compressible and indexed in a way
that is suitable for direct queries by downstream tools (Bauer
et al., 2011; Cox et al., 2012; Mantaci et al., 2005). Our primary
contribution is a method for merging these indexable representations of NGS raw sequence data to increase compressibility
and search through all merged datasets with one query. The
entire collection of reads can be efficiently searched for specific
k-mers and the associated reads recovered.
We leverage a Burrows–Wheeler Transform (BWT) variant
that has been adapted to string collections by Bauer et al.,
2011 for representing raw sequence data. Originally, the BWT
was introduced as an algorithm for permuting a string to
improve its compressibility (Burrows and Wheeler, 1994). The
BWT of a string is closely related to a suffix array for the same
string. In fact, it is merely the concatenation of the symbols
preceding each suffix after those suffixes have been sorted.
A special ‘end of string’ symbol (commonly ‘$’) is used as the
predecessor of the string’s first symbol. The BWT increases string
compressibility because it tends to group similar substrings
together, which creates long runs of identical predecessor symbols. The BWT was exploited by Ferragina and Manzini (2001)
who proposed an FM-index data structure that allows for
searches of the BWT’s implicit suffix array to be performed.
Additionally, these searches were shown to run in O(k) time
where k is the query length, meaning that the BWT’s length
does not affect the query time. Moreover, they showed that the
FM-index can be constructed ‘on-the-fly’ in a single pass over a
string’s BWT. The combination of the BWT and the FM-index
allows large strings to be compressed into a smaller searchable
form. A basic example of the BWT and the associated FM-index
is shown in Table 1.
In bioinformatics, the BWT has proven to be a useful tool for
aligning short reads. The fundamental problem of short-read
alignment is to take small strings and place them along a
larger string such that the edit distance between corresponding
letters is minimized. The BWT is most often used to represent a
reference genome so that it can be searched for smaller substrings. Two prominent aligners, BWA (Li and Durbin, 2009)
and Bowtie (Langmead et al., 2009), take advantage of the BWT
for alignment.
As sequencing and alignment rises in prominence, storing billions of reads on disk has become a common problem. Recently,
several researchers have worked to apply the compression of the
BWT to these large short-read sets. The BWT can be trivially
constructed with multiple strings by simply concatenating them
with a distinguishing breaking symbol as was done by Mantaci
et al. (2005). Multi-string BWTs constructed this way generate
suffixes that combine adjacent strings. Bauer et al. (2011)
ß The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail:
Motivation: The throughput of genomic sequencing has increased to
the point that is overrunning the rate of downstream analysis. This,
along with the desire to revisit old data, has led to a situation where
large quantities of raw, and nearly impenetrable, sequence data are
rapidly filling the hard drives of modern biology labs. These datasets
can be compressed via a multi-string variant of the Burrows–Wheeler
Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute
arbitrary reads as needed. We propose a method for merging such
datasets for both increased compression and downstream analysis.
Results: We present a novel algorithm that merges multi-string BWTs
in OðLCS NÞ time where LCS is the length of their longest common
substring between any of the inputs, and N is the total length of all
inputs combined (number of symbols) using OðN log2 ðFÞÞ bits where
F is the number of multi-string BWTs merged. This merged multi-string
BWT is also shown to have a higher compressibility compared with the
input multi-string BWTs separately. Additionally, we explore some
uses of a merged multi-string BWT for bioinformatics applications.
Availability and implementation: The MSBWT package is available
through PyPI with source code located at https://code.google.com/p/
msbwt/.
Contact:
Merging of multi-string BWTs
Table 1. A sample BWT for the string ‘ACACAC$’
Index
ACACAC$
CACAC$A
ACAC$AC
CAC$ACA
AC$ACAC
C$ACACA
$ACACAC
—
Sorted rotations
$ACACAC
AC$ACAC
ACAC$AC
ACACAC $
C$ACACA
CAC$ACA
CACAC$A
—
BWT
C
C
C
$
A
A
A
—
Counts
$
A
C
0
0
0
0
1
1
1
1
0
0
0
0
0
1
2
3
0
1
2
3
3
3
3
3
Notes: The ‘$’ represents the ‘end-of-string’ symbol, which is lexicographically smaller than all other symbols. All rotations of the string are shown on the left most
column. These rotations are then sorted in the second column. Finally, the BWT is
the concatenation of the predecessor symbols from each sorted rotation (the last
column of the sorted rotations), ‘CCC$AAA’. The occurrence counts of the
FM-index are also shown on the right side of this table. This is a count of the
occurrences of each symbol before (but not including) that index. Given that
the suffixes starting with ‘$’, ‘A’ and ‘C’ start at 0, 1 and 4, respectively (offsets
into the BWT), t (...truncated)