Merging of multi-string BWTs with applications (pdf)

Article PDF cannot be displayed. You can download it here:

https://academic.oup.com/bioinformatics/article-pdf/30/24/3524/48931354/bioinformatics_30_24_3524.pdf

Merging of multi-string BWTs with applications

BIOINFORMATICS Vol. 30 no. 24 2014, pages 3524–3531 doi:10.1093/bioinformatics/btu584 ORIGINAL PAPER Sequence analysis Advance Access publication August 28, 2014 Merging of multi-string BWTs with applications James Holt* and Leonard McMillan Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA Associate Editor: Michael Brudno ABSTRACT Received on March 18, 2014; revised on June 26, 2014; accepted on August 25, 2014 1 INTRODUCTION The throughput of next-generation sequencing (NGS) technologies has increased at such a rate that it is now on the cusp of outpacing downstream computational and analysis pipelines (Kahn, 2011). The result is a bottleneck where huge datasets are held on secondary storage (disk) while awaiting processing. Raw sequence (e.g. FASTQ) files, composed of sequence and quality strings, are the most common intermediate piling up at this bottleneck. Moreover, the rapid development of new analysis tools has led to a culture of archiving raw sequence files for reanalysis in the future. This hoarding tradition reflects an entrenched notion that the costs of data generation far exceed the costs of analysis. The storage overhead of this bottleneck can be somewhat alleviated through the use of compression. However, decompression generally requires additional computational stages to decompress datasets before their use, which further impacts the throughput of subsequent analyses. This, in turn, has led to the need for algorithms that can operate directly on compressed data (Loh et al., 2012). *To whom correspondence should be addressed. 3524 Others have previously proposed representing raw sequencing data in a form that is more compressible and indexed in a way that is suitable for direct queries by downstream tools (Bauer et al., 2011; Cox et al., 2012; Mantaci et al., 2005). Our primary contribution is a method for merging these indexable representations of NGS raw sequence data to increase compressibility and search through all merged datasets with one query. The entire collection of reads can be efficiently searched for specific k-mers and the associated reads recovered. We leverage a Burrows–Wheeler Transform (BWT) variant that has been adapted to string collections by Bauer et al., 2011 for representing raw sequence data. Originally, the BWT was introduced as an algorithm for permuting a string to improve its compressibility (Burrows and Wheeler, 1994). The BWT of a string is closely related to a suffix array for the same string. In fact, it is merely the concatenation of the symbols preceding each suffix after those suffixes have been sorted. A special ‘end of string’ symbol (commonly ‘$’) is used as the predecessor of the string’s first symbol. The BWT increases string compressibility because it tends to group similar substrings together, which creates long runs of identical predecessor symbols. The BWT was exploited by Ferragina and Manzini (2001) who proposed an FM-index data structure that allows for searches of the BWT’s implicit suffix array to be performed. Additionally, these searches were shown to run in O(k) time where k is the query length, meaning that the BWT’s length does not affect the query time. Moreover, they showed that the FM-index can be constructed ‘on-the-fly’ in a single pass over a string’s BWT. The combination of the BWT and the FM-index allows large strings to be compressed into a smaller searchable form. A basic example of the BWT and the associated FM-index is shown in Table 1. In bioinformatics, the BWT has proven to be a useful tool for aligning short reads. The fundamental problem of short-read alignment is to take small strings and place them along a larger string such that the edit distance between corresponding letters is minimized. The BWT is most often used to represent a reference genome so that it can be searched for smaller substrings. Two prominent aligners, BWA (Li and Durbin, 2009) and Bowtie (Langmead et al., 2009), take advantage of the BWT for alignment. As sequencing and alignment rises in prominence, storing billions of reads on disk has become a common problem. Recently, several researchers have worked to apply the compression of the BWT to these large short-read sets. The BWT can be trivially constructed with multiple strings by simply concatenating them with a distinguishing breaking symbol as was done by Mantaci et al. (2005). Multi-string BWTs constructed this way generate suffixes that combine adjacent strings. Bauer et al. (2011) ß The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: Motivation: The throughput of genomic sequencing has increased to the point that is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows–Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis. Results: We present a novel algorithm that merges multi-string BWTs in OðLCS NÞ time where LCS is the length of their longest common substring between any of the inputs, and N is the total length of all inputs combined (number of symbols) using OðN log2 ðFÞÞ bits where F is the number of multi-string BWTs merged. This merged multi-string BWT is also shown to have a higher compressibility compared with the input multi-string BWTs separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications. Availability and implementation: The MSBWT package is available through PyPI with source code located at https://code.google.com/p/ msbwt/. Contact: Merging of multi-string BWTs Table 1. A sample BWT for the string ‘ACACAC$’ Index ACACAC$ CACAC$A ACAC$AC CAC$ACA AC$ACAC C$ACACA $ACACAC — Sorted rotations $ACACAC AC$ACAC ACAC$AC ACACAC $ C$ACACA CAC$ACA CACAC$A — BWT C C C $ A A A — Counts $ A C 0 0 0 0 1 1 1 1 0 0 0 0 0 1 2 3 0 1 2 3 3 3 3 3 Notes: The ‘$’ represents the ‘end-of-string’ symbol, which is lexicographically smaller than all other symbols. All rotations of the string are shown on the left most column. These rotations are then sorted in the second column. Finally, the BWT is the concatenation of the predecessor symbols from each sorted rotation (the last column of the sorted rotations), ‘CCC$AAA’. The occurrence counts of the FM-index are also shown on the right side of this table. This is a count of the occurrences of each symbol before (but not including) that index. Given that the suffixes starting with ‘$’, ‘A’ and ‘C’ start at 0, 1 and 4, respectively (offsets into the BWT), t (...truncated)