Relative Suffix Trees

© The British Computer Society 2017. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Advance Access publication on 21 November 2017. doi:10.1093/comjnl/bxx108

ANDREA FARRUGGIA¹, TRAVIS GAGIE²,³, GONZALO NAVARRO²,⁴*, SIMON J. PUGLISI⁵ AND JOUNI SIRÉN⁶

¹ Department of Computer Science, University of Pisa, Largo Bruno Pontecorvo 3, 56127 Pisa PI, Italy
² CeBiB (Center for Biotechnology and Bioengineering), Santiago, Chile
³ Escuela de Informática y Telecomunicaciones, Diego Portales University, Ejército 441, Santiago, Chile
⁴ Department of Computer Science, University of Chile, Beauchef 851, Santiago, Chile
⁵ Department of Computer Science, University of Helsinki, Helsinki, Finland
⁶ Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK
* Corresponding author:

Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.
Keywords: suffix trees; compressed text indexing; repetitive collections

Received 12 May 2017; revised 1 September 2017; editorial decision 16 October 2017; Handling editor: Raphael Clifford

1. INTRODUCTION

The suffix tree [1] is one of the most powerful bioinformatic tools to answer complex queries on DNA and protein sequences [2–4]. A serious problem that hampers its wider use on large genome sequences is its size, which may be 10–20 bytes per character. In addition, the non-local access patterns required by most interesting problems solved with suffix trees complicate secondary-memory deployments.

This problem has led to numerous efforts to reduce the size of suffix trees by representing them using compressed data structures [5–17], leading to compressed suffix trees (CST). Currently, the smallest CST is the so-called fully compressed suffix tree (FCST) [10, 14], which uses 5 bits per character (bpc) for DNA sequences, but takes milliseconds to simulate suffix tree navigation operations. In the other extreme, Sadakane’s CST [5, 11] uses about 12 bpc and operates in microseconds, and even nanoseconds for the simplest operations.

A space usage of 12 bpc may seem reasonable to handle, for example, one human genome, which has about 3.1 billion bases: it can be operated within a RAM of 4.5 GB (the representation contains the sequence as well). However, as the price of sequencing has fallen, sequencing the genomes of a large number of individuals has become a routine activity. The 1000 Genomes Project [18] sequenced the genomes of several thousand humans, while newer projects can be orders of magnitude larger. This has made the development of techniques for storing and analyzing huge amounts of sequence data flourish. Just storing 1000 human genomes using a 12 bpc CST requires almost 4.5 TB, which is much more than the amount of memory available in a commodity server.
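The space figures above follow from simple arithmetic on bits per character, which the sketch below reproduces. The function names, the decimal definition of a gigabyte and the 256 GB server size are assumptions made for illustration. Only the 3.1 billion bases and the 12 bpc and 5 bpc rates come from the text.

```python
import math

HUMAN_GENOME_BASES = 3_100_000_000  # about 3.1 billion bases, as stated in the text


def index_bytes(num_chars: int, bits_per_char: float) -> float:
    """Size in bytes of an index that spends bits_per_char bits on each character."""
    return num_chars * bits_per_char / 8


def servers_needed(total_bytes: float, server_bytes: float) -> int:
    """Smallest number of servers whose combined memory holds total_bytes."""
    return math.ceil(total_bytes / server_bytes)


if __name__ == "__main__":
    GB = 10**9           # assumption: decimal gigabytes; the text does not specify
    server = 256 * GB    # assumption: hypothetical memory of one server
    for name, bpc in [("Sadakane's CST", 12), ("FCST", 5)]:
        one_genome = index_bytes(HUMAN_GENOME_BASES, bpc)
        collection = 1000 * one_genome
        print(f"{name}: {one_genome / GB:.2f} GB per genome, "
              f"{collection / (1000 * GB):.2f} TB for 1000 genomes, "
              f"{servers_needed(collection, server)} servers")
```

With exactly 3.1 billion bases, the 12 bpc case comes to about 4.65 GB per genome, slightly above the rounded 4.5 GB quoted in the text. Server counts derived from it can therefore differ by one, depending on rounding and on whether a gigabyte is taken as 10^9 or 2^30 bytes.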
Assuming that a single server has 256 GB of memory, we would need a cluster of 18 servers to handle such a collection of CSTs (compared with over 100 with classical suffix tree implementations!). With the smaller (and much slower) FCST, this would drop to 7–8 servers. It is clear that further space reductions in the representation of CST would lead to reductions in hardware, communication and energy costs when implementing complex searches over large genomic databases.

SECTION A: COMPUTER SCIENCE THEORY, METHODS AND TOOLS. THE COMPUTER JOURNAL, VOL. 61 NO. 5, 2018

An important characteristic of those large genome databases is that they usually consist of the genomes of individuals of the same or closely related species. This implies that the collections are highly repetitive, that is, each genome can be obtained by concatenating a relatively small number of substrings of other genomes and adding a few new characters. When repetitiveness is considered, much higher compression rates can be obtained in CST. For example, it is possible to reduce the space to 1–2 bpc (albeit with operation times in the milliseconds) [13], or to 2–3 bpc with operation times in the microseconds [15]. Using 2 bpc, our 1000 genomes could be handled with just three servers with 256 GB of memory.

Compression algorithms best capture repetitiveness by using grammar-based compression or Lempel–Ziv compression.¹ In the first case [19, 20], one finds a context-free grammar that generates (only) the text collection. Rather than compressing the text directly, the current CSTs for repetitive collections [13, 15] apply grammar-based compression on the data structures that simulate the suffix tree. Grammar-based compression yields relatively easy direct access to the compressed sequence [21], which makes it attractive compared to Lempel–Ziv compression [22], despite the latter generally using less space.
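To illustrate why a grammar gives relatively easy direct access, the following minimal sketch stores a grammar in which every rule is either a single character or a pair of nonterminals, and reads one character by walking down from the start symbol. It is not the access structure of [21] and not the representation used by the CSTs cited above: the rule encoding, the function names and the toy grammar are all assumptions chosen for illustration.

```python
from typing import Dict, Tuple, Union

# Hypothetical encoding: a nonterminal maps either to one terminal character
# or to a pair (left, right) of nonterminals.
Rule = Union[str, Tuple[str, str]]


def expansion_lengths(rules: Dict[str, Rule], start: str) -> Dict[str, int]:
    """Length of the string generated by each nonterminal reachable from start."""
    lengths: Dict[str, int] = {}

    def length(sym: str) -> int:
        if sym not in lengths:
            rule = rules[sym]
            if isinstance(rule, str):
                lengths[sym] = 1
            else:
                lengths[sym] = length(rule[0]) + length(rule[1])
        return lengths[sym]

    length(start)
    return lengths


def access(rules: Dict[str, Rule], lengths: Dict[str, int], start: str, i: int) -> str:
    """Return character i (0-based) of the generated text without expanding it."""
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    sym = start
    while True:
        rule = rules[sym]
        if isinstance(rule, str):
            return rule
        left, right = rule
        if i < lengths[left]:
            sym = left           # the position falls inside the left child
        else:
            i -= lengths[left]   # skip the left child and descend to the right
            sym = right


if __name__ == "__main__":
    # Toy grammar that generates only the text "GATTAGATTA".
    rules: Dict[str, Rule] = {
        "g": "G", "a": "A", "t": "T",
        "X1": ("g", "a"), "X2": ("t", "t"),
        "X3": ("X1", "X2"), "X4": ("X3", "a"),
        "S": ("X4", "X4"),
    }
    lengths = expansion_lengths(rules, "S")
    print("".join(access(rules, lengths, "S", i) for i in range(lengths["S"])))
```

Each access follows a single root-to-leaf path, so its cost is bounded by the height of the grammar. A character is reached without any chain of earlier copies having to be resolved first.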
Lempel–Ziv compression cuts the collection into phrases, each of which has already appeared earlier in the collection. To extract the content of a phrase, one may have to recursively extract the content at that earlier position, following a possibly long chain of indirections. So far, the indexes built on Lempel–Ziv compression [23] or on combinations of Lempel–Ziv and grammar-based compression [24–26] support only pattern matching, which is just one of the wide range of functionalities offered by suffix trees. The high cost to access the data at random positions lies at the heart of the research on indexes built on Lempel–Ziv compression.

A simple way out of this limitation is the so-called relative Lempel–Ziv (RLZ) compression [27], where one of the sequences is represented in plain form and the others can only take phrases from that reference sequence. This enables immediate access for the symbols inside any copied phrase (as no transitive referencing exists) and, at least, if a good reference sequence has been found, offers compression competitive with the classical Lempel–Ziv. In our case, taking any random genome per species as the reference is good enough; more sophisticated techniques hav (...truncated)