A New Class of Searchable and Provably Highly Compressible String Transformations

Leibniz International Proceedings in Informatics, Jun 2019

The Burrows-Wheeler Transform (BWT) is a string transformation that plays a fundamental role in the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation beyond the domain of strings. However, efforts to find non-trivial alternatives to the original Burrows-Wheeler string transformation, now 25 years old, have met with limited success. In this paper we bring new life to this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear-time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context adaptive alphabet orderings, a concept introduced here. This more general class also includes the Alternating BWT, another invertible string transform recently introduced in connection with a generalization of Lyndon words.

Raffaele Giancarlo
University of Palermo, Dipartimento di Matematica e Informatica, Italy

Giovanni Manzini
University of Eastern Piedmont, Alessandria, Italy
IIT-CNR, Pisa, Italy

Giovanna Rosone
University of Pisa, Dipartimento di Informatica, Italy

Marinella Sciortino
University of Palermo, Dipartimento di Matematica e Informatica, Italy

2012 ACM Subject Classification: Theory of computation → Data compression; Mathematics of computing → Combinatorial algorithms

Keywords and phrases: Data Indexing and Compression, Burrows-Wheeler Transformation, Combinatorics on Words

Digital Object Identifier: 10.4230/LIPIcs.CPM.2019.12

Funding: GR and SM are partially supported by MIUR-SIR project CMACBioSeq “Combinatorial methods for analysis and compression of biological sequences” grant n. RBSI146R5L; RG and GM are partially supported by INdAM-GNCS project 2018 “Innovative methods for the solution of medical and biological big data” and MIUR-PRIN project “Multicriteria Data Structures and Algorithms: from compressed to learned indexes, and beyond” grant n. 2017WR7SHH.

© Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, and Marinella Sciortino; licensed under Creative Commons License CC-BY. 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Editors: Nadia Pisanti and Solon P. Pissis; Article No. 12; pp. 12:1–12:12. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

1 Introduction

The Burrows-Wheeler Transform [2] (BWT) is a string transformation that has had a revolutionary impact on the design of succinct or compressed data structures. It was originally proposed as a tool for text compression; shortly after its introduction it was shown [9] that, in addition to making it easier to represent a string in space close to its entropy, it also makes it easier to search for pattern occurrences in the original string. After this discovery, data transformations inspired by the BWT have been proposed for compactly representing and searching other combinatorial objects, such as trees, graphs, finite automata, and even string alignments. See [11] for an attempt to unify some of these results and [25] for an in-depth treatment of the field of compact data structures.

Going back to the original Burrows-Wheeler string transformation, we can summarize its salient features as follows:
1) it can be computed and inverted in linear time,
2) it produces strings which are provably compressible in terms of the high order entropy of the input,
3) it supports pattern search directly on the transformed string in time proportional to the pattern length.
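As a rough illustration of properties 1 and 3, the following Python sketch builds the BWT of a non-empty string by sorting its rotations, inverts it through the LF-mapping, and counts pattern occurrences by backward search. It is a naive sketch under stated assumptions, not the linear-time machinery the paper refers to: sorting whole rotations and computing ranks with `str.count` are both far from optimal, the function names (`bwt`, `inverse_bwt`, `count_occurrences`) are hypothetical, and because no end-of-string marker is used, occurrences are counted in the circular string.

```python
def bwt(s):
    """Return (last_column, row): the BWT of s and the row holding s itself."""
    n = len(s)
    # Sort rotation start positions by the rotation they denote (naive).
    rows = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = "".join(s[(i - 1) % n] for i in rows)
    return last, rows.index(0)


def inverse_bwt(last, row):
    """Recover the original string from its BWT and the row index."""
    n = len(last)
    # A stable sort of positions by symbol pairs each symbol of the last
    # column with the same symbol occurrence in the first column.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for first_row, last_pos in enumerate(order):
        lf[last_pos] = first_row
    out = []
    r = row
    for _ in range(n):
        out.append(last[r])  # symbols come out right to left
        r = lf[r]
    return "".join(reversed(out))


def count_occurrences(last, pattern):
    """Count circular occurrences of pattern via backward search."""
    first = sorted(last)
    smaller = {}  # smaller[c] = number of symbols in the text smaller than c
    for i, c in enumerate(first):
        smaller.setdefault(c, i)
    lo, hi = 0, len(last)  # current range of rows, half-open
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        lo = smaller[c] + last.count(c, 0, lo)  # naive rank query
        hi = smaller[c] + last.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `bwt("banana")` returns `("nnbaaa", 3)`, `inverse_bwt("nnbaaa", 3)` returns `"banana"`, and `count_occurrences("nnbaaa", "ana")` returns `2`. The search loop performs one step per pattern symbol; a practical index would replace the `str.count` calls with a constant-time rank structure.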
It is the combination of properties 1–3 that makes the BWT a fundamental tool for the design of compressed self-indices. In Section 2 we review these properties and also the many attempts to modify the original design. We recall, however, that despite more than twenty years of intense scrutiny, the only known non-trivial BWT variant that fully satisfies properties 1–3 is the Alternating BWT (ABWT). The ABWT was introduced in [13] in the field of combinatorics on words, and its basic algorithmic properties have been described in [15].

In this paper we introduce a whole new family of transformations that satisfy properties 1–3 and can therefore replace the BWT in the construction of compressed self-indices, with the same time efficiency as the original BWT and the potential of achieving better compression. We show that our family, supporting linear time computation, inversion, and search, is a special case of a much larger class of transformations that also satisfy properties 1–3 except that, in the general case, inversion and pattern search may take quadratic time. Our larger class also includes the BWT and the ABWT as special cases, and it therefore constitutes a natural candidate for the study of additional properties shared by all known BWT variants.

In more detail, in Section 3 we describe a class of string transformations based on context adaptive alphabet orderings. The main feature of this class is that, in the rotation sorting phase, we use alphabet orderings that depend on the context (i.e., the longest common prefix of the rotations being compared). In Section 4 we consider the subclass of transformations based on local orderings. In this subclass, the alphabet orderings depend only on a constant portion of the context. We prove that local ordering transformations can be inverted in linear time, and that pattern search in the transformed string takes time proportional to the pattern length.
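To make the idea of a context adaptive alphabet ordering concrete, here is a minimal Python sketch in which the comparison of two rotations is decided at their first mismatch, using an alphabet ordering chosen from their longest common prefix. The function `context_adaptive_transform` and the callback `rank_in_context` are hypothetical names introduced for illustration, the sorting is naive, and the alternating rule shown (standard order after an even-length context, reversed order after an odd-length one) is an assumption about how the ABWT ordering is usually defined rather than a definition taken from this text.

```python
from functools import cmp_to_key


def context_adaptive_transform(s, rank_in_context):
    """Sort the rotations of a non-empty string s with a context-dependent order.

    rank_in_context(context, symbol) returns a sortable rank for `symbol`
    given `context`, the longest common prefix of the two rotations being
    compared. Returns (last_column, row of s in the sorted matrix).
    """
    n = len(s)

    def rotation(i):
        return s[i:] + s[:i]

    def compare(i, j):
        a, b = rotation(i), rotation(j)
        k = 0
        while k < n and a[k] == b[k]:
            k += 1
        if k == n:  # identical rotations
            return 0
        context = a[:k]
        ra = rank_in_context(context, a[k])
        rb = rank_in_context(context, b[k])
        return -1 if ra < rb else 1

    rows = sorted(range(n), key=cmp_to_key(compare))
    last = "".join(s[(i - 1) % n] for i in rows)
    return last, rows.index(0)


def standard_order(context, symbol):
    # Same ordering in every context: this yields the ordinary BWT.
    return ord(symbol)


def alternating_order(context, symbol):
    # Ordering depends only on the parity of the context length.
    return ord(symbol) if len(context) % 2 == 0 else -ord(symbol)
```

With these definitions, `context_adaptive_transform("banana", standard_order)` returns `("nnbaaa", 3)`, while `context_adaptive_transform("banana", alternating_order)` returns `("bnnaaa", 3)`. An ordering that inspects only a bounded part of `context` would correspond, roughly, to the local orderings of Section 4.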
Thus, local ordering transformations have the same properties 1–3 that were so far the prerogative of the BWT and ABWT.

Having now at our disposal a wide class of string transformations with the same remarkable properties as the BWT, it is natural to use them to improve BWT-based data structures by selecting the one most suitable for the task. In this paper we initiate this study by considering the problem of selecting the BWT variant that minimizes the number of runs in the transformed string. The motivation is that data centers often store highly repetitive collections, such as genome databases, source code repositories, and versioned text collections. For such highly repetitive collections there is theoretical and practical evidence that the entropy underestimates the compressibility of the collection, and that much better compression ratios are obtained by exploiting runs of equal symbols in the BWT [4, 12, 18, 19, 21, 22, 23]. In Section 5 we show that, for constant size alphabet, for the most general class of transformations considered in this paper, the BWT variant that minimizes the number of runs can be found in linear time using a dynamic programming al (...truncated)
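The quantity being minimized, the number of runs of equal symbols, is easy to state in code. The Python sketch below counts runs and, purely as an illustration, tries every fixed (context-independent) alphabet ordering by brute force to see which one gives the fewest runs. This is not the linear-time dynamic programming approach mentioned above: it is exponential in the alphabet size and searches a much smaller space than the paper's general class. The names `count_runs`, `bwt_with_order`, and `best_global_order` are hypothetical.

```python
from itertools import groupby, permutations


def count_runs(t):
    """Number of maximal runs of equal symbols in t."""
    return sum(1 for _ in groupby(t))


def bwt_with_order(s, alphabet_order):
    """BWT of s where rotations are sorted under the given alphabet order."""
    rank = {c: r for r, c in enumerate(alphabet_order)}
    n = len(s)
    rows = sorted(range(n), key=lambda i: [rank[c] for c in s[i:] + s[:i]])
    return "".join(s[(i - 1) % n] for i in rows)


def best_global_order(s):
    """Brute force over fixed alphabet orderings; tiny alphabets only.

    Returns (runs, alphabet_order, transformed_string) with the fewest runs,
    breaking ties by the ordering string and then the transformed string.
    """
    best = None
    for order in permutations(sorted(set(s))):
        t = bwt_with_order(s, order)
        candidate = (count_runs(t), "".join(order), t)
        if best is None or candidate < best:
            best = candidate
    return best
```

For instance, `count_runs("nnbaaa")` is `3`, which is already the minimum possible for a string over three distinct symbols.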


This is a preview of a remote PDF: http://drops.dagstuhl.de/opus/volltexte/2019/10483/pdf/LIPIcs-CPM-2019-12.pdf
Article home page: http://drops.dagstuhl.de/opus/frontdoor.php?source_opus=10483

Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, Marinella Sciortino. A New Class of Searchable and Provably Highly Compressible String Transformations, Leibniz International Proceedings in Informatics, 2019, pp. 12:1-12:12, 128, DOI: 10.4230/LIPIcs.CPM.2019.12