A New Class of Searchable and Provably Highly Compressible String Transformations
Raffaele Giancarlo
University of Palermo, Dipartimento di Matematica e Informatica, Italy
Giovanni Manzini
University of Eastern Piedmont, Alessandria, Italy
IIT-CNR, Pisa, Italy
Giovanna Rosone
University of Pisa, Dipartimento di Informatica, Italy
Marinella Sciortino
University of Palermo, Dipartimento di Matematica e Informatica, Italy
Abstract
The Burrows-Wheeler Transform is a string transformation that plays a fundamental role in the
design of self-indexing compressed data structures. Over the years, researchers have successfully
extended this transformation outside the domain of strings. However, efforts to find non-trivial
alternatives to the original, now 25 years old, Burrows-Wheeler string transformation have met
with limited success. In this paper we bring new life to this area by introducing a whole new family of
transformations that have all the “myriad virtues” of the BWT: they can be computed and inverted
in linear time, they produce provably highly compressible strings, and they support linear time
pattern search directly on the transformed string. This new family is a special case of a more general
class of transformations based on context adaptive alphabet orderings, a concept introduced here.
This more general class also includes the Alternating BWT, another invertible string transform
recently introduced in connection with a generalization of Lyndon words.
2012 ACM Subject Classification Theory of computation → Data compression; Mathematics of
computing → Combinatorial algorithms
Keywords and phrases Data Indexing and Compression, Burrows-Wheeler Transformation, Combinatorics on Words
Digital Object Identifier 10.4230/LIPIcs.CPM.2019.12
Funding GR and MS are partially supported by MIUR-SIR project CMACBioSeq “Combinatorial
methods for analysis and compression of biological sequences” grant n. RBSI146R5L; RG and GM are
partially supported by INdAM-GNCS project 2018 “Innovative methods for the solution of medical
and biological big data” and MIUR-PRIN project “Multicriteria Data Structures and Algorithms:
from compressed to learned indexes, and beyond” grant n. 2017WR7SHH.
1 Introduction
The Burrows-Wheeler Transform [2] (BWT) is a string transformation that has had a revolutionary
impact on the design of succinct or compressed data structures. Originally proposed as a tool
for text compression, shortly after its introduction [9] it was shown that, in addition to
making it easier to represent a string in space close to its entropy, it also makes it easier to search
for pattern occurrences in the original string. After this discovery, data transformations
inspired by the BWT have been proposed for compactly representing and searching other
© Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, and Marinella Sciortino;
licensed under Creative Commons License CC-BY
30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019).
Editors: Nadia Pisanti and Solon P. Pissis; Article No. 12; pp. 12:1–12:12
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
combinatorial objects such as: trees, graphs, finite automata, and even string alignments.
See [11] for an attempt to unify some of these results and [25] for an in-depth treatment of
the field of compact data structures.
Going back to the original Burrows-Wheeler string transformation, we can summarize its
salient features as follows: 1) it can be computed and inverted in linear time, 2) it produces
strings which are provably compressible in terms of the high order entropy of the input, 3) it
supports pattern search directly on the transformed string in time proportional to the pattern
length. It is the combination of these three properties that makes the BWT a fundamental
tool for the design of compressed self-indices. In Section 2 we review these properties and
also the many attempts to modify the original design. However, we recall that, despite more
than twenty years of intense scrutiny, the only known non-trivial BWT variant that fully
satisfies properties 1–3 is the Alternating BWT (ABWT). The ABWT was introduced
in [13] in the field of combinatorics on words and its basic algorithmic properties were
described in [15].
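To make properties 1 and 3 concrete, the following Python sketch computes the BWT by naively sorting rotations, inverts it through the LF-mapping, and counts pattern occurrences by backward search. It is an illustration only and does not achieve the time bounds stated above: sorting rotations and counting symbols by scanning are both superlinear. We assume that the input ends with a unique sentinel `$` smaller than every other symbol, and the names `bwt`, `lf_tables`, `inverse_bwt`, and `count_occurrences` are ours.

```python
from collections import Counter

def bwt(text):
    """Last column of the sorted rotations of text (naive construction)."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)

def lf_tables(last):
    """Return (first_row, rank): first_row[c] is the first row whose rotation
    starts with c; rank[i] is the number of copies of last[i] in last[:i]."""
    counts = Counter(last)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    seen, rank = Counter(), []
    for c in last:
        rank.append(seen[c])
        seen[c] += 1
    return first_row, rank

def inverse_bwt(last):
    """Recover the text by following the LF-mapping from row 0.
    Assumption: row 0 is the rotation that starts with the sentinel '$'."""
    first_row, rank = lf_tables(last)
    row, out = 0, []
    for _ in range(len(last)):
        c = last[row]
        out.append(c)                      # symbols come out right to left
        row = first_row[c] + rank[row]     # LF step
    rev = "".join(reversed(out))           # '$' followed by the text without '$'
    return rev[1:] + rev[0]

def count_occurrences(last, pattern):
    """Backward search: one step per pattern symbol, right to left.
    The slices below stand in for a constant-time rank structure."""
    first_row, _ = lf_tables(last)
    lo, hi = 0, len(last)
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + last[:lo].count(c)
        hi = first_row[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo
```

For the input `banana$` the sketch yields `annb$aa`, from which `inverse_bwt` recovers `banana$`, and `count_occurrences` reports 2 occurrences of `ana`. Backward search performs one interval update per pattern symbol, which is what gives time proportional to the pattern length once the symbol counts are answered by a constant-time rank structure.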
In this paper we introduce a whole new family of transformations that satisfy properties
1–3 and can therefore replace the BWT in the construction of compressed self-indices with the
same time efficiency of the original BWT and the potential of achieving better compression.
We show that our family, supporting linear time computation, inversion, and search, is a
special case of a much larger class of transformations that also satisfy properties 1–3 except
that, in the general case, inversion and pattern search may take quadratic time. Our larger
class includes as special cases also the BWT and the ABWT and therefore it constitutes a
natural candidate for the study of additional properties shared by all known BWT variants.
In more detail, in Section 3 we describe a class of string transformations based on context
adaptive alphabet orderings. The main feature of the above class of transformations is that,
in the rotation sorting phase, we use alphabet orderings that depend on the context (i.e., the
longest common prefix of the rotations being compared). In Section 4 we consider the subclass
of transformations based on local orderings. In this subclass, the alphabet orderings only
depend on a constant portion of the context. We prove that local ordering transformations
can be inverted in linear time, and that pattern search in the transformed string takes time
proportional to the pattern length. Thus, these transformations have the same properties
1–3 that were so far the prerogative of the BWT and ABWT.
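The rotation sorting with context-dependent orderings outlined above can be sketched in Python as follows. This is a naive illustration under our own assumptions, not the construction of Section 3: the function `order_for` is a hypothetical callback that maps a context (the longest common prefix of the two rotations being compared) to a string listing the alphabet in increasing order. The three helpers are also ours: `standard` always returns the same ordering and so gives the BWT; `alternating` reverses the ordering when the context has odd length, which to our understanding corresponds to the alternating lexicographic order behind the ABWT; `local` chooses the ordering from the last `k` symbols of the context, where taking the last `k` symbols as the "constant portion" is an assumption made for this example.

```python
from functools import cmp_to_key

def transform(s, order_for):
    """Sort the rotations of s, comparing the first mismatching symbols with
    the alphabet ordering chosen by the common prefix (the context), and
    return the symbols preceding each sorted rotation."""
    n = len(s)

    def compare(i, j):
        for k in range(n):
            a, b = s[(i + k) % n], s[(j + k) % n]
            if a != b:
                context = "".join(s[(i + t) % n] for t in range(k))
                ranking = order_for(context)   # alphabet in increasing order
                return -1 if ranking.index(a) < ranking.index(b) else 1
        return 0                               # identical rotations

    starts = sorted(range(n), key=cmp_to_key(compare))
    return "".join(s[(i - 1) % n] for i in starts)

def standard(alphabet):
    """Same ordering for every context: the usual BWT."""
    return lambda context: alphabet

def alternating(alphabet):
    """Reverse the ordering when the context has odd length."""
    return lambda context: alphabet if len(context) % 2 == 0 else alphabet[::-1]

def local(table, k, default):
    """Ordering chosen by the last k >= 1 symbols of the context (assumption);
    contexts not listed in table fall back to the default ordering."""
    return lambda context: table.get(context[-k:], default)
```

For example, `transform("banana$", standard("$abn"))` coincides with the BWT of `banana$`, while `transform("banana$", local({"a": "nba$"}, 1, "$abn"))` uses the reversed ordering only after contexts ending in `a`. The comparison function here recomputes the context explicitly and is far from linear time; it is meant only to show where the context enters the sorting.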
Having now at our disposal a wide class of string transformations with the same remarkable
properties as the BWT, it is natural to use them to improve BWT-based data structures
by selecting the one most suitable for the task. In this paper we initiate this study by
considering the problem of selecting the BWT variant that minimizes the number of runs
in the transformed string. The motivation is that data centers often store highly repetitive
collections, such as genome databases, source code repositories, and versioned text collections.
For such highly repetitive collections there is theoretical and practical evidence that the
entropy underestimates the compressibility of the collection and much better compression
ratios are obtained by exploiting runs of equal symbols in the BWT [4, 12, 18, 19, 21, 22, 23]. In
Section 5 we show that, for constant-size alphabets, for the most general class of transformations
considered in this paper, the BWT variant that minimizes the number of runs can be found
in linear time using a dynamic programming al (...truncated)