Relative Suffix Trees
© The British Computer Society 2017.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution,
and reproduction in any medium, provided the original work is properly cited.
Advance Access publication on 21 November 2017
doi:10.1093/comjnl/bxx108
The Computer Journal, Vol. 61 No. 5, 2018. Section A: Computer Science Theory, Methods and Tools
ANDREA FARRUGGIA¹, TRAVIS GAGIE²,³, GONZALO NAVARRO²,⁴*,
SIMON J. PUGLISI⁵ AND JOUNI SIRÉN⁶
¹ Department of Computer Science, University of Pisa, Largo Bruno Pontecorvo 3, 56127 Pisa PI, Italy
² CeBiB—Center for Biotechnology and Bioengineering, Santiago, Chile
³ Escuela de Informática y Telecomunicaciones, Diego Portales University, Ejército 441, Santiago, Chile
⁴ Department of Computer Science, University of Chile, Beauchef 851, Santiago, Chile
⁵ Department of Computer Science, University of Helsinki, Helsinki, Finland
⁶ Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK
* Corresponding author:
Suffix trees are one of the most versatile data structures in stringology, with many applications in
bioinformatics. Their main drawback is their size, which can be tens of times larger than the input
sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while
using space proportional to a compressed representation of the sequence. In this work, we take a
new approach to compressed suffix trees for repetitive sequence collections, such as collections of
individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree
of a reference sequence. These relative data structures provide competitive time/space trade-offs,
being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.
Keywords: suffix trees; compressed text indexing; repetitive collections
Received 12 May 2017; revised 1 September 2017; editorial decision 16 October 2017;
Handling editor: Raphael Clifford
1. INTRODUCTION
The suffix tree [1] is one of the most powerful bioinformatic
tools to answer complex queries on DNA and protein sequences
[2–4]. A serious problem that hampers its wider use on large
genome sequences is its size, which may be 10–20 bytes per
character. In addition, the non-local access patterns required by
most interesting problems solved with suffix trees complicate
secondary-memory deployments. This problem has led to
numerous efforts to reduce the size of suffix trees by representing them using compressed data structures [5–17], leading to
compressed suffix trees (CST). Currently, the smallest CST is
the so-called fully compressed suffix tree (FCST) [10, 14],
which uses 5 bits per character (bpc) for DNA sequences, but
takes milliseconds to simulate suffix tree navigation operations.
At the other extreme, Sadakane’s CST [5, 11] uses about 12 bpc
and operates in microseconds, and even nanoseconds for the
simplest operations.
A space usage of 12 bpc may seem reasonable to handle, for
example, one human genome, which has about 3.1 billion
bases: it can be operated within a RAM of 4.5 GB (the
representation contains the sequence as well). However, as the
price of sequencing has fallen, sequencing the genomes of a
large number of individuals has become a routine activity. The
1000 Genomes Project [18] sequenced the genomes of several
thousand humans, while newer projects can be orders of magnitude larger. This has spurred the development of techniques for
storing and analyzing huge amounts of sequence data.
Just storing 1000 human genomes using a 12 bpc CST
requires almost 4.5 TB, which is much more than the amount
of memory available in a commodity server. Assuming that a
single server has 256 GB of memory, we would need a cluster
of 18 servers to handle such a collection of CSTs (compared
with over 100 with classical suffix tree implementations!). With
the smaller (and much slower) FCST, this would drop to 7–8
servers. It is clear that further space reductions in the representation of CSTs would lead to reductions in hardware, communication and energy costs when implementing complex searches
over large genomic databases.
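
The server counts above can be reproduced with a rough back-of-the-envelope calculation. The sketch below starts from the 4.5 GB per genome quoted for a 12 bpc CST and assumes, as a simplification, that the index size scales linearly with the bits per character; the function name `servers_needed` and the constants are illustrative only.

```python
import math

# Figures quoted in the text; everything else is derived from them.
GB_PER_GENOME_AT_12_BPC = 4.5   # one human genome under a 12 bpc CST
SERVER_MEMORY_GB = 256          # memory assumed for one commodity server
NUM_GENOMES = 1000


def servers_needed(bpc: float, genomes: int = NUM_GENOMES) -> int:
    """Estimate the number of servers, assuming size scales linearly with bpc."""
    gb_per_genome = GB_PER_GENOME_AT_12_BPC * bpc / 12
    total_gb = gb_per_genome * genomes
    # A partially filled server still counts as a whole server.
    return math.ceil(total_gb / SERVER_MEMORY_GB)


for name, bpc in [("Sadakane's CST", 12), ("FCST", 5)]:
    print(f"{name}: {bpc} bpc -> about {servers_needed(bpc)} servers")
```

The estimate ignores per-server overheads and any replication, so it should be read as a lower bound on the hardware actually required.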
An important characteristic of those large genome databases is that they usually consist of the genomes of individuals of the same or closely related species. This implies that
the collections are highly repetitive, that is, each genome can
be obtained by concatenating a relatively small number of
substrings of other genomes and adding a few new characters.
When repetitiveness is considered, much higher compression
rates can be obtained in CSTs. For example, it is possible to
reduce the space to 1–2 bpc (albeit with operation times in
the milliseconds) [13], or to 2–3 bpc with operation times in
the microseconds [15]. Using 2 bpc, our 1000 genomes could
be handled with just three servers with 256 GB of memory.
Compression algorithms best capture repetitiveness by
using grammar-based compression or Lempel–Ziv compression.¹ In the first case [19, 20], one finds a context-free grammar that generates (only) the text collection. Rather than
compressing the text directly, the current CSTs for repetitive
collections [13, 15] apply grammar-based compression on
the data structures that simulate the suffix tree. Grammar-based compression yields relatively easy direct access to the
compressed sequence [21], which makes it attractive compared to Lempel–Ziv compression [22], despite the latter
generally using less space.
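
To illustrate why a grammar gives relatively easy direct access, the following toy sketch builds a small grammar in which every nonterminal has exactly one rule, so it generates a single string. The grammar `RULES` and the helper names are hypothetical and chosen only for illustration; the example compresses a plain string rather than suffix tree components, and the descent shown here is a simple variant, not the access structure of [21].

```python
from functools import lru_cache

# Hypothetical toy grammar: each nonterminal expands to a pair of symbols.
# Any symbol that is not a key of RULES is a terminal character.
RULES = {
    "X1": ("A", "C"),     # AC
    "X2": ("X1", "G"),    # ACG
    "X3": ("X2", "X2"),   # ACGACG
    "X4": ("X3", "X1"),   # ACGACGAC
}


@lru_cache(maxsize=None)
def expansion_length(symbol: str) -> int:
    """Length of the string generated by a symbol."""
    if symbol not in RULES:
        return 1
    left, right = RULES[symbol]
    return expansion_length(left) + expansion_length(right)


def expand(symbol: str) -> str:
    """Fully decompress a symbol (used here only to check the result)."""
    if symbol not in RULES:
        return symbol
    left, right = RULES[symbol]
    return expand(left) + expand(right)


def char_at(symbol: str, i: int) -> str:
    """Return character i of the expansion without decompressing it."""
    while symbol in RULES:
        left, right = RULES[symbol]
        left_len = expansion_length(left)
        if i < left_len:
            symbol = left          # position falls in the left child
        else:
            i -= left_len          # position falls in the right child
            symbol = right
    return symbol


text = expand("X4")
print(text, [char_at("X4", i) for i in range(len(text))])
```

Each access walks a single root-to-leaf path in the parse tree, so its cost is bounded by the depth of the grammar rather than by the length of the text.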
Lempel–Ziv compression cuts the collection into phrases,
each of which has already appeared earlier in the collection.
To extract the content of a phrase, one may have to recursively extract the content at that earlier position, following a
possibly long chain of indirections. So far, the indexes built
on Lempel–Ziv compression [23] or on combinations of
Lempel–Ziv and grammar-based compression [24–26] support only pattern matching, which is just one of the wide
range of functionalities offered by suffix trees. The high cost
to access the data at random positions lies at the heart of the
research on indexes built on Lempel–Ziv compression.
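
The chain of indirections can be made concrete with a toy example. The sketch below uses a naive, quadratic-time greedy parser (an assumption made for brevity, not one of the indexes cited above) and an `access` function that counts how many times a position must be redirected to an earlier one before a literal character is reached. All names are hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional, Tuple


class Phrase(NamedTuple):
    start: int                # position of the phrase in the text
    length: int
    source: Optional[int]     # earlier position it copies from, or None
    literal: Optional[str]    # the character itself, for literal phrases


def lz_parse(text: str) -> List[Phrase]:
    """Greedy parse: each phrase is a new literal or the longest earlier match."""
    phrases, i, n = [], 0, len(text)
    while i < n:
        best_len, best_src = 0, None
        for j in range(i):                     # naive search over earlier starts
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            phrases.append(Phrase(i, 1, None, text[i]))
            i += 1
        else:
            phrases.append(Phrase(i, best_len, best_src, None))
            i += best_len
    return phrases


def access(phrases: List[Phrase], starts: List[int], pos: int) -> Tuple[str, int]:
    """Return (character at pos, number of indirections followed)."""
    hops = 0
    while True:
        p = phrases[bisect_right(starts, pos) - 1]   # phrase containing pos
        if p.source is None:
            return p.literal, hops
        pos = p.source + (pos - p.start)             # jump to the earlier copy
        hops += 1


text = "ACGTACGTACGAACGT"
phrases = lz_parse(text)
starts = [p.start for p in phrases]
for pos in range(len(text)):
    ch, hops = access(phrases, starts, pos)
    print(pos, ch, hops)
```

Because every source lies strictly before the phrase that copies it, the loop always terminates, but the number of hops is not bounded by any small constant in general.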
A simple way out of this limitation is the so-called relative
Lempel–Ziv (RLZ) compression [27], where one of the
sequences is represented in plain form and the others can
only take phrases from that reference sequence. This enables
immediate access for the symbols inside any copied phrase
(as no transitive referencing exists) and, at least if a good reference sequence has been found, offers compression competitive with the classical Lempel–Ziv. In our case, taking any
random genome per species as the reference is good enough;
more sophisticated techniques hav (...truncated)