Degenerate String Comparison and Applications
WABI
Mai Alzamel
Lorraine A. K. Ayad
Giulia Bernardini
Roberto Grossi
Costas S. Iliopoulos
Nadia Pisanti
Solon P. Pissis
Giovanna Rosone
0 Department of Informatics, Systems and Communication (DISCo), University of Milan-Bicocca, Italy
1 Department of Informatics, King's College London, UK
2 Department of Informatics, King's College London, UK and Department of Computer Science, King Saud University, KSA
3 Department of Computer Science, University of Pisa, Italy and ERABLE Team, INRIA, France
4 Department of Computer Science, University of Pisa, Italy
5 Department of Informatics, King's College London, UK
6 Department of Informatics, King's College London, UK
A generalised degenerate string (GD string) $\hat{S}$ is a sequence of $n$ sets of strings of total size $N$, where the $i$th set contains strings of the same length $k_i$ but this length can vary between different sets. We denote the sum of these lengths $k_0, k_1, \ldots, k_{n-1}$ by $W$. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width $W$ in a compact form. Our first result in this paper is an $O(N + M)$-time algorithm for deciding whether the intersection of two GD strings of total sizes $N$ and $M$, respectively, over an integer alphabet, is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. A similar result can be obtained by employing an automata-based approach but its cost is alphabet-dependent. We then apply our string comparison algorithm to compute palindromes in GD strings. We present an $O(\min\{W, n^2\}N)$-time algorithm for computing all palindromes in $\hat{S}$. Furthermore, we show a similar conditional lower bound for computing maximal palindromes in $\hat{S}$. Finally, proof-of-concept experimental results are presented using real protein datasets.
1 Partially supported by the project UNIPI PRA_2017_44 “Advanced computational methodologies for the analysis of biomedical data”.
2 Partially supported by the project UNIPI PRA_2017_44 “Advanced computational methodologies for the analysis of biomedical data”.
3 Partially supported by the project MIUR-SIR CMACBioSeq “Combinatorial methods for analysis and compression of biological sequences” grant n. RBSI146R5L and the project UNIPI PRA_2017_44 “Advanced computational methodologies for the analysis of biomedical data”.
4 Partially supported by the Royal Society project IE 161274 “Processing uncertain sequences: combinatorics and applications”.
5 Partially supported by the project MIUR-SIR CMACBioSeq “Combinatorial methods for analysis and compression of biological sequences” grant n. RBSI146R5L, the Royal Society project IE 161274 “Processing uncertain sequences: combinatorics and applications”, and the project UNIPI PRA_2017_44 “Advanced computational methodologies for the analysis of biomedical data”.
2012 ACM Subject Classification Theory of computation → Pattern matching
1 Introduction
A degenerate string (or indeterminate string) over an alphabet $\Sigma$ is a sequence of subsets of $\Sigma$. A great deal of research has been conducted on degenerate strings (see [1, 11, 20, 29, 32] and references therein). These types of uncertain sequences have been used extensively for flexible modelling of DNA sequences known as IUPAC-encoded DNA sequences [23].
In [19], the authors introduced a more general definition of degenerate strings: an elastic-degenerate string (ED string) $\tilde{S}$ over $\Sigma$ is a sequence of subsets of $\Sigma^*$ (see also network expressions [28]) with the aim of representing multiple genomic sequences [10]. That is, a set of $\tilde{S}$ does not, in general, contain only letters; it may also contain strings, including the empty string. In a few recent papers on this notion, the authors provided several algorithms for pattern matching; specifically, for finding all exact [17] and approximate [8] occurrences of a standard string pattern in an ED text.
We introduce here another special type of uncertain sequence called generalised degenerate string; this can be viewed as an extension of degenerate strings or as a restricted variant of ED strings. Formally, a generalised degenerate string (GD string) $\hat{S}$ over $\Sigma$ is a sequence of $n$ sets of strings over $\Sigma$ of total size $N$, where the $i$th set contains strings of the same length $k_i > 0$ but this length can vary between different sets. We denote the sum of these lengths $k_0, k_1, \ldots, k_{n-1}$ by $W$. Thus a GD string can be used to represent a gapless multiple sequence alignment (MSA) of fixed width, that is, for example, a high-scoring local alignment of multiple sequences, in a compact form; see Figure 1. This type of alignment is used for finding functional sequence elements [14]. For instance, searching for palindromic motifs in these types of alignments is an important problem since many transcription factors bind as homodimers to palindromes [26]. Specifically, a set of virus species can be clustered using high-scoring MSA to obtain subsets of viruses that have a common hairpin structure [27].
Figure 1.
AGCTCTATCTCG
AGCCGAAGCTCG
AAGTCAACGCAG
(a) Multiple sequence alignment. (b) Local gapless alignment.
$\hat{S} = \{A\} \cdot \{GC, AG\} \cdot \{TCT, CGA, TCA\} \cdot \{A\} \cdot \{TCTC, GCTC, CGCA\} \cdot \{G\}$
(c) GD string obtained from the local gapless alignment.
Our motivation for this paper comes from finding palindromes in these types of uncertain sequences. Let us start off with standard strings. A palindrome is a sequence that reads the same from left to right and from right to left. Detection of palindromic factors in texts is a classical and well-studied problem in algorithms on strings and combinatorics on words, with many variants arising out of different practical scenarios. In molecular biology, for instance, palindromic sequences are extensively studied: they are often distributed around promoters, introns, and untranslated regions, playing important roles in gene regulation and other cell processes (e.g. see [4]). In particular these are strings of the form $X\bar{X}^R$, also known as complemented palindromes, occurring in single-stranded DNA or, more commonly, in RNA, where $X$ is a string and $\bar{X}^R$ is the reverse complement of $X$. In DNA, C-G are complements and A-T are complements; in RNA, C-G are complements and A-U are complements.
A string $X = X[0]X[1] \ldots X[n-1]$ is said to have an initial palindrome of length $k$ if its prefix of length $k$ is a palindrome. Manacher first discovered an on-line algorithm that finds all initial palindromes in a string [25]. Later Apostolico et al. observed that the algorithm given by Manacher is able to find all maximal palindromic factors in the string in $O(n)$ time [6]. Gusfield gave an off-line linear-time algorithm to find all maximal palindromes in a string and also discussed the relation between biological sequences and gapped palindromes [18].
For uncertain sequences, we first need to have an algorithm for efficient string comparison, where automata provide the following baseline. Let $\hat{X}$ and $\hat{Y}$ be two GD (or two ED) strings of total sizes $N$ and $M$, respectively. We first build the non-deterministic finite automaton (NFA) $A$ of $\hat{X}$ and the NFA $B$ of $\hat{Y}$ in time $O(N + M)$. We then construct the product NFA $C$ such that $L(C) = L(A) \cap L(B)$ in time $O(NM)$. The non-emptiness decision problem, namely, checking if $L(C) \neq \emptyset$, is decidable in time linear in the size of $C$, using breadth-first search (BFS). Hence the comparison of $\hat{X}$ and $\hat{Y}$ can be done in time $O(NM)$. It is known that if there existed faster methods for obtaining the automata intersection, then significant improvements would be implied to many long-standing open problems [24]. Hence an immediate reduction to the problem of NFA intersection does not particularly help. For GD strings we show at the beginning of Section 3 that we can build an ad-hoc deterministic finite automaton (DFA) for $\hat{X}$ and $\hat{Y}$, so that the intersection can be performed efficiently, but this simple solution cannot achieve $O(N + M)$ time as its cost is alphabet-dependent.
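The automata baseline can be illustrated with a minimal Python sketch. It is not the authors' implementation: the list-of-lists encoding of a GD string, the state triples `(letter, string, offset)` and the function names `moves` and `intersect_nonempty_bfs` are assumptions made for this example. The search explores pairs of states of the two implicit NFAs, so it may visit $O(NM)$ pairs.

```python
from collections import deque

def moves(gd, state):
    """Transitions of the implicit NFA of a GD string.

    A state is (i, j, o): inside string j of letter i after reading o characters,
    or (i, None, 0): at the boundary just before letter i.
    Returns a list of (character, next_state) pairs.
    """
    i, j, o = state
    if i == len(gd):                      # final state: no outgoing transitions
        return []
    choices = range(len(gd[i])) if j is None else [j]
    out = []
    for jj in choices:
        word = gd[i][jj]
        nxt = (i + 1, None, 0) if o + 1 == len(word) else (i, jj, o + 1)
        out.append((word[o], nxt))
    return out

def intersect_nonempty_bfs(r, s):
    """BFS over the product of the two NFAs; True iff the languages intersect."""
    start = ((0, None, 0), (0, None, 0))
    goal = ((len(r), None, 0), (len(s), None, 0))
    seen, queue = {start}, deque([start])
    while queue:
        a, b = queue.popleft()
        if (a, b) == goal:
            return True
        for ca, na in moves(r, a):
            for cb, nb in moves(s, b):
                if ca == cb and (na, nb) not in seen:
                    seen.add((na, nb))
                    queue.append((na, nb))
    return False
```

For instance, `intersect_nonempty_bfs([["AB", "CD"], ["E"]], [["A", "C"], ["BE", "DE", "BX"]])` explores the product states reachable on the common strings ABE and CDE.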
Our Contribution. Our first result in this paper is an $O(N + M)$-time algorithm for deciding whether the intersection of two GD strings of sizes $N$ and $M$, respectively, over an integer alphabet is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. An automata model of computation can also be employed to obtain these results but we present here an efficient implementation in the standard word RAM model with word size $w = \Omega(\log(N + M))$ that works also for integer alphabets. We then apply our string comparison tool to compute palindromes in GD strings. We present an $O(\min\{W, n^2\}N)$-time algorithm for computing all palindromes in $\hat{S}$. Furthermore, we show a non-trivial $\Omega(n^2|\Sigma|)$ lower bound under the Strong Exponential Time Hypothesis [21, 22] for computing all maximal palindromes. Note that there exists an infinite family of GD strings over an integer alphabet of size $|\Sigma| = \Theta(N)$ on which our algorithm requires time $O(n^2 N)$, thus matching the conditional lower bound. Finally, proof-of-concept experimental results are presented using real protein datasets; specifically, on applying our tools to find the location of palindromes in immunoglobulin genes of the human V regions.
2 Preliminaries
An alphabet $\Sigma$ is a non-empty finite set of letters of size $\sigma = |\Sigma|$. A string $X$ on an alphabet $\Sigma$ is a sequence of elements of $\Sigma$. The set of all strings on an alphabet $\Sigma$, including the empty string $\varepsilon$ of length 0, is denoted by $\Sigma^*$. For any string $X$, we denote by $X[i \ldots j]$ the substring or factor of $X$ that starts at position $i$ and ends at position $j$. In particular, $X[0 \ldots j]$ is the prefix of $X$ that ends at position $j$, and $X[i \ldots |X| - 1]$ is the suffix of $X$ that starts at position $i$, where $|X|$ denotes the length of $X$. The suffix tree of $X$ (generalised suffix tree for a set of strings) is a compact trie representing all suffixes of $X$. We denote the reversal of $X$ by string $X^R$, i.e. $X^R = X[|X| - 1]X[|X| - 2] \ldots X[0]$.
A string $P$ is said to be a palindrome if and only if $P = P^R$. If factor $X[i \ldots j]$, $0 \le i \le j \le n - 1$, of string $X$ of length $n$ is a palindrome, then $\frac{i+j}{2}$ is the center of $X[i \ldots j]$ in $X$ and $\frac{j-i+1}{2}$ is the radius of $X[i \ldots j]$. In other words, a palindrome is a string that reads the same forward and backward, i.e. a string $P$ is a palindrome if $P = Y a Y^R$ where $Y$ is a string, $Y^R$ is the reversal of $Y$ and $a$ is either a single letter or the empty string. Moreover, $X[i \ldots j]$ is called a palindromic factor of $X$. It is said to be a maximal palindrome if there is no other palindrome in $X$ with center $\frac{i+j}{2}$ and larger radius. Hence $X$ has exactly $2n - 1$ maximal palindromes. A maximal palindrome $P$ of $X$ can be encoded as a pair $(c, r)$, where $c$ is the center of $P$ in $X$ and $r$ is the radius of $P$.
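The $(c, r)$ encoding can be made concrete with a short Python sketch. It uses plain center expansion, which takes quadratic time in the worst case and is not Manacher's linear-time algorithm; the function name `maximal_palindromes` is an assumption of this example. A center between two letters that admits no palindrome is reported with radius 0, so that all $2n - 1$ centers appear.

```python
def maximal_palindromes(x):
    """Return {center: radius} for the 2n-1 centers of x (quadratic-time sketch)."""
    n = len(x)
    result = {}
    for c2 in range(2 * n - 1):            # c2 = 2 * center = i + j
        i, j = c2 // 2, (c2 + 1) // 2      # innermost candidate factor x[i..j]
        # Expand while x[i..j] is still a palindrome.
        while i >= 0 and j < n and x[i] == x[j]:
            i -= 1
            j += 1
        # The maximal palindrome at this center is x[i+1..j-1].
        length = j - i - 1
        result[c2 / 2] = length / 2        # radius = (j - i + 1) / 2 of that factor
    return result
```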
Definition 1. A generalised degenerate string (GD string) $\hat{S} = \hat{S}[0]\hat{S}[1] \ldots \hat{S}[n-1]$ of length $n$ over an alphabet $\Sigma$ is a finite sequence of $n$ degenerate letters. Every degenerate letter $\hat{S}[i]$ of width $k_i > 0$, denoted also by $w(\hat{S}[i])$, is a finite non-empty set of strings $\hat{S}[i][j] \in \Sigma^{k_i}$, with $0 \le j < |\hat{S}[i]|$. For any GD string $\hat{S}$, we denote by $\hat{S}[i] \ldots \hat{S}[j]$ the GD substring of $\hat{S}$ that starts at position $i$ and ends at position $j$.
Definition 2. The total size $N$ and total width $W$, denoted also by $w(\hat{S})$, of a GD string $\hat{S}$ are respectively defined as $N = \sum_{i=0}^{n-1} |\hat{S}[i]| \times k_i$ and $W = \sum_{i=0}^{n-1} k_i$.
In this work, we generally consider GD strings over an integer alphabet of size $\sigma = N^{O(1)}$.
Example 3. The GD string $\hat{S}$ of Figure 1(c) has length $n = 6$, size $N = 28$, and $W = 12$.
Definition 4. Given two degenerate letters $\hat{X}$ and $\hat{Y}$, their Cartesian concatenation is
$\hat{X} \otimes \hat{Y} = \{xy \mid x \in \hat{X}, y \in \hat{Y}\}$.
When $\hat{Y} = \emptyset$ (resp. $\hat{X} = \emptyset$) we set $\hat{X} \otimes \hat{Y} = \hat{X}$ (resp. $= \hat{Y}$). Notice that $\otimes$ is associative.
Definition 5. Consider a GD string $\hat{S}$ of length $n$. The language of $\hat{S}$ is
$L(\hat{S}) = \hat{S}[0] \otimes \hat{S}[1] \otimes \cdots \otimes \hat{S}[n-1]$.
Given two GD strings $\hat{R}$ and $\hat{S}$ of equal total width, the intersection of their languages is defined by $L(\hat{R}) \cap L(\hat{S})$.
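Definitions 4 and 5 translate directly into a small Python sketch that enumerates a language explicitly. The function names `cartesian` and `language` and the list-of-lists encoding are assumptions of this example; since the enumeration can be exponential in the input size, it is meant for small inputs only.

```python
def cartesian(xs, ys):
    """Cartesian concatenation of two sets of strings (Definition 4)."""
    if not xs:                 # convention: the empty set acts as a neutral element
        return set(ys)
    if not ys:
        return set(xs)
    return {x + y for x in xs for y in ys}

def language(gd):
    """L(gd) as an explicit set of strings (Definition 5)."""
    lang = set()
    for letter in gd:
        lang = cartesian(lang, letter)
    return lang

R = [["AB", "CD"], ["E"]]
S = [["A", "C"], ["BE", "DE", "BX"]]
common = language(R) & language(S)   # intersection of the two languages
```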
Definition 6. Let $\hat{X} = \{x_i \in \Sigma^k\}$ and $\hat{Y} = \{y_j \in \Sigma^h\}$ be two degenerate letters on alphabet $\Sigma$. Further let us assume without loss of generality that $\hat{Y}$ is the set that contains the shorter strings (i.e. $h \le k$). We define the chop of $\hat{X}$ and $\hat{Y}$ and the active suffixes of $\hat{X}$ and $\hat{Y}$ as follows:
$chop_{\hat{X},\hat{Y}} = \{y_j \in \hat{Y} \mid y_j \text{ matches a prefix of } x_i \in \hat{X}\}$
$active_{\hat{X},\hat{Y}} = \{x_i[h \ldots k-1] \mid x_i[0 \ldots h-1] \in chop_{\hat{X},\hat{Y}}\}$
Let $w(chop_{\hat{X},\hat{Y}}) = \min\{w(\hat{X}), w(\hat{Y})\}$. When $active_{\hat{X},\hat{Y}} = \{\varepsilon\}$, we set $active_{\hat{X},\hat{Y}} = \emptyset$. We then have that $active_{\hat{X},\hat{Y}} = \emptyset$ either if $h = k$ or if there is no match between any of the strings in $\hat{Y}$ and the prefix of a string in $\hat{X}$; i.e. $chop_{\hat{X},\hat{Y}} = \emptyset$.
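The chop and the active suffixes can be computed with Python sets, as in the following sketch. The function name `chop_and_active` and the sample letters are hypothetical, chosen only for illustration; the function swaps its arguments when needed so that the second one holds the shorter strings, which means it does not record which of the two letters the active suffixes came from.

```python
def chop_and_active(xs, ys):
    """Return (chop, active) for two degenerate letters given as sets of strings."""
    if len(next(iter(xs))) < len(next(iter(ys))):
        xs, ys = ys, xs                      # make ys the letter with shorter strings
    h = len(next(iter(ys)))
    prefixes = {x[:h] for x in xs}
    chop = {y for y in ys if y in prefixes}  # strings of ys matching a prefix in xs
    active = {x[h:] for x in xs if x[:h] in chop}
    if active == {""}:                       # equal widths: no active suffixes remain
        active = set()
    return chop, active

# Hypothetical letters, not taken from the paper.
chop, active = chop_and_active({"CATTA", "TCCAC", "GGGGG"}, {"CAT", "TCC", "GCA"})
```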
Example 7. Consider the following degenerate letters $\hat{X}$ and $\hat{Y}$ where $w(\hat{Y}) < w(\hat{X})$. The underlined strings in letter $\hat{Y}$ are prefixes of strings in letter $\hat{X}$, hence they are in $chop_{\hat{X},\hat{Y}}$. The suffixes of such strings in $\hat{X}$ are the active suffixes in $active_{\hat{X},\hat{Y}}$.
Xˆ = TATCTCCCCTAGACA Yˆ = TCGCACCTA chopXˆ,Yˆ = (CTACTC) activeXˆ,Yˆ = (TAAC)
CATTA
Definition 8. Let $\hat{R}$ and $\hat{S}$ be two GD strings of length $r$ and $s$, respectively. $\hat{R}[0] \ldots \hat{R}[i]$ is the prefix of $\hat{R}$ that ends at position $i$. It is called proper if $i \neq r - 1$. We say that $\hat{R}[0] \ldots \hat{R}[i]$ is synchronized with $\hat{S}[0] \ldots \hat{S}[j]$ if $w(\hat{R}[0] \ldots \hat{R}[i]) = w(\hat{S}[0] \ldots \hat{S}[j])$. We call these the shortest synchronized prefixes of $\hat{R}$ and $\hat{S}$, respectively, when $w(\hat{R}[0] \ldots \hat{R}[i']) \neq w(\hat{S}[0] \ldots \hat{S}[j'])$ for all $i' < i$, $j' < j$.
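Since prefix widths are strictly increasing, the shortest synchronized prefixes can be located with a two-pointer scan over the letter widths. The sketch below assumes the list-of-lists encoding of GD strings; the function name `shortest_synchronized_prefixes` is hypothetical.

```python
def shortest_synchronized_prefixes(r, s):
    """Return (i, j) with w(R[0..i]) == w(S[0..j]) minimal, or None if none exists."""
    i = j = 0
    wr, ws = len(r[0][0]), len(s[0][0])   # widths of the current prefixes
    while wr != ws:
        if wr < ws:                       # extend the narrower prefix by one letter
            i += 1
            if i == len(r):
                return None
            wr += len(r[i][0])
        else:
            j += 1
            if j == len(s):
                return None
            ws += len(s[j][0])
    return i, j
```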
3 GD String Comparison
In this section, we consider the fundamental problem of GD string comparison. Let $\hat{R}$ and $\hat{S}$ be of total size $N$ and $M$, respectively. We provide an $O(N + M)$-time algorithm in the standard word RAM model with word size $w = \Omega(\log(N + M))$ that works also for integer alphabets.
Before presenting our efficient implementation, we observe that there is the following simple algorithm based on DFAs. Each degenerate letter of $\hat{R}$ and $\hat{S}$ can be represented by a trie, where its leaves are collapsed to a single one. For every two consecutive degenerate letters, the collapsed leaves of the former trie coincide with the root of the latter trie. An acyclic DFA is obtained in this way, as illustrated in Appendix A. We can perform the comparison of $\hat{R}$ and $\hat{S}$ by intersecting their corresponding DFAs using BFS on their product DFA. The trivial upper bound on the number of reachable states is $O(NM)$, but this can be improved to $O(N + M)$ by exploiting the structure of the two input DFAs. Each state in such a DFA has a unique level: the common length of paths from the initial state; and this structure is inherited by the product DFA. In other words, a level-$i$ state in the product DFA corresponds to a pair of level-$i$ states in the input DFAs. Observe that a level-$i$ state in one DFA is uniquely represented by the label of the path from the root of its trie, and for a fixed DFA and level, these labels have uniform lengths. Considering the two states composing a reachable state in the product DFA, it is easy to see that the shorter label must be a suffix of the longer label. Hence, the state in the DFA with longer labels at level $i$ uniquely determines the state in the DFA with shorter labels at level $i$. Consequently, the number of reachable level-$i$ states in the product DFA is bounded by the number of level-$i$ states in the input DFAs, and the size is $O(N + M)$.
We observe that the cost of implementing the above ideas has an extra logarithmic factor due to state branching and, moreover, GD string comparisons require building the DFAs each time. We show how to obtain $O(N + M)$ time for integer alphabets, without creating DFAs. We show that, even if the size of $L(\hat{R}) \cap L(\hat{S})$ can be exponential in the total sizes of $\hat{R}$ and $\hat{S}$ (Fact 9), the problem of GD string comparison, i.e. deciding whether $L(\hat{R}) \cap L(\hat{S})$ is non-empty, can be solved in time linear with respect to the sum of the total sizes of the two GD strings (Theorem 17) and is thus of independent interest.
Fact 9. Given two GD strings $\hat{R}$ and $\hat{S}$, $L(\hat{S}) \cap L(\hat{R})$ can have size exponential in the total sizes of $\hat{R}$ and $\hat{S}$.
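One simple witness for this fact, not necessarily the construction the authors have in mind, is the pair $\hat{R} = \hat{S} = \{a, b\}^n$: each GD string has total size $2n$ while the intersection contains $2^n$ strings. The following Python sketch checks this for $n = 10$ by explicit enumeration; the helper name `language` is an assumption of this example.

```python
from itertools import product

def language(gd):
    """Enumerate L(gd) explicitly (exponential, for small inputs only)."""
    return {"".join(parts) for parts in product(*gd)}

n = 10
R = [["a", "b"] for _ in range(n)]        # n degenerate letters {a, b}
S = [["a", "b"] for _ in range(n)]
total_size = sum(len(letter) * len(letter[0]) for letter in R)   # N = 2n
intersection_size = len(language(R) & language(S))               # 2^n strings
```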
We next show when it is possible to factorize $L(\hat{R}) \cap L(\hat{S})$ into a Cartesian concatenation.
Lemma 10. Consider two GD strings $\hat{S} = \hat{S}'\hat{S}''$ and $\hat{R} = \hat{R}'\hat{R}''$ such that $w(\hat{S}) = w(\hat{R})$. If $\hat{S}'$ is synchronized with $\hat{R}'$, then $L(\hat{R}) \cap L(\hat{S}) = (L(\hat{R}') \cap L(\hat{S}')) \otimes (L(\hat{R}'') \cap L(\hat{S}''))$.
Proof. It is clear that $L(\hat{S}) \cap L(\hat{R}) \supseteq (L(\hat{R}') \cap L(\hat{S}')) \otimes (L(\hat{S}'') \cap L(\hat{R}''))$. Indeed, consider a string $x \in L(\hat{R}') \cap L(\hat{S}')$ and a string $y \in L(\hat{S}'') \cap L(\hat{R}'')$: then, by the definition of Cartesian concatenation, $xy \in L(\hat{R}') \otimes L(\hat{R}'') = L(\hat{R})$ and $xy \in L(\hat{S}') \otimes L(\hat{S}'') = L(\hat{S})$.
We now prove the opposite inclusion. Consider a string $z \in L(\hat{S}) \cap L(\hat{R})$. By definition, $z = x_0 x_1 \ldots x_{r-1} = y_0 y_1 \ldots y_{s-1}$, with $x_i \in \hat{R}[i]$, $y_j \in \hat{S}[j]$, $\forall\, 0 \le i \le r - 1$, $\forall\, 0 \le j \le s - 1$. Let $\hat{R}' = \hat{R}[0] \ldots \hat{R}[i]$, $\hat{S}' = \hat{S}[0] \ldots \hat{S}[j]$. Assume by contradiction that $z \notin (L(\hat{R}') \cap L(\hat{S}')) \otimes (L(\hat{S}'') \cap L(\hat{R}''))$: without loss of generality, $x_0 \ldots x_i \notin L(\hat{S}')$. Since $L(\hat{S}') \otimes L(\hat{S}'') = L(\hat{S})$, it follows that $z = x_0 x_1 \ldots x_{r-1} \notin L(\hat{S}) \implies z \notin L(\hat{S}) \cap L(\hat{R})$, which is a contradiction. ∎
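The factorization of Lemma 10 can be checked by brute force on a small instance. The GD strings below are hypothetical, and the helper names `language` and `concat` are assumptions of this sketch. Note that `concat` is the plain product of two string sets, so an empty factor yields an empty result.

```python
from itertools import product

def language(gd):
    """Enumerate L(gd) explicitly (exponential, for small inputs only)."""
    return {"".join(parts) for parts in product(*gd)}

def concat(xs, ys):
    """Plain product of two sets of strings."""
    return {x + y for x in xs for y in ys}

# Hypothetical instance: R' = R1 and S' = S1 both have width 2, so they are synchronized.
R1, R2 = [["AB", "CD"]], [["E", "F"], ["GH"]]
S1, S2 = [["A", "C"], ["B", "D"]], [["EGH", "FGX"]]

lhs = language(R1 + R2) & language(S1 + S2)
rhs = concat(language(R1) & language(S1), language(R2) & language(S2))
```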
By applying Lemma 10 wherever $\hat{R}$ and $\hat{S}$ have synchronized prefixes, we are then left with the problem of intersecting GD strings with no synchronized proper prefixes. We now define an alternative decomposition within such strings (see also Example 12).
Definition 11. Let $\hat{R}$ and $\hat{S}$ be two GD strings of length $r$ and $s$, respectively, with no synchronized proper prefixes. We define
$$\text{c-chain}(\hat{R}, \hat{S}) = \max_q \{0 \le q \le r + s - 2 \mid chop_q \neq \emptyset\},$$
where $chop_i$ denotes the set $chop_{\hat{A}_i,\hat{B}_i}$, and $(\hat{A}_0, \hat{B}_0), (\hat{A}_1, \hat{B}_1), \ldots, (\hat{A}_q, \hat{B}_q)$, $pos(\hat{A}_i)$, $pos(\hat{B}_i)$ are recursively defined as follows:
$\hat{A}_0 = \hat{R}[0]$, $\hat{B}_0 = \hat{S}[0]$, and $pos(\hat{A}_0) = pos(\hat{B}_0) = 0$. For $0 < i \le r + s - 2$, if $chop_{i-1} \neq \emptyset$,
$$\hat{A}_i = \begin{cases} \hat{R}[pos(\hat{A}_{i-1}) + 1] \text{ and } pos(\hat{A}_i) = pos(\hat{A}_{i-1}) + 1 & \text{if } w(chop_{i-1}) = w(\hat{A}_{i-1}) \\ active_{\hat{A}_{i-1},\hat{B}_{i-1}} \text{ and } pos(\hat{A}_i) = pos(\hat{A}_{i-1}) & \text{otherwise} \end{cases}$$
$$\hat{B}_i = \begin{cases} \hat{S}[pos(\hat{B}_{i-1}) + 1] \text{ and } pos(\hat{B}_i) = pos(\hat{B}_{i-1}) + 1 & \text{if } w(chop_{i-1}) = w(\hat{B}_{i-1}) \\ active_{\hat{A}_{i-1},\hat{B}_{i-1}} \text{ and } pos(\hat{B}_i) = pos(\hat{B}_{i-1}) & \text{otherwise} \end{cases}$$
The generation of pairs $(\hat{A}_i, \hat{B}_i)$ stops at $i = q$ either if $q = r + s - 2$, or when $chop_{q+1} = \emptyset$, in which case $\hat{R}$ and $\hat{S}$ only match until $(\hat{A}_q, \hat{B}_q)$. Intuitively, $\hat{A}_i$ (respectively, $\hat{B}_i$) represents suffixes of the current position of $\hat{R}$ (respectively, of $\hat{S}$), while $pos(\hat{A}_i)$ (respectively, $pos(\hat{B}_i)$) tells which position of $\hat{R}$ (respectively, $\hat{S}$) we are chopping.
Example 12 (Definition 11). Consider the following GD strings $\hat{R}$ and $\hat{S}$ with no synchronized proper prefixes: $chop_0$ is the first red set from the left, $chop_1$ is the first blue one, $chop_2$ is the second red one, etc. The c-chain$(\hat{R}, \hat{S})$ terminates when $q = 7$.
[Figure: the GD strings $\hat{R}$ and $\hat{S}$ with the sets $\hat{A}_0, \ldots, \hat{A}_7$ and $\hat{B}_0, \ldots, \hat{B}_7$ marked; the layout is not recoverable from the extracted text.]
Definition 13. Let $\hat{R}$ and $\hat{S}$ be two GD strings of length $r$ and $s$, respectively, with $w(\hat{R}) = w(\hat{S})$ and no synchronized proper prefixes. We define $G_{\hat{R},\hat{S}}$ as a directed acyclic graph with a structure of up to $r + s - 1$ levels, each node being a set of strings, as follows, where we assume without loss of generality that $w(\hat{R}[0]) > w(\hat{S}[0])$:
Level $k = 0$: consists of a single node:
$n_0 = \{x \in \hat{R}[0] \mid x = y_0 \ldots y_{q_0} \text{ with } y_j \in chop_j\ \forall j : 0 \le j \le q_0\}$, where $q_0$ is the index of the rightmost chop containing suffixes of $\hat{R}[0]$.
Level $k > 0$: consists of $\ell = |chop_{q_{k-1}}|$ nodes. Assuming without loss of generality that level $k - 1$ has been built with suffixes of $\hat{R}[pos(\hat{A}_{q_{k-1}})]$, level $k$ contains suffixes of a position of $\hat{S}$. Let $c_0, \ldots, c_{\ell-1}$ denote the elements of $chop_{q_{k-1}}$. Then, for $0 \le i \le \ell - 1$, the $i$th node of level $k$ is:
$n_i = \{y_{q_{k-1}+1} \ldots y_{q_k} \mid c_i\, y_{q_{k-1}+1} \ldots y_{q_k} \in \hat{B}_{q_{k-1}} \text{ with } y_j \in chop_j\ \forall j : q_{k-1} + 1 \le j \le q_k\}$, where $q_k$ is the index of the rightmost chop containing suffixes of $\hat{S}[pos(\hat{B}_{q_{k-1}})]$.
Every string in level $k - 1$ whose suffix is $c_i$ is the source of an edge having the whole node $n_i$ as a sink.
We define $paths(G_{\hat{R},\hat{S}})$ as the set of strings spelled by a path in $G_{\hat{R},\hat{S}}$ that starts at $n_0$ and ends at the last level.
Note that the size of $G_{\hat{R},\hat{S}}$ is at most linear in the sum of the sizes of $\hat{R}$ and $\hat{S}$, as the nodes contain strings either in $\hat{R}$ or in $\hat{S}$ with no duplications, and each node has out-degree equal to the number of strings it contains.
Example 14 (Definition 13). Consider $G_{\hat{R},\hat{S}}$ for the GD strings $\hat{R}$, $\hat{S}$ of Example 12 [graph drawing not recoverable from the extracted text]: $q_0 = 2$ and the strings in level 0 belong to $(chop_0 \otimes chop_1 \otimes chop_2) \cap \hat{R}[0]$. Level 1 contains suffixes of strings in $\hat{B}_2$ (and of strings in $\hat{B}_3$ as $chop_3 = \{A, T\}$ and indeed $q_1 = 3$), level 2 suffixes of strings in $\hat{A}_3$ (as $q_2 = 5$), level 3 suffixes of strings in $\hat{B}_5$ ($q_3 = 6$), level 4 suffixes of strings in $\hat{A}_6$ ($q_4 = 7$). The three paths from level 0 to level 4 correspond to the three strings in $L(\hat{R}) \cap L(\hat{S})$: AGCCGAATCTCG, AAGTCAATCTCG, AAGTCTAGCTCG.
Let $G^k_{\hat{R},\hat{S}}$ be $G_{\hat{R},\hat{S}}$ truncated at level $k$, and let $|G^k_{\hat{R},\hat{S}}|$ be the length of the strings it spells. Let $L_k(\hat{S})$ denote the set of prefixes of length $|G^k_{\hat{R},\hat{S}}|$ of $L(\hat{S})$.
Lemma 15. Let $\hat{R}$, $\hat{S}$ be two GD strings with $w(\hat{R}) = w(\hat{S}) = W$ and no synchronized proper prefixes. Then $L_k(\hat{S}) \cap L_k(\hat{R}) = paths(G^k_{\hat{R},\hat{S}})$ for all levels $k$ of $G_{\hat{R},\hat{S}}$ such that $L_k(\hat{S}) \cap L_k(\hat{R}) \neq \emptyset$.
Proof. Again, let us assume without loss of generality that $w(\hat{R}[0]) > w(\hat{S}[0])$. We prove the result by induction on $k$.
[Level $k = 0$] By construction, $n_0$ contains strings in $\hat{R}[0] \cap (chop_0 \otimes \cdots \otimes chop_{q_0})$, which have length $|G^0_{\hat{R},\hat{S}}|$, and are also in $\hat{S}[0]$, and hence belong to both $L_0(\hat{S})$ and $L_0(\hat{R})$.
[Level $k > 0$] By inductive hypothesis, we have that $L_{k-1}(\hat{S}) \cap L_{k-1}(\hat{R}) = paths(G^{k-1}_{\hat{R},\hat{S}})$: suppose that $L_k(\hat{S}) \cap L_k(\hat{R}) \neq \emptyset$, otherwise the graph ends at level $k - 1$. We first show that $paths(G^k_{\hat{R},\hat{S}}) \subseteq L_k(\hat{S}) \cap L_k(\hat{R})$: by Definition 13, any $z \in paths(G^k_{\hat{R},\hat{S}})$ can be written as $z = z'z''$ with $z'$ in $paths(G^{k-1}_{\hat{R},\hat{S}})$ and with $z''$ that belongs to some node at level $k$ of $G^k_{\hat{R},\hat{S}}$ reached by an edge leaving a suffix of $z'$. By inductive hypothesis $z' \in L_{k-1}(\hat{S}) \cap L_{k-1}(\hat{R})$ and, again by Definition 13, $z'' \in chop_{q_{k-1}+1} \otimes \cdots \otimes chop_{q_k}$; since $L_k(\hat{S}) \cap L_k(\hat{R}) \neq \emptyset$ these chops are not empty, their concatenation contains the suffix of length $|G^k_{\hat{R},\hat{S}}| - |G^{k-1}_{\hat{R},\hat{S}}|$ of strings in both $L_k(\hat{R})$ and $L_k(\hat{S})$, and hence $z \in L_k(\hat{S}) \cap L_k(\hat{R})$.
We now show that $L_k(\hat{S}) \cap L_k(\hat{R}) \subseteq paths(G^k_{\hat{R},\hat{S}})$: consider a string $u \in L_k(\hat{S}) \cap L_k(\hat{R})$ that can be written as $u = u'u''$ with $u'$ the prefix of $u$ having length $|G^{k-1}_{\hat{R},\hat{S}}|$, which then belongs to $L_{k-1}(\hat{S}) \cap L_{k-1}(\hat{R})$; then, by inductive hypothesis, $u' \in paths(G^{k-1}_{\hat{R},\hat{S}})$ and, since $u \in L_k(\hat{S}) \cap L_k(\hat{R})$, there is an edge linking a suffix of $u'$ at level $k - 1$ with a node at level $k$ of $G^k_{\hat{R},\hat{S}}$ containing a suffix $u''$ of $u$ of length $|G^k_{\hat{R},\hat{S}}| - |G^{k-1}_{\hat{R},\hat{S}}|$, and hence $u \in paths(G^k_{\hat{R},\hat{S}})$. ∎
As a special case of Lemma 15, if $L(\hat{S}) \cap L(\hat{R}) \neq \emptyset$, then $G_{\hat{R},\hat{S}}$ is built up to the last level and the following holds.
Theorem 16. Let $\hat{R}$, $\hat{S}$ be two GD strings having lengths, respectively, $r$ and $s$, with $w(\hat{R}) = w(\hat{S})$ and no synchronized proper prefixes. Then $G_{\hat{R},\hat{S}}$ has exactly $r + s - 1$ levels, and we have that $L(\hat{S}) \cap L(\hat{R}) = paths(G_{\hat{R},\hat{S}})$.
$G_{\hat{R},\hat{S}}$ is thus a linear-sized representation of the possibly exponential-sized (Fact 9) set $L(\hat{S}) \cap L(\hat{R})$.
We now show an $O(N + M)$-time algorithm for the standard word RAM model, denoted by GDSC, that decides whether $L(\hat{R})$ and $L(\hat{S})$ share at least one string (returns 1) or not (returns 0). GDSC starts with constructing the generalized suffix tree $T_{\hat{R},\hat{S}}$ of all the strings in $\hat{R}$ and $\hat{S}$. Then it scans $\hat{R}$ and $\hat{S}$ starting with $\hat{R}[0]$ and $\hat{S}[0]$, storing in $chop_{\hat{R},\hat{S}}$ the latest $chop_i$ and in $active_{\hat{R},\hat{S}}$ the latest $active_{\hat{A}_i,\hat{B}_i}$ using $T_{\hat{R},\hat{S}}$. For an efficient implementation, suffixes in $active_{\hat{R},\hat{S}}$ are stored (e.g. for $active_{\hat{A}_0,\hat{B}_0}$ assuming that $w(\hat{R}[0]) > w(\hat{S}[0])$) as index positions of $\hat{R}[0]$ and the starting position of the suffix as $active_{\hat{R},\hat{S}}.suff$. The next comparison is made between the corresponding suffixes of $\hat{R}[0]$ of length $w(\hat{R}[0]) - active_{\hat{R},\hat{S}}.suff$ and $\hat{S}[1]$, identifying first the minimum length of the two, and proceeding with the same process. The comparison of letters can be: (i) between $\hat{R}[i]$ and $\hat{S}[j]$; or (ii) between the corresponding strings of $active_{\hat{R},\hat{S}}.index$ and $\hat{R}[i]$; or (iii) between the corresponding strings of $active_{\hat{R},\hat{S}}.index$ and $\hat{S}[j]$. If the two GD strings have a synchronized proper prefix, this will result in $active_{\hat{R},\hat{S}} = \emptyset$ at positions $i$ in $\hat{R}$ and $j$ in $\hat{S}$. At this point, the comparison is restarted with the immediately following pair of degenerate letters.
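The scan performed by GDSC can be mimicked with plain Python sets, as in the sketch below. This is a simplified stand-in rather than the algorithm analysed in the paper: it recomputes each chop by hashing string prefixes instead of querying a generalized suffix tree, so it does not achieve the $O(N + M)$ bound. The name `gdsc_naive` and the list-of-lists encoding are assumptions of this example.

```python
def gdsc_naive(r, s):
    """Return 1 if L(r) and L(s) share a string, else 0 (set-based sketch)."""
    if sum(len(l[0]) for l in r) != sum(len(l[0]) for l in s):
        return 0                       # different total widths: no common string
    i = j = 0
    a, b = set(r[0]), set(s[0])        # current sets on the two sides
    while True:
        k = min(len(next(iter(a))), len(next(iter(b))))
        common = {x[:k] for x in a} & {y[:k] for y in b}   # the current chop
        if not common:
            return 0
        # Keep the active suffixes; the side of width k becomes empty.
        a = {x[k:] for x in a if x[:k] in common} - {""}
        b = {y[k:] for y in b if y[:k] in common} - {""}
        if not a:                      # side r consumed its letter: move to the next
            i += 1
            if i < len(r):
                a = set(r[i])
        if not b:                      # side s consumed its letter: move to the next
            j += 1
            if j < len(s):
                b = set(s[j])
        if i == len(r) and j == len(s):
            return 1
```

When both sets become empty in the same step, the two prefixes read so far are synchronized and the scan simply restarts from the next pair of degenerate letters.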
Theorem 17. Algorithm GDSC is correct. Given two GD strings $\hat{R}$ and $\hat{S}$ of total sizes $N$ and $M$, respectively, over an integer alphabet, algorithm GDSC requires $O(N + M)$ time.
Proof. The correctness follows directly from Lemma 10, Lemma 15, and Theorem 16.
Constructing the generalized suffix tree $T_{\hat{R},\hat{S}}$ can be done in time $O(N + M)$ [12]. For the pair of sets $(\hat{A}_i, \hat{B}_i)$ as in Definition 11, such that $w(\hat{A}_i) = k$ and $w(\hat{A}_i) \le w(\hat{B}_i)$, we query $T_{\hat{R},\hat{S}}$ with the $k$-length prefixes of strings in $\hat{B}_i$. For integer alphabets, instead of spelling the strings from the root of $T_{\hat{R},\hat{S}}$, we locate the corresponding terminal nodes for $(\hat{A}_i, \hat{B}_i)$. It then suffices to find longest common prefixes between these suffixes to simulate the querying process. Since all suffixes are lexicographically sorted during the construction of $T_{\hat{R},\hat{S}}$, we can also have the suffixes considered by pair $(\hat{A}_i, \hat{B}_i)$ lexicographically ranked with respect to $(\hat{A}_i, \hat{B}_i)$. Hence we do not perform the longest common prefix operation for all possible suffix pairs, but only for the lexicographically adjacent ones within this group. This can be done in $O(1)$ time per pair after $O(N + M)$-time pre-processing over $T_{\hat{R},\hat{S}}$ [7]. $chop_i$ is thus populated with the $k$-length prefixes of strings in $\hat{B}_i$ found in $\hat{A}_i$. The set $active_{\hat{A}_i,\hat{B}_i}$ of active suffixes can be found by chopping the suffixes of the strings in $\hat{B}_i$ from their prefixes successfully queried in $T_{\hat{R},\hat{S}}$. This requires time $O(|\hat{A}_i| + |\hat{B}_i|)$ for processing $(\hat{A}_i, \hat{B}_i)$.
Let $\hat{R}$ and $\hat{S}$ be of length $r$ and $s$, respectively. Assume that $\hat{R}$ and $\hat{S}$ have no synchronized proper prefixes. Then Theorem 16 ensures that the total number of comparisons cannot exceed $r + s - 2$: this results in a time complexity of $O(N + M + \sum_{i=0}^{r+s-2}(|\hat{A}_i| + |\hat{B}_i|)) = O(N + M)$.
If $\hat{R}$ and $\hat{S}$ have synchronized proper prefixes, we perform the comparison up to the shortest synchronized prefixes (i.e. the set of active suffixes becomes empty) and then restart the procedure from the immediately following pair of degenerate letters. Clearly the total number of comparisons also in this case cannot be more than $r + s - 2$. ∎
4 Computing Palindromes in GD Strings
Armed with the efficient GD string comparison tool, we shift our focus to our initial motivation, namely, computing palindromes in GD strings.
Definition 18. A GD string $\hat{S}$ is a GD palindrome if there exists a string in $L(\hat{S})$ that is a palindrome.
A GD palindrome $\hat{S}[i] \ldots \hat{S}[j]$ in $\hat{S}$, whose total width is $w(\hat{S}[i] \ldots \hat{S}[j])$, can be encoded as a pair $(c, r)$, where its center is $c = \frac{w(\hat{S}[0] \ldots \hat{S}[i-1]) + w(\hat{S}[0] \ldots \hat{S}[j]) - 1}{2}$ when $i > 0$, and $c = \frac{w(\hat{S}[0] \ldots \hat{S}[j]) - 1}{2}$ when $i = 0$; its radius is $r = \frac{w(\hat{S}[i] \ldots \hat{S}[j])}{2}$. $\hat{S}[i] \ldots \hat{S}[j]$ is called maximal if no other GD palindrome $(c, r')$ exists in $\hat{S}$ with $r' > r$. Note that we only consider the GD palindromes $\hat{S}[i] \ldots \hat{S}[j]$ that start with the first letter of some string $X \in \hat{S}[i]$ and end with the last letter of some string $Y \in \hat{S}[j]$, while the center can be anywhere: in between or inside degenerate letters. That is, in $\hat{S}$ there are $2 \cdot w(\hat{S}) - 1 = 2W - 1$ possible centers.
Example 19. Consider the GD string $\hat{S}$ of Figure 1(c) where palindromes are underlined; one starts at $\hat{S}[0]$ and ends at $\hat{S}[2]$: it corresponds to $(c, r) = (2.5, 3)$. A second palindrome starts at $\hat{S}[4]$ and ends at $\hat{S}[5]$: it corresponds to $(c, r) = (9, 2.5)$.
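The $(c, r)$ encoding depends only on the widths of the degenerate letters. The following Python sketch reproduces the two pairs of Example 19, assuming that the letters of the GD string of Figure 1(c) have widths 1, 2, 3, 1, 4, 1 (which sum to $W = 12$); the function name `center_radius` is hypothetical.

```python
def center_radius(widths, i, j):
    """Center and radius of the GD substring S[i..j], given all letter widths."""
    before = sum(widths[:i])        # w(S[0..i-1]); 0 when i == 0
    upto = sum(widths[:j + 1])      # w(S[0..j])
    c = (before + upto - 1) / 2
    r = (upto - before) / 2         # w(S[i..j]) / 2
    return c, r

widths = [1, 2, 3, 1, 4, 1]         # assumed widths of the letters in Figure 1(c)
first = center_radius(widths, 0, 2)
second = center_radius(widths, 4, 5)
```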
In this section, we consider the following problem. Given a GD string $\hat{S}$ of length $n$, total size $N$, and total width $W$, find all GD strings $\hat{S}[i] \ldots \hat{S}[j]$, with $0 \le i \le j \le n - 1$, that are GD palindromes. We give two alternative algorithms: one finds all GD palindromes by seeking them for all $(i, j)$ pairs; and the other one finds them starting from all possible centers. The two algorithms have different time complexities: which one is faster depends on $W$, $N$, and $n$. In fact, they compute all GD palindromes, but report only the maximal ones.
We first describe algorithm MaxPalPairs. For all positions $i, j$ within $\hat{S}$, in order to check whether $\hat{S}[i] \ldots \hat{S}[j]$ is a GD palindrome, we apply the GDSC algorithm to $\hat{S}[i] \ldots \hat{S}[j]$ and its reverse, denoted by $rev(\hat{S}[i] \ldots \hat{S}[j])$; the reverse is defined by reversing the sequence of degenerate letters and also reversing the strings in every degenerate letter. GD palindromes are, finally, sorted per center, and the maximal GD palindromes are reported. Sorting the $(i, j)$ pairs by their centers can be done in $O(W)$ time using bucket sort, which is bounded by $O(N)$ since $N \ge W$.
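The pair-based enumeration can be sketched in Python as follows. This is not algorithm MaxPalPairs itself: instead of calling GDSC on $\hat{S}[i] \ldots \hat{S}[j]$ and its reverse, the sketch tests Definition 18 directly by enumerating $L(\hat{S}[i] \ldots \hat{S}[j])$, which is exponential in general and only meant for small inputs. The names `is_gd_palindrome` and `max_pal_pairs` are assumptions of this example, and the GD string `S` is the assumed reading of Figure 1(c).

```python
from itertools import product

def is_gd_palindrome(gd):
    """Definition 18 by brute force: does L(gd) contain a palindrome?"""
    return any(z == z[::-1] for z in ("".join(p) for p in product(*gd)))

def max_pal_pairs(gd):
    """Return {center: (radius, i, j)} with the widest GD palindrome per center."""
    widths = [len(letter[0]) for letter in gd]
    best = {}
    for i in range(len(gd)):
        for j in range(i, len(gd)):
            if is_gd_palindrome(gd[i:j + 1]):
                before, upto = sum(widths[:i]), sum(widths[:j + 1])
                c, r = (before + upto - 1) / 2, (upto - before) / 2
                if c not in best or r > best[c][0]:
                    best[c] = (r, i, j)   # keep only the maximal one per center
    return best

S = [["A"], ["GC", "AG"], ["TCT", "CGA", "TCA"], ["A"],
     ["TCTC", "GCTC", "CGCA"], ["G"]]
best = max_pal_pairs(S)
```

On this input, `best[2.5]` and `best[9.0]` are the two GD palindromes of Example 19.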
Since there are $O(n^2)$ pairs $(i, j)$, and since by Theorem 17 algorithm GDSC takes time proportional to the total size of $\hat{S}[i] \ldots \hat{S}[j]$ to check whether $\hat{S}[i] \ldots \hat{S}[j]$ is a GD palindrome, algorithm MaxPalPairs takes $O(n^2 N)$ time in total.
In algorithm MaxPalCenters, we consider all possible centers $c$ of $\hat{S}$. In the case when $c$ is in between two degenerate letters we simply try to extend to the left and to the right via applying GDSC. In the case when $c$ is inside a degenerate letter we intuitively split the letter vertically into two letters and try to extend to the left and to the right via applying GDSC. At each extension step of this procedure we maintain two GD strings $\hat{L}$ (left of the center) and $\hat{R}$ (right of the center) such that they are of the same total width. We consider the reverse of $\hat{L}$ (similar to algorithm MaxPalPairs) for the comparison. In the case where $c$ occurs inside a degenerate letter, to make sure we do not identify palindromes which do not exist, for all $j$ split strings of the degenerate letter, we check that $\hat{L}^R[0][j][0 \ldots k-1] = \hat{R}[0][j][0 \ldots k-1]$ where $\hat{L}^R = rev(\hat{L})$ and $k = \min(w(\hat{L}^R[0]), w(\hat{R}[0]))$. If no matches are found, we move on to the next center. Otherwise, when a match is found, we update $rev(\hat{L})$ and $\hat{R}$ with the remainder of the split degenerate letter (if its length is greater than $k$), as well as the next degenerate letters. Algorithm GDSC is applied to compare $rev(\hat{L})$ and $\hat{R}$. After a positive comparison, we overwrite $\hat{L}$ and $\hat{R}$ by adding the degenerate letters of the current extension until $w(\hat{L}) = w(\hat{R})$ (or until the end of the string is reached). This process is repeated as long as GDSC returns a positive comparison, that is, until the maximal GD palindrome with center $c$ is found. The radius reported is then the total sum of all values of $w(\hat{L})$. If GDSC returns a negative comparison at center $c$, we proceed with the next center, because we clearly cannot have a GD palindrome centered at $c$ extended further if the intersection of $rev(\hat{L})$ and $\hat{R}$ is empty.
By Theorem 17 and the fact that there are $2W - 1$ possible centers, we have that algorithm MaxPalCenters takes $O(WN)$ time in total. We obtain the following result.
I Theorem 20. Given a GD string of length n, total size N , and total width W , over an
integer alphabet, all (maximal) GD palindromes can be computed in time O(min{W, n2}N ).
A problem that has gained significant attention recently is the factorization of a string X of
length n into a sequence of palindromes [3, 13, 30, 9, 5, 2]. We say that X1, X2, . . . , Xℓ is
a (maximal) palindromic factorization of string X if every Xi is a (maximal) palindrome,
X = X1X2 . . . Xℓ, and ℓ is minimal. In biological applications we need to factorize a sequence
into palindromes in order to identify hairpins, patterns that occur in single-stranded DNA
or, more commonly, in RNA. Next, we define and solve the same problem for GD strings.
Definition 21. A (maximal) GD palindromic factorization of a GD string Sˆ is a sequence
Pˆ1, . . . , Pˆℓ of GD strings, such that: (i) every Pˆi is either a (maximal) GD palindrome or a
degenerate letter of Sˆ; (ii) Sˆ = Pˆ1 . . . Pˆℓ; (iii) ℓ is minimal.
After locating all (maximal) GD palindromes in Sˆ using Theorem 20, we are in a
position to amend the algorithm of Alatabbi et al. [3] to find a (maximal) GD palindromic
factorization of Sˆ. We define a directed graph GSˆ = (V, E), where V = {i | 0 ≤ i ≤ n} and
E = {(i, j + 1) | Sˆ[i . . . j] is a (maximal) GD palindrome of Sˆ} ∪ {(i, i + 1) | 0 ≤ i < n}. Note
that V contains a node n being the sink of edges representing (maximal) GD palindromes
ending at Sˆ[n − 1]. For maximal GD palindromes, E contains no more than 3W edges, as the
maximum number of maximal GD palindromes is 2W − 1 and n ≤ W. For GD palindromes, E contains
O(n^2) edges, as the maximum number of GD palindromes is O(n^2). A shortest path in GSˆ
from 0 to n gives a (maximal) GD palindromic factorization. For maximal GD palindromes,
the size of GSˆ is O(W), as n ≤ W, and so finding this shortest path requires O(W) time
using a standard algorithm such as breadth-first search, since all edges have unit weight. For GD palindromes, the size of GSˆ, and thus the time, is O(n^2).
Theorem 22. Given a GD string Sˆ of length n, total size N, and total width W, over an
integer alphabet, a (maximal) GD palindromic factorization of Sˆ can be computed in time
O(min{W, n^2} N).
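The shortest-path step can be sketched as follows. This is an illustrative sketch only: it assumes the (maximal) GD palindromes have already been computed and are given as inclusive index pairs (i, j) over the n degenerate letters, and it uses breadth-first search, which suffices because every edge has unit weight. The function name `palindromic_factorization` and the input format are assumptions of this example, not part of the paper.

```python
from collections import deque
from typing import List, Tuple


def palindromic_factorization(n: int, palindromes: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return a minimum-length factorization as a list of inclusive intervals.

    Nodes are 0..n. Each palindrome S[i..j] gives an edge i -> j + 1, and
    each single degenerate letter gives an edge i -> i + 1.
    """
    adj = [[i + 1] for i in range(n)] + [[]]  # unit edges; node n is the sink
    for i, j in palindromes:
        adj[i].append(j + 1)

    parent = [-1] * (n + 1)
    seen = [False] * (n + 1)
    seen[0] = True
    queue = deque([0])
    while queue:  # plain BFS from node 0
        u = queue.popleft()
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                queue.append(v)

    # Walk back from the sink to recover the factors.
    factors = []
    v = n
    while v != 0:
        u = parent[v]
        factors.append((u, v - 1))
        v = u
    factors.reverse()
    return factors


pals = [(0, 0), (0, 2), (1, 2), (1, 3), (3, 3)]
print(palindromic_factorization(4, pals))
```

The unit edges guarantee that node n is always reachable, so a factorization always exists. Passing only maximal palindromes yields the maximal variant of Definition 21, while passing all palindromes yields the unrestricted one.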
5 A Conditional Lower Bound under SETH
In this section, we show a conditional lower bound for computing palindromes in
degenerate strings. Let us first define the 2-Orthogonal Vectors problem. Given two sets
A = {α1, α2, . . . , αn} and B = {β1, β2, . . . , βn} of d-bit vectors, where d = ω(log n), the
2-Orthogonal Vectors problem asks the following question: is there any pair αi, βj of vectors
that is orthogonal? Namely, is ∑_{k=0}^{d−1} αi[k] · βj[k] equal to 0? For the moderate dimension of
this problem, we follow [16], assuming n^{2−ε} · d^{O(1)} ≤ n^2 · d. The following result is known.
Theorem 23 ([16, 21, 22, 33]). The 2-Orthogonal Vectors problem cannot be solved in
O(n^{2−ε} · d^{O(1)}) time, for any ε > 0, unless the Strong Exponential Time Hypothesis fails.
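For reference, the straightforward O(n^2 · d)-time check for the 2-Orthogonal Vectors problem, which under SETH cannot be improved to O(n^{2−ε} · d^{O(1)}) time, can be sketched as below. The function name `find_orthogonal_pair` and the representation of vectors as lists of 0/1 integers are assumptions of this illustration.

```python
from typing import List, Optional, Tuple


def find_orthogonal_pair(A: List[List[int]], B: List[List[int]]) -> Optional[Tuple[int, int]]:
    """Return indices (i, j) with A[i] orthogonal to B[j], or None if no such pair exists."""
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            # Orthogonal: no coordinate where both vectors have a 1.
            if all(x * y == 0 for x, y in zip(a, b)):
                return (i, j)
    return None


A = [[1, 0, 1], [0, 1, 1]]
B = [[0, 1, 1], [1, 0, 0]]
print(find_orthogonal_pair(A, B))
```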
We next show that the 2-Orthogonal Vectors problem can be reduced to computing
maximal palindromes in degenerate strings, thus obtaining a conditional lower bound similar
to the upper bound obtained in Theorem 20 for computing all GD palindromes.
Theorem 24. Given a degenerate string of length 4n over an alphabet of size σ = ω(log n),
all maximal GD palindromes cannot be computed in O(n^{2−ε} · σ^{O(1)}) time, for any ε > 0,
unless the Strong Exponential Time Hypothesis fails.
Proof. Let d = σ and consider the alphabet Σ = {0, 1, . . . , σ − 1}. We say that two subsets
of Σ match if they have a common element. Given a d-bit vector α, we define μ(α) to be
the following subset of Σ: s ∈ μ(α) if and only if α[s] = 1. Thus, two vectors α and β are
orthogonal if and only if the sets μ(α) and μ(β) are disjoint. In the string comparison setting,
two degenerate letters μ(α) and μ(β) do not match if and only if α and β are orthogonal.
The reduction works as follows. Given A = {α1, α2, . . . , αn} and B = {β1, β2, . . . , βn}, we
construct the following simple degenerate string of length 4n in time O(nσ):
S = μ(α1)μ(β1)μ(α2)μ(β2) . . . μ(αn)μ(βn) μ(α1)μ(β1)μ(α2)μ(β2) . . . μ(αn)μ(βn).
Then the 2-Orthogonal Vectors problem for the sets A and B has a positive answer if
and only if at any position of S, from 0 to 2n, there does not occur a palindrome of length at
least 2n. All such occurrences can be easily verified from the respective palindrome centers
in time O(n). In other words, if at any position of S there does not occur a palindrome of
length at least 2n, this is because we have a mismatch between a pair μ(αi), μ(βj) of letters,
which implies that there exists a pair αi, βj of orthogonal vectors. Also, by the construction,
all such pairs are to be (implicitly) compared, and thus, if there exists any pair that is
orthogonal, the corresponding mismatch will result in a palindrome of length less than 2n. □
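The encoding used in the proof can be illustrated with a short sketch. It builds μ(α) for each vector and the degenerate string S of length 4n as two copies of the interleaved block μ(α1)μ(β1) . . . μ(αn)μ(βn), and checks on a toy input that orthogonality coincides with disjointness of the encoded letters. The helper names `mu`, `orthogonal` and `build_degenerate_string` are assumptions of this illustration; the palindrome test itself is not reproduced here.

```python
from typing import FrozenSet, List


def mu(alpha: List[int]) -> FrozenSet[int]:
    """Map a d-bit vector to the set of coordinates where it has a 1."""
    return frozenset(s for s, bit in enumerate(alpha) if bit == 1)


def orthogonal(alpha: List[int], beta: List[int]) -> bool:
    """True when the inner product of the two 0/1 vectors is 0."""
    return sum(x * y for x, y in zip(alpha, beta)) == 0


def build_degenerate_string(A: List[List[int]], B: List[List[int]]) -> List[FrozenSet[int]]:
    """Interleave mu(alpha_i), mu(beta_i) and repeat the block twice (length 4n)."""
    block = []
    for a, b in zip(A, B):
        block.append(mu(a))
        block.append(mu(b))
    return block + block


A = [[1, 0, 1], [0, 1, 1]]
B = [[0, 1, 1], [1, 0, 0]]
S = build_degenerate_string(A, B)
print(len(S))  # 4n letters for n = 2
# Two encoded letters fail to match exactly when the vectors are orthogonal.
for a in A:
    for b in B:
        assert orthogonal(a, b) == (not (mu(a) & mu(b)))
```

Note that an all-zero vector would be mapped to the empty set, which matches no letter; this is consistent with such a vector being orthogonal to every other vector.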
6 Experimental Results
We present here a proof-of-concept experiment, but we anticipate that the algorithmic tools
developed in this paper are applicable in a wide range of biological applications.
We first obtained the amino acid sequences of 5 immunoglobulins within the human
V regions [15] and converted these into mRNA sequences [31]. The letters X, S, T, Y, Z, R
and H were replaced by degenerate letters according to IUPAC [23]. Each other letter,
c ∈ {A, C, G, U}, was treated as a single degenerate letter {c}. An average of 47% of the total
number of positions within the 5 sequences consisted of one of the following: X, S, T, Y, Z, R and
H. We then used algorithm MaxPalPairs to find all maximal palindromes in the 5 sequences.
Table 1 shows the palindromes identified within hypervariable regions I and II. Our results are
in accordance with Wuilmart et al. [34], who presented a statistical (fundamentally different)
method to identify the location of palindromes within regions of immunoglobulin genes. The
ranges we report are greater than or equal to the ones of [34] due to the maximality criterion.
[Figure, Example 25: product DFA of the DFAs for Rˆ and Sˆ.] Their product DFA gives their intersection: ACACAAC and CCCACCC.
[30] Mikhail Rubinchik and Arseny M. Shur. Eertree: An efficient data structure for processing palindromes in strings. In IWOCA, volume 9538 of LNCS, pages 321–333. Springer International Publishing, 2016.
[31] Randall T. Schuh. Major patterns in vertebrate evolution. Systematic Biology, 27(2):172, 1978.
[32] Henry Soldano, Alain Viari, and Marc Champesme. Searching for flexible repeated patterns using a nontransitive similarity relation. Pattern Recognition Letters, 16(3):233–246, 1995.
[33] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci., 348(2-3):357–365, 2005.
[34] C. Wuilmart, J. Urbain, and D. Givol. On the location of palindromes in immunoglobulin genes. Proceedings of the National Academy of Sciences of the United States of America, 74(6):2526–2530, 1977.
APPENDIX
A GD String Comparison Using Automata
Example 25. We illustrate here a simple automata-based approach. Say we want to
compare the following two GD strings:
Rˆ = {AC, CC} · {ACAAC, CACCC}
We construct the DFA for Rˆ and the DFA for Sˆ.
[Figure: the DFA for Rˆ and the DFA for Sˆ.]
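As a sanity check on what the intersection of two GD strings means, the sketch below enumerates the language of each GD string explicitly and intersects the two sets. This brute-force expansion takes time exponential in the number of sets in the worst case; it is neither the automata-based approach of this example nor algorithm GDSC. The GD string `R` is written as the first GD string of the example appears to read, and `S_hyp` is a hypothetical second GD string chosen only for illustration; it is not taken from the paper.

```python
from itertools import product
from typing import List, Set


def language(gd: List[List[str]]) -> Set[str]:
    """All strings obtained by picking one string from each set and concatenating."""
    return {"".join(parts) for parts in product(*gd)}


def intersection(r: List[List[str]], s: List[List[str]]) -> Set[str]:
    """Strings generated by both GD strings (brute force)."""
    return language(r) & language(s)


# Assumed reading of the first GD string in the example.
R = [["AC", "CC"], ["ACAAC", "CACCC"]]
# Hypothetical second GD string, for illustration only.
S_hyp = [["ACAC", "CCCA"], ["AAC", "CCC"]]

print(sorted(intersection(R, S_hyp)))
```

With these two inputs, the sets are split at different positions (widths 2 + 5 versus 4 + 3), which is exactly the situation in which a set-by-set comparison is not enough and the strings have to be compared across set boundaries.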
[1] Karl Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039–1051, 1987.
[2] Michał Adamczyk, Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, and Jakub Radoszewski. Palindromic decompositions with gaps and errors. In CSR, volume 10304 of LNCS, pages 48–61. Springer International Publishing, 2017.
[3] Ali Alatabbi, Costas S. Iliopoulos, and M. Sohel Rahman. Maximal palindromic factorization. In PSC, pages 70–77, 2013.
[4] Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos. On avoided words, absent words, and their application to biological sequence analysis. Algorithms for Molecular Biology, 12(1):5, 2017.
[5] Mai Alzamel, Jia Gao, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Efficient computation of palindromes in sequences with uncertainties. In EANN, volume 744 of CCIS, pages 620–629. Springer, 2017.
[6] Alberto Apostolico, Dany Breslauer, and Zvi Galil. Parallel detection of all palindromes in a string. Theoretical Computer Science, 141(1):163–173, 1995.
[7] Michael A. Bender and Martín Farach-Colton. The LCA problem revisited. In LATIN, volume 1776 of LNCS, pages 88–94. Springer, 2000.
[8] Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Pattern matching on elastic-degenerate text with errors. In SPIRE, volume 10508 of LNCS, pages 74–90. Springer, 2017.
[9] Kirill Borozdin, Dmitry Kosolobov, Mikhail Rubinchik, and Arseny M. Shur. Palindromic Length in Linear Time. In CPM, volume 78 of LIPIcs, pages 23:1–23:12. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017.
[10] The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, pages 1–18, 2016.
[11] Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Covering problems for partial words and for indeterminate strings. Theoretical Computer Science, 698:25–39, 2017.
[12] Martin Farach. Optimal suffix tree construction with large alphabets. In FOCS, pages 137–143. IEEE, 1997.
[13] Gabriele Fici, Travis Gagie, Juha Kärkkäinen, and Dominik Kempa. A subquadratic algorithm for minimum palindromic factorization. Journal of Discrete Algorithms, 28:41–48, 2014.
[14] Martin C. Frith, Ulla Hansen, John L. Spouge, and Zhiping Weng. Finding functional sequence elements by multiple local alignment. Nucleic Acids Res., 32(1):189–200, 2004.
[15] J. A. Gally and G. M. Edelman. The genetic control of immunoglobulin synthesis. Annual Review of Genetics, 6(1):1–46, 1972.
[16] Jiawei Gao and Russell Impagliazzo. Orthogonal vectors is hard for first-order properties on sparse graphs. Electronic Colloquium on Computational Complexity (ECCC), 23:53, 2016.
[17] Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-line pattern matching on a set of similar texts. In CPM, LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017.
[18] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA, 1997.
[19] Costas S. Iliopoulos, Ritu Kundu, and Solon P. Pissis. Efficient pattern matching in elastic-degenerate texts. In LATA, volume 10168 of LNCS, pages 131–142. Springer International Publishing, 2017.
[20] Costas S. Iliopoulos and Jakub Radoszewski. Truly Subquadratic-Time Extension Queries and Periodicity Detection in Strings with Uncertainties. In CPM, volume 54 of LIPIcs, pages 8:1–8:12, Dagstuhl, Germany, 2016. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik.
[21] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. J. Comput. Syst. Sci., 62(2):367–375, 2001.
[22] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? J. Comput. Syst. Sci., 63(4):512–530, 2001.
[23] IUPAC-IUB Commission on Biochemical Nomenclature. Abbreviations and symbols for nucleic acids, polynucleotides, and their constituents. Biochemistry, 9(20):4022–4027, 1970.
[24] Richard J. Lipton. On The Intersection of Finite Automata, pages 145–148. Springer US, Boston, MA, 2010.
[25] Glenn Manacher. A new linear-time “on-line” algorithm for finding the smallest initial palindrome of a string. Journal of the ACM, 22(3):346–351, 1975.
[26] Lee Ann McCue, William Thompson, Steven Carmack, Michael P. Ryan, Jun S. Liu, Victoria Derbyshire, and Charles E. Lawrence. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res., 29(3):774–782, 2001.
[27] Brejnev Muhizi Muhire, Michael Golden, Ben Murrell, Pierre Lefeuvre, Jean-Michel Lett, Alistair Gray, Art YF Poon, Nobubelo Kwanele Ngandu, Yves Semegni, Emil Pavlov Tanov, et al. Evidence of pervasive biologically functional secondary structures within the genomes of eukaryotic single-stranded DNA viruses. Journal of Virology, 88(4):1972–1989, 2014.
[28] Eugene W. Myers. Approximate matching of network expressions with spacers. Journal of Computational Biology, 3(1):33–51, 1996.
[29] Nadia Pisanti, Henry Soldano, Mathilde Carpentier, and Joël Pothier. A relational extension of the notion of motifs: Application to the common 3D protein substructures searching problem. Journal of Computational Biology, 16(12):1635–1660, 2009.