Degenerate String Comparison and Applications

LIPICS - Leibniz International Proceedings in Informatics, Jul 2018

A generalised degenerate string (GD string) $\hat{S}$ is a sequence of $n$ sets of strings of total size $N$, where the $i$th set contains strings of the same length $k_i$ but this length can vary between different sets. We denote the sum of these lengths $k_0, k_1, \ldots, k_{n-1}$ by $W$. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width $W$ in a compact form. Our first result in this paper is an $O(N+M)$-time algorithm for deciding whether the intersection of two GD strings of total sizes $N$ and $M$, respectively, over an integer alphabet, is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. A similar result can be obtained by employing an automata-based approach but its cost is alphabet-dependent. We then apply our string comparison algorithm to compute palindromes in GD strings. We present an $O(\min\{W, n^2\}N)$-time algorithm for computing all palindromes in $\hat{S}$. Furthermore, we show a similar conditional lower bound for computing maximal palindromes in $\hat{S}$. Finally, proof-of-concept experimental results are presented using real protein datasets.

The full text is available as a PDF:

http://drops.dagstuhl.de/opus/volltexte/2018/9323/pdf/LIPIcs-WABI-2018-21.pdf

Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone

Affiliations:
Department of Informatics, Systems and Communication (DISCo), University of Milan-Bicocca, Italy
Department of Informatics, King's College London, UK
Department of Informatics, King's College London, UK and Department of Computer Science, King Saud University, KSA
Department of Computer Science, University of Pisa, Italy and ERABLE Team, INRIA, France
Department of Computer Science, University of Pisa, Italy

Funding:
1 Partially supported by the project UNIPI PRA_2017_44 “Advanced computational methodologies for the analysis of biomedical data”.
2 Partially supported by the project UNIPI PRA_2017_44 “Advanced computational methodologies for the analysis of biomedical data”.
3 Partially supported by the project MIUR-SIR CMACBioSeq “Combinatorial methods for analysis and compression of biological sequences” grant n. RBSI146R5L and the project UNIPI PRA_2017_44 “Advanced computational methodologies for the analysis of biomedical data”.
4 Partially supported by the Royal Society project IE 161274 “Processing uncertain sequences: combinatorics and applications”.
5 Partially supported by the project MIUR-SIR CMACBioSeq “Combinatorial methods for analysis and compression of biological sequences” grant n. RBSI146R5L, the Royal Society project IE 161274 “Processing uncertain sequences: combinatorics and applications”, and the project UNIPI PRA_2017_44 “Advanced computational methodologies for the analysis of biomedical data”.

2012 ACM Subject Classification: Theory of computation → Pattern matching

1 Introduction

A degenerate string (or indeterminate string) over an alphabet $\Sigma$ is a sequence of subsets of $\Sigma$. A great deal of research has been conducted on degenerate strings (see [1, 11, 20, 29, 32] and references therein). These types of uncertain sequences have been used extensively for flexible modelling of DNA sequences known as IUPAC-encoded DNA sequences [23].

In [19], the authors introduced a more general definition of degenerate strings: an elastic-degenerate string (ED string) $\tilde{S}$ over $\Sigma$ is a sequence of subsets of $\Sigma^*$ (see also network expressions [28]) with the aim of representing multiple genomic sequences [10]. That is, a set of $\tilde{S}$ does not, in general, contain only letters; a set may also contain strings, including the empty string.
In a few recent papers on this notion, the authors provided several algorithms for pattern matching; specifically, for finding all exact [17] and approximate [8] occurrences of a standard string pattern in an ED text.

We introduce here another special type of uncertain sequence called generalised degenerate string; this can be viewed as an extension of degenerate strings or as a restricted variant of ED strings. Formally, a generalised degenerate string (GD string) $\hat{S}$ over $\Sigma$ is a sequence of $n$ sets of strings over $\Sigma$ of total size $N$, where the $i$th set contains strings of the same length $k_i > 0$ but this length can vary between different sets. We denote the sum of these lengths $k_0, k_1, \ldots, k_{n-1}$ by $W$. Thus a GD string can be used to represent, in a compact form, a gapless multiple sequence alignment (MSA) of fixed width, for example a high-scoring local alignment of multiple sequences; see Figure 1. This type of alignment is used for finding functional sequence elements [14]. For instance, searching for palindromic motifs in these types of alignments is an important problem since many transcription factors bind as homodimers to palindromes [26]. Specifically, a set of virus species can be clustered using high-scoring MSA to obtain subsets of viruses that have a common hairpin structure [27]. Our motivation for this paper comes from finding palindromes in these types of uncertain sequences.

Let us start off with standard strings. A palindrome is a sequence that reads the same from left to right and from right to left. Detection of palindromic factors in texts is a classical and well-studied problem in algorithms on strings and combinatorics on words, with many variants arising from different practical scenarios.
In molecular biology, for instance, palindromic sequences are extensively studied: they are often distributed around promoters, introns, and untranslated regions, playing important roles in gene regulation and other cell processes (e.g. see [4]). In particular these are strings of the form $X\bar{X}^R$, also known as complemented palindromes, occurring in single-stranded DNA or, more commonly, in RNA, where $X$ is a string and $\bar{X}^R$ is the reverse complement of $X$. In DNA, C-G are complements and A-T are complements; in RNA, C-G are complements and A-U are complements.

Figure 1: (a) Multiple sequence alignment. (b) Local gapless alignment. (c) GD string obtained from the local gapless alignment: $\hat{S} = \{A\} \cdot \{GC, AG\} \cdot \{TCT, CGA, TCA\} \cdot \{A\} \cdot \{TCTC, GCTC, CGCA\} \cdot \{G\}$. The aligned rows recoverable from the figure are AGCTCTATCTCG, AGCCGAAGCTCG and AAGTCAACGCAG.

A string $X = X[0]X[1] \ldots X[n-1]$ is said to have an initial palindrome of length $k$ if its prefix of length $k$ is a palindrome. Manacher first discovered an on-line algorithm that finds all initial palindromes in a string [25]. Later Apostolico et al. observed that the algorithm given by Manacher is able to find all maximal palindromic factors in the string in $O(n)$ time [6]. Gusfield gave an off-line linear-time algorithm to find all maximal palindromes in a string and also discussed the relation between biological sequences and gapped palindromes [18].

For uncertain sequences, we first need an algorithm for efficient string comparison, for which automata provide the following baseline. Let $\hat{X}$ and $\hat{Y}$ be two GD (or two ED) strings of total sizes $N$ and $M$, respectively. We first build the non-deterministic finite automaton (NFA) $A$ of $\hat{X}$ and the NFA $B$ of $\hat{Y}$ in time $O(N + M)$. We then construct the product NFA $C$ such that $L(C) = L(A) \cap L(B)$ in time $O(NM)$. The non-emptiness decision problem, namely, checking if $L(C) \neq \emptyset$, is decidable in time linear in the size of $C$, using breadth-first search (BFS). Hence the comparison of $\hat{X}$ and $\hat{Y}$ can be done in time $O(NM)$.
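The automata baseline above can be illustrated with a small sketch. It is only an approximation of the idea, under stated assumptions: a GD string is modelled as a Python list of sets of equal-length strings, each degenerate letter is walked as a trie-like state `(letter index, prefix read so far)` instead of building an explicit NFA, and the names `step`, `intersection_nonempty` and `FIG1C` are hypothetical. `FIG1C` is the GD string of Figure 1(c) as transcribed from the extracted text. The BFS explores pairs of states, in the spirit of the product construction, and makes no attempt at the stated time bounds.

```python
from collections import deque

def total_size(gd):
    """N: sum over letters of (number of strings) * (common length)."""
    return sum(len(letter) * len(next(iter(letter))) for letter in gd)

def total_width(gd):
    """W: sum of the common string lengths k_i."""
    return sum(len(next(iter(letter))) for letter in gd)

def step(gd, state, ch):
    """Move a state (letter index, prefix read so far) along character ch."""
    i, prefix = state
    prefix += ch
    if not any(s.startswith(prefix) for s in gd[i]):
        return None                      # no string of letter i has this prefix
    if len(prefix) == len(next(iter(gd[i]))):
        return (i + 1, "")               # letter i fully read: move to the next one
    return (i, prefix)

def intersection_nonempty(R, S):
    """BFS over pairs of states; True iff some string is spelled by both."""
    start = ((0, ""), (0, ""))
    goal = ((len(R), ""), (len(S), ""))
    seen, queue = {start}, deque([start])
    while queue:
        a, b = queue.popleft()
        if (a, b) == goal:
            return True
        if a[0] == len(R) or b[0] == len(S):
            continue                     # one side ended before the other
        # Characters that can extend the prefix read so far in R.
        for ch in {s[len(a[1])] for s in R[a[0]] if s.startswith(a[1])}:
            na, nb = step(R, a, ch), step(S, b, ch)
            if na is not None and nb is not None and (na, nb) not in seen:
                seen.add((na, nb))
                queue.append((na, nb))
    return False

# GD string of Figure 1(c), as transcribed (an assumption of this sketch).
FIG1C = [{"A"}, {"GC", "AG"}, {"TCT", "CGA", "TCA"}, {"A"},
         {"TCTC", "GCTC", "CGCA"}, {"G"}]
```

A standard string can be compared against a GD string by wrapping it in a single degenerate letter, e.g. `intersection_nonempty(FIG1C, [{"AGCCGAAGCTCG"}])`.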
It is known that if there existed faster methods for obtaining the automata intersection, then significant improvements would be implied to many long-standing open problems [24]. Hence an immediate reduction to the problem of NFA intersection does not particularly help. For GD strings we show at the beginning of Section 3 that we can build an ad-hoc deterministic finite automaton (DFA) for $\hat{X}$ and $\hat{Y}$, so that the intersection can be performed efficiently, but this simple solution cannot achieve $O(N + M)$ time as its cost is alphabet-dependent.

Our Contribution. Our first result in this paper is an $O(N + M)$-time algorithm for deciding whether the intersection of two GD strings of sizes $N$ and $M$, respectively, over an integer alphabet is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. An automata model of computation can also be employed to obtain these results but we present here an efficient implementation in the standard word RAM model with word size $w = \Omega(\log(N + M))$ that works also for integer alphabets. We then apply our string comparison tool to compute palindromes in GD strings. We present an $O(\min\{W, n^2\}N)$-time algorithm for computing all palindromes in $\hat{S}$. Furthermore, we show a non-trivial $\Omega(n^2|\Sigma|)$ lower bound under the Strong Exponential Time Hypothesis [21, 22] for computing all maximal palindromes. Note that there exists an infinite family of GD strings over an integer alphabet of size $|\Sigma| = \Theta(N)$ on which our algorithm requires time $O(n^2 N)$, thus matching the conditional lower bound. Finally, proof-of-concept experimental results are presented using real protein datasets; specifically, we apply our tools to find the location of palindromes in immunoglobulin genes of the human V regions.

2 Preliminaries

An alphabet $\Sigma$ is a non-empty finite set of letters of size $\sigma = |\Sigma|$. A string $X$ on an alphabet $\Sigma$ is a sequence of elements of $\Sigma$. The set of all strings on an alphabet $\Sigma$, including the empty string $\varepsilon$ of length 0, is denoted by $\Sigma^*$. For any string $X$, we denote by $X[i \ldots j]$ the substring or factor of $X$ that starts at position $i$ and ends at position $j$. In particular, $X[0 \ldots j]$ is the prefix of $X$ that ends at position $j$, and $X[i \ldots |X| - 1]$ is the suffix of $X$ that starts at position $i$, where $|X|$ denotes the length of $X$. The suffix tree of $X$ (generalised suffix tree for a set of strings) is a compact trie representing all suffixes of $X$. We denote the reversal of $X$ by string $X^R$, i.e. $X^R = X[|X| - 1]X[|X| - 2] \ldots X[0]$.

A string $P$ is said to be a palindrome if and only if $P = P^R$. If factor $X[i \ldots j]$, $0 \le i \le j \le n - 1$, of string $X$ of length $n$ is a palindrome, then $\frac{i+j}{2}$ is the center of $X[i \ldots j]$ in $X$ and $\frac{j-i+1}{2}$ is the radius of $X[i \ldots j]$. In other words, a palindrome is a string that reads the same forward and backward, i.e. a string $P$ is a palindrome if $P = Y a Y^R$ where $Y$ is a string, $Y^R$ is the reversal of $Y$ and $a$ is either a single letter or the empty string. Moreover, $X[i \ldots j]$ is called a palindromic factor of $X$. It is said to be a maximal palindrome if there is no other palindrome in $X$ with center $\frac{i+j}{2}$ and larger radius. Hence $X$ has exactly $2n - 1$ maximal palindromes. A maximal palindrome $P$ of $X$ can be encoded as a pair $(c, r)$, where $c$ is the center of $P$ in $X$ and $r$ is the radius of $P$.

Definition 1. A generalised degenerate string (GD string) $\hat{S} = \hat{S}[0]\hat{S}[1] \ldots \hat{S}[n-1]$ of length $n$ over an alphabet $\Sigma$ is a finite sequence of $n$ degenerate letters. Every degenerate letter $\hat{S}[i]$ of width $k_i > 0$, denoted also by $w(\hat{S}[i])$, is a finite non-empty set of strings $\hat{S}[i][j] \in \Sigma^{k_i}$, with $0 \le j < |\hat{S}[i]|$. For any GD string $\hat{S}$, we denote by $\hat{S}[i] \ldots \hat{S}[j]$ the GD substring of $\hat{S}$ that starts at position $i$ and ends at position $j$.

Definition 2. The total size $N$ and total width $W$, denoted also by $w(\hat{S})$, of a GD string $\hat{S}$ are respectively defined as

$$N = \sum_{i=0}^{n-1} |\hat{S}[i]| \times k_i \quad\text{and}\quad W = \sum_{i=0}^{n-1} k_i.$$

In this work, we generally consider GD strings over an integer alphabet of size $\sigma = N^{O(1)}$.

Example 3. The GD string $\hat{S}$ of Figure 1(c) has length $n = 6$, size $N = 28$, and $W = 12$.

Definition 4. Given two degenerate letters $\hat{X}$ and $\hat{Y}$, their Cartesian concatenation is

$$\hat{X} \otimes \hat{Y} = \{\, xy \mid x \in \hat{X},\ y \in \hat{Y} \,\}.$$

When $\hat{Y} = \emptyset$ (resp. $\hat{X} = \emptyset$) we set $\hat{X} \otimes \hat{Y} = \hat{X}$ (resp. $= \hat{Y}$). Notice that $\otimes$ is associative.

Definition 5. Consider a GD string $\hat{S}$ of length $n$. The language of $\hat{S}$ is $L(\hat{S}) = \hat{S}[0] \otimes \hat{S}[1] \otimes \cdots \otimes \hat{S}[n-1]$. Given two GD strings $\hat{R}$ and $\hat{S}$ of equal total width, the intersection of their languages is defined by $L(\hat{R}) \cap L(\hat{S})$.

Definition 6. Let $\hat{X} = \{ x_i \in \Sigma^k \}$ and $\hat{Y} = \{ y_j \in \Sigma^h \}$ be two degenerate letters on alphabet $\Sigma$. Further let us assume without loss of generality that $\hat{Y}$ is the set that contains the shorter strings (i.e. $h \le k$). We define the chop of $\hat{X}$ and $\hat{Y}$ and the active suffixes of $\hat{X}$ and $\hat{Y}$ as follows:

$$\mathrm{chop}_{\hat{X},\hat{Y}} = \{\, y_j \in \hat{Y} \mid y_j \text{ matches a prefix of } x_i \in \hat{X} \,\}$$

$$\mathrm{active}_{\hat{X},\hat{Y}} = \{\, x_i[h \ldots k-1] \mid x_i[0 \ldots h-1] \in \mathrm{chop}_{\hat{X},\hat{Y}} \,\}$$

Let $w(\mathrm{chop}_{\hat{X},\hat{Y}}) = \min\{w(\hat{X}), w(\hat{Y})\}$. When $\mathrm{active}_{\hat{X},\hat{Y}} = \{\varepsilon\}$, we set $\mathrm{active}_{\hat{X},\hat{Y}} = \emptyset$. We then have that $\mathrm{active}_{\hat{X},\hat{Y}} = \emptyset$ either if $h = k$ or if there is no match between any of the strings in $\hat{Y}$ and the prefix of a string in $\hat{X}$; i.e. $\mathrm{chop}_{\hat{X},\hat{Y}} = \emptyset$.

Example 7. Consider the following degenerate letters $\hat{X}$ and $\hat{Y}$ where $w(\hat{Y}) < w(\hat{X})$. The underlined strings in letter $\hat{Y}$ are prefixes of strings in letter $\hat{X}$, hence they are in $\mathrm{chop}_{\hat{X},\hat{Y}}$. The suffixes of such strings in $\hat{X}$ are the active suffixes in $\mathrm{active}_{\hat{X},\hat{Y}}$.

[Figure: the degenerate letters $\hat{X}$ and $\hat{Y}$ together with $\mathrm{chop}_{\hat{X},\hat{Y}}$ and $\mathrm{active}_{\hat{X},\hat{Y}}$; the layout of the sets is not recoverable from the extracted text.]

Definition 8. Let $\hat{R}$ and $\hat{S}$ be two GD strings of length $r$ and $s$, respectively. $\hat{R}[0] \ldots \hat{R}[i]$ is the prefix of $\hat{R}$ that ends at position $i$. It is called proper if $i \neq r - 1$. We say that $\hat{R}[0] \ldots \hat{R}[i]$ is synchronized with $\hat{S}[0] \ldots \hat{S}[j]$ if $w(\hat{R}[0] \ldots \hat{R}[i]) = w(\hat{S}[0] \ldots \hat{S}[j])$. We call these the shortest synchronized prefixes of $\hat{R}$ and $\hat{S}$, respectively, when $w(\hat{R}[0] \ldots \hat{R}[i']) \neq w(\hat{S}[0] \ldots \hat{S}[j'])$ for all $i' < i$, $j' < j$.

3 GD String Comparison

In this section, we consider the fundamental problem of GD string comparison. Let $\hat{R}$ and $\hat{S}$ be of total size $N$ and $M$, respectively. We provide an $O(N + M)$-time algorithm in the standard word RAM model with word size $w = \Omega(\log(N + M))$ that works also for integer alphabets.

Before presenting our efficient implementation, we observe that there is the following simple algorithm based on DFAs. Each degenerate letter of $\hat{R}$ and $\hat{S}$ can be represented by a trie, where its leaves are collapsed to a single one. For every two consecutive degenerate letters, the collapsed leaves of the former trie coincide with the root of the latter trie. An acyclic DFA is obtained in this way, as illustrated in Appendix A. We can perform the comparison of $\hat{R}$ and $\hat{S}$ by intersecting their corresponding DFAs using BFS on their product DFA. The trivial upper bound on the number of reachable states is $O(NM)$, but this can be improved to $O(N + M)$ by exploiting the structure of the two input DFAs. Each state in such a DFA has a unique level: the common length of paths from the initial state; and this structure is inherited by the product DFA. In other words, a level-$i$ state in the product DFA corresponds to a pair of level-$i$ states in the input DFAs. Observe that a level-$i$ state in one DFA is uniquely represented by the label of the path from the root of its trie, and for a fixed DFA and level, these labels have uniform lengths. Considering the two states composing a reachable state in the product DFA, it is easy to see that the shorter label must be a suffix of the longer label.
Hence, the state in the DFA with longer labels at level $i$ uniquely determines the state in the DFA with shorter labels at level $i$. Consequently, the number of reachable level-$i$ states in the product DFA is bounded by the number of level-$i$ states in the input DFAs, and the size is $O(N + M)$.

We observe that the cost of implementing the above ideas has an extra logarithmic factor due to state branching and, moreover, GD string comparisons require building the DFAs each time. We show how to obtain $O(N + M)$ time for integer alphabets, without creating DFAs. We show that, even if the size of $L(\hat{R}) \cap L(\hat{S})$ can be exponential in the total sizes of $\hat{R}$ and $\hat{S}$ (Fact 9), the problem of GD string comparison, i.e. deciding whether $L(\hat{R}) \cap L(\hat{S})$ is non-empty, can be solved in time linear with respect to the sum of the total sizes of the two GD strings (Theorem 17) and is thus of independent interest.

Fact 9. Given two GD strings $\hat{R}$ and $\hat{S}$, $L(\hat{S}) \cap L(\hat{R})$ can have size exponential in the total sizes of $\hat{R}$ and $\hat{S}$.

We next show when it is possible to factorize $L(\hat{R}) \cap L(\hat{S})$ into a Cartesian concatenation.

Lemma 10. Consider two GD strings $\hat{S} = \hat{S}'\hat{S}''$ and $\hat{R} = \hat{R}'\hat{R}''$ such that $w(\hat{S}) = w(\hat{R})$. If $\hat{S}'$ is synchronized with $\hat{R}'$, then

$$L(\hat{R}) \cap L(\hat{S}) = (L(\hat{R}') \cap L(\hat{S}')) \otimes (L(\hat{R}'') \cap L(\hat{S}'')).$$

Proof. It is clear that $L(\hat{S}) \cap L(\hat{R}) \supseteq (L(\hat{R}') \cap L(\hat{S}')) \otimes (L(\hat{S}'') \cap L(\hat{R}''))$. Indeed, consider a string $x \in L(\hat{R}') \cap L(\hat{S}')$ and a string $y \in L(\hat{S}'') \cap L(\hat{R}'')$: then, by the definition of Cartesian concatenation, $xy \in L(\hat{R}') \otimes L(\hat{R}'') = L(\hat{R})$ and $xy \in L(\hat{S}') \otimes L(\hat{S}'') = L(\hat{S})$.

We now prove the opposite inclusion. Consider a string $z \in L(\hat{S}) \cap L(\hat{R})$. By definition, $z = x_0 x_1 \ldots x_{r-1} = y_0 y_1 \ldots y_{s-1}$, with $x_i \in \hat{R}[i]$, $y_j \in \hat{S}[j]$, for all $0 \le i \le r - 1$ and $0 \le j \le s - 1$. Let $\hat{R}' = \hat{R}[0] \ldots \hat{R}[i]$, $\hat{S}' = \hat{S}[0] \ldots \hat{S}[j]$. Assume by contradiction that $z \notin (L(\hat{R}') \cap L(\hat{S}')) \otimes (L(\hat{S}'') \cap L(\hat{R}''))$: without loss of generality, $x_0 \ldots x_i \notin L(\hat{S}')$. Since $L(\hat{S}') \otimes L(\hat{S}'') = L(\hat{S})$, it follows that $z = x_0 x_1 \ldots x_{r-1} \notin L(\hat{S})$, hence $z \notin L(\hat{S}) \cap L(\hat{R})$, which is a contradiction. ∎

By applying Lemma 10 wherever $\hat{R}$ and $\hat{S}$ have synchronized prefixes, we are then left with the problem of intersecting GD strings with no synchronized proper prefixes. We now define an alternative decomposition within such strings (see also Example 12).

Definition 11. Let $\hat{R}$ and $\hat{S}$ be two GD strings of length $r$ and $s$, respectively, with no synchronized proper prefixes. We define

$$\text{c-chain}(\hat{R}, \hat{S}) = \max_q \{\, 0 \le q \le r + s - 2 \mid \mathrm{chop}_q \neq \emptyset \,\},$$

where $\mathrm{chop}_i$ denotes the set $\mathrm{chop}_{\hat{A}_i,\hat{B}_i}$, and $(\hat{A}_0, \hat{B}_0), (\hat{A}_1, \hat{B}_1), \ldots, (\hat{A}_q, \hat{B}_q)$, $\mathrm{pos}(\hat{A}_i)$, $\mathrm{pos}(\hat{B}_i)$ are recursively defined as follows: $\hat{A}_0 = \hat{R}[0]$, $\hat{B}_0 = \hat{S}[0]$, and $\mathrm{pos}(\hat{A}_0) = \mathrm{pos}(\hat{B}_0) = 0$. For $0 < i \le r + s - 2$, if $\mathrm{chop}_{i-1} \neq \emptyset$,

$$\hat{A}_i = \begin{cases} \hat{R}[\mathrm{pos}(\hat{A}_{i-1}) + 1] \text{ and } \mathrm{pos}(\hat{A}_i) = \mathrm{pos}(\hat{A}_{i-1}) + 1 & \text{if } w(\mathrm{chop}_{i-1}) = w(\hat{A}_{i-1}) \\ \mathrm{active}_{\hat{A}_{i-1},\hat{B}_{i-1}} \text{ and } \mathrm{pos}(\hat{A}_i) = \mathrm{pos}(\hat{A}_{i-1}) & \text{otherwise} \end{cases}$$

$$\hat{B}_i = \begin{cases} \hat{S}[\mathrm{pos}(\hat{B}_{i-1}) + 1] \text{ and } \mathrm{pos}(\hat{B}_i) = \mathrm{pos}(\hat{B}_{i-1}) + 1 & \text{if } w(\mathrm{chop}_{i-1}) = w(\hat{B}_{i-1}) \\ \mathrm{active}_{\hat{A}_{i-1},\hat{B}_{i-1}} \text{ and } \mathrm{pos}(\hat{B}_i) = \mathrm{pos}(\hat{B}_{i-1}) & \text{otherwise} \end{cases}$$

The generation of pairs $(\hat{A}_i, \hat{B}_i)$ stops at $i = q$ either if $q = r + s - 2$, or when $\mathrm{chop}_{q+1} = \emptyset$, in which case $\hat{R}$ and $\hat{S}$ only match until $(\hat{A}_q, \hat{B}_q)$. Intuitively, $\hat{A}_i$ (respectively, $\hat{B}_i$) represents suffixes of the current position of $\hat{R}$ (respectively, of $\hat{S}$), while $\mathrm{pos}(\hat{A}_i)$ (respectively, $\mathrm{pos}(\hat{B}_i)$) tells which position of $\hat{R}$ (respectively, $\hat{S}$) we are chopping.

Example 12 (Definition 11). Consider the following GD strings $\hat{R}$ and $\hat{S}$ with no synchronized proper prefixes: $\mathrm{chop}_0$ is the first red set from the left, $\mathrm{chop}_1$ is the first blue one, $\mathrm{chop}_2$ is the second red one, etc. The $\text{c-chain}(\hat{R}, \hat{S})$ terminates when $q = 7$.

[Figure: the GD strings $\hat{R}$ and $\hat{S}$, with the sets $\mathrm{chop}_0, \ldots, \mathrm{chop}_7$ coloured alternately red and blue and the pairs $(\hat{A}_0, \hat{B}_0), \ldots, (\hat{A}_7, \hat{B}_7)$ marked by braces; the sets themselves are not recoverable from the extracted text.]

Definition 13. Let $\hat{R}$ and $\hat{S}$ be two GD strings of length $r$ and $s$, respectively, with $w(\hat{R}) = w(\hat{S})$ and no synchronized proper prefixes. We define $G_{\hat{R},\hat{S}}$ as a directed acyclic graph with a structure of up to $r + s - 1$ levels, each node being a set of strings, as follows, where we assume without loss of generality that $w(\hat{R}[0]) > w(\hat{S}[0])$:

Level $k = 0$: consists of a single node
$$n_0 = \{\, x \in \hat{R}[0] \mid x = y_0 \ldots y_{q_0} \text{ with } y_j \in \mathrm{chop}_j \ \forall j : 0 \le j \le q_0 \,\},$$
where $q_0$ is the index of the rightmost chop containing suffixes of $\hat{R}[0]$.

Level $k > 0$: consists of $\ell = |\mathrm{chop}_{q_{k-1}}|$ nodes. Assuming without loss of generality that level $k - 1$ has been built with suffixes of $\hat{R}[\mathrm{pos}(\hat{A}_{q_{k-1}})]$, level $k$ contains suffixes of a position of $\hat{S}$. Let $c_0, \ldots, c_{\ell-1}$ denote the elements of $\mathrm{chop}_{q_{k-1}}$. Then, for $0 \le i \le \ell - 1$, the $i$-th node of level $k$ is
$$n_i = \{\, y_{q_{k-1}+1} \ldots y_{q_k} \mid c_i\, y_{q_{k-1}+1} \ldots y_{q_k} \in \hat{B}_{q_{k-1}} \text{ with } y_j \in \mathrm{chop}_j \ \forall j : q_{k-1} + 1 \le j \le q_k \,\},$$
where $q_k$ is the index of the rightmost chop containing suffixes of $\hat{S}[\mathrm{pos}(\hat{B}_{q_{k-1}})]$.

Every string in level $k - 1$ whose suffix is $c_i$ is the source of an edge having the whole node $n_i$ as a sink. We define $\mathrm{paths}(G_{\hat{R},\hat{S}})$ as the set of strings spelled by a path in $G_{\hat{R},\hat{S}}$ that starts at $n_0$ and ends at the last level. Note that the size of $G_{\hat{R},\hat{S}}$ is at most linear in the sum of the sizes of $\hat{R}$ and $\hat{S}$, as the nodes contain strings either in $\hat{R}$ or in $\hat{S}$ with no duplications, and each node has out-degree equal to the number of strings it contains.

Example 14 (Definition 13). In $G_{\hat{R},\hat{S}}$ for the GD strings $\hat{R}$, $\hat{S}$ of Example 12, $q_0 = 2$ and the strings in level 0 belong to $(\mathrm{chop}_0 \otimes \mathrm{chop}_1 \otimes \mathrm{chop}_2) \cap \hat{R}[0]$. Level 1 contains suffixes of strings in $\hat{B}_2$ (and of strings in $\hat{B}_3$, as $\mathrm{chop}_3 = \{A, T\}$ and indeed $q_1 = 3$), level 2 suffixes of strings in $\hat{A}_3$ (as $q_2 = 5$), level 3 suffixes of strings in $\hat{B}_5$ ($q_3 = 6$), level 4 suffixes of strings in $\hat{A}_6$ ($q_4 = 7$). The three paths from level 0 to level 4 correspond to the three strings in $L(\hat{R}) \cap L(\hat{S})$: AGCCGAATCTCG, AAGTCAATCTCG, AAGTCTAGCTCG.

Let $G^k_{\hat{R},\hat{S}}$ be $G_{\hat{R},\hat{S}}$ truncated at level $k$, and let $|G^k_{\hat{R},\hat{S}}|$ be the length of the strings it spells. Let $L_k(\hat{S})$ denote the set of prefixes of length $|G^k_{\hat{R},\hat{S}}|$ of $L(\hat{S})$.

Lemma 15. Let $\hat{R}$, $\hat{S}$ be two GD strings with $w(\hat{R}) = w(\hat{S}) = W$ and no synchronized proper prefixes. Then $L_k(\hat{S}) \cap L_k(\hat{R}) = \mathrm{paths}(G^k_{\hat{R},\hat{S}})$ for all levels $k$ of $G_{\hat{R},\hat{S}}$ such that $L_k(\hat{S}) \cap L_k(\hat{R}) \neq \emptyset$.

Proof. Again, let us assume without loss of generality that $w(\hat{R}[0]) > w(\hat{S}[0])$. We prove the result by induction on $k$.

[Level $k = 0$] By construction, $n_0$ contains strings in $\hat{R}[0] \cap (\mathrm{chop}_0 \otimes \cdots \otimes \mathrm{chop}_{q_0})$, which have length $|G^0_{\hat{R},\hat{S}}|$, and are also in $\hat{S}[0]$, and hence belong to both $L_0(\hat{S})$ and $L_0(\hat{R})$.

[Level $k > 0$] By inductive hypothesis, we have that $L_{k-1}(\hat{S}) \cap L_{k-1}(\hat{R}) = \mathrm{paths}(G^{k-1}_{\hat{R},\hat{S}})$: suppose that $L_k(\hat{S}) \cap L_k(\hat{R}) \neq \emptyset$, otherwise the graph ends at level $k - 1$. We first show that $\mathrm{paths}(G^k_{\hat{R},\hat{S}}) \subseteq L_k(\hat{S}) \cap L_k(\hat{R})$: by Definition 13, any $z \in \mathrm{paths}(G^k_{\hat{R},\hat{S}})$ can be written as $z = z'z''$ with $z'$ in $\mathrm{paths}(G^{k-1}_{\hat{R},\hat{S}})$ and with $z''$ that belongs to some node at level $k$ of $G^k_{\hat{R},\hat{S}}$ reached by an edge leaving a suffix of $z'$. By inductive hypothesis $z' \in L_{k-1}(\hat{S}) \cap L_{k-1}(\hat{R})$ and, again by Definition 13, $z'' \in \mathrm{chop}_{q_{k-1}+1} \otimes \cdots \otimes \mathrm{chop}_{q_k}$; since $L_k(\hat{S}) \cap L_k(\hat{R}) \neq \emptyset$ these chops are not empty, their concatenation contains the suffix of length $|G^k_{\hat{R},\hat{S}}| - |G^{k-1}_{\hat{R},\hat{S}}|$ of strings in both $L_k(\hat{R})$ and $L_k(\hat{S})$, and hence $z \in L_k(\hat{S}) \cap L_k(\hat{R})$.

We now show that $L_k(\hat{S}) \cap L_k(\hat{R}) \subseteq \mathrm{paths}(G^k_{\hat{R},\hat{S}})$: consider a string $u \in L_k(\hat{S}) \cap L_k(\hat{R})$ that can be written as $u = u'u''$ with $u'$ the prefix of $u$ having length $|G^{k-1}_{\hat{R},\hat{S}}|$, which then belongs to $L_{k-1}(\hat{S}) \cap L_{k-1}(\hat{R})$; then, by inductive hypothesis, $u' \in \mathrm{paths}(G^{k-1}_{\hat{R},\hat{S}})$ and, since $u \in L_k(\hat{S}) \cap L_k(\hat{R})$, there is an edge linking a suffix of $u'$ at level $k - 1$ with a node at level $k$ of $G^k_{\hat{R},\hat{S}}$ containing a suffix $u''$ of $u$ of length $|G^k_{\hat{R},\hat{S}}| - |G^{k-1}_{\hat{R},\hat{S}}|$, and hence $u \in \mathrm{paths}(G^k_{\hat{R},\hat{S}})$. ∎

As a special case of Lemma 15, if $L(\hat{S}) \cap L(\hat{R}) \neq \emptyset$, then $G_{\hat{R},\hat{S}}$ is built up to the last level and the following holds.

Theorem 16. Let $\hat{R}$, $\hat{S}$ be two GD strings having lengths, respectively, $r$ and $s$, with $w(\hat{R}) = w(\hat{S})$ and no synchronized proper prefixes. Then $G_{\hat{R},\hat{S}}$ has exactly $r + s - 1$ levels, and we have that $L(\hat{S}) \cap L(\hat{R}) = \mathrm{paths}(G_{\hat{R},\hat{S}})$.

$G_{\hat{R},\hat{S}}$ is thus a linear-sized representation of the possibly exponential-sized (Fact 9) set $L(\hat{S}) \cap L(\hat{R})$.

We now show an $O(N + M)$-time algorithm for the standard word RAM model, denoted by GDSC, that decides whether $L(\hat{R})$ and $L(\hat{S})$ share at least one string (returns 1) or not (returns 0). GDSC starts with constructing the generalized suffix tree $T_{\hat{R},\hat{S}}$ of all the strings in $\hat{R}$ and $\hat{S}$. Then it scans $\hat{R}$ and $\hat{S}$ starting with $\hat{R}[0]$ and $\hat{S}[0]$, storing in $\mathrm{chop}_{\hat{R},\hat{S}}$ the latest $\mathrm{chop}_i$ and in $\mathrm{active}_{\hat{R},\hat{S}}$ the latest $\mathrm{active}_{\hat{A}_i,\hat{B}_i}$, using $T_{\hat{R},\hat{S}}$. For an efficient implementation, suffixes in $\mathrm{active}_{\hat{R},\hat{S}}$ are stored (e.g. for $\mathrm{active}_{\hat{A}_0,\hat{B}_0}$, assuming that $w(\hat{R}[0]) > w(\hat{S}[0])$) as index positions of $\hat{R}[0]$ and the starting position of the suffix as $\mathrm{active}_{\hat{R},\hat{S}}.\mathrm{suff}$. The next comparison is made between the corresponding suffixes of $\hat{R}[0]$ of length $w(\hat{R}[0]) - \mathrm{active}_{\hat{R},\hat{S}}.\mathrm{suff}$ and $\hat{S}[1]$, identifying first the minimum length of the two, and proceeding with the same process.
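The scan described so far can be mimicked with plain Python sets, which may help in following the roles of chop and active. The sketch below is not the linear-time GDSC: it replaces the generalized suffix tree with set operations on explicit strings, so its cost depends on string lengths. The names `width`, `chop_and_active`, `gdsc_sets` and `FIG1C` are hypothetical, GD strings are assumed to be lists of non-empty sets of equal-length strings, and `FIG1C` is the GD string of Figure 1(c) as transcribed from the extracted text.

```python
def width(letter):
    """Common length of the strings in a non-empty degenerate letter."""
    return len(next(iter(letter)))

def chop_and_active(X, Y):
    """chop and active suffixes of Definition 6; X, Y may come in either order."""
    if width(X) < width(Y):
        X, Y = Y, X                      # make Y the letter with the shorter strings
    h = width(Y)
    chop = {x[:h] for x in X} & Y        # strings of Y that are prefixes in X
    active = {x[h:] for x in X if x[:h] in chop and len(x) > h}
    return chop, active

def gdsc_sets(R, S):
    """Return 1 if L(R) and L(S) share a string, else 0 (set-based sketch)."""
    i = j = 0
    active, owner = set(), None          # leftover suffixes and the side they belong to
    while True:
        need_r, need_s = owner != "R", owner != "S"
        if (need_r and i == len(R)) or (need_s and j == len(S)):
            # A side ran out: success only if both ended exactly together.
            return int(owner is None and i == len(R) and j == len(S))
        A = R[i] if need_r else active   # next letter of R, or its pending suffixes
        B = S[j] if need_s else active   # next letter of S, or its pending suffixes
        i, j = i + need_r, j + need_s
        chop, active = chop_and_active(A, B)
        if not chop:
            return 0                     # no common continuation
        # The side with the longer strings keeps the unmatched suffixes.
        owner = None if not active else ("R" if width(A) > width(B) else "S")

# GD string of Figure 1(c), as transcribed (an assumption of this sketch).
FIG1C = [{"A"}, {"GC", "AG"}, {"TCT", "CGA", "TCA"}, {"A"},
         {"TCTC", "GCTC", "CGCA"}, {"G"}]
```

When `active` becomes empty the two prefixes read so far are synchronized, and the loop simply restarts with the next pair of degenerate letters, which corresponds to the factorization of Lemma 10.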
The comparison of letters can be: (i) between $\hat{R}[i]$ and $\hat{S}[j]$; or (ii) between the corresponding strings of $\mathrm{active}_{\hat{R},\hat{S}}.\mathrm{index}$ and $\hat{R}[i]$; or (iii) between the corresponding strings of $\mathrm{active}_{\hat{R},\hat{S}}.\mathrm{index}$ and $\hat{S}[j]$. If the two GD strings have a synchronized proper prefix, this will result in $\mathrm{active}_{\hat{R},\hat{S}} = \emptyset$ at positions $i$ in $\hat{R}$ and $j$ in $\hat{S}$. At this point, the comparison is restarted with the immediately following pair of degenerate letters.

Theorem 17. Algorithm GDSC is correct. Given two GD strings $\hat{R}$ and $\hat{S}$ of total sizes $N$ and $M$, respectively, over an integer alphabet, algorithm GDSC requires $O(N + M)$ time.

Proof. The correctness follows directly from Lemma 10, Lemma 15, and Theorem 16.

Constructing the generalized suffix tree $T_{\hat{R},\hat{S}}$ can be done in time $O(N + M)$ [12]. For a pair of sets $(\hat{A}_i, \hat{B}_i)$ as in Definition 11, such that $w(\hat{A}_i) = k$ and $w(\hat{A}_i) \le w(\hat{B}_i)$, we query $T_{\hat{R},\hat{S}}$ with the $k$-length prefixes of strings in $\hat{B}_i$. For integer alphabets, instead of spelling the strings from the root of $T_{\hat{R},\hat{S}}$, we locate the corresponding terminal nodes for $(\hat{A}_i, \hat{B}_i)$. It then suffices to find longest common prefixes between these suffixes to simulate the querying process. Since all suffixes are lexicographically sorted during the construction of $T_{\hat{R},\hat{S}}$, we can also have the suffixes considered by pair $(\hat{A}_i, \hat{B}_i)$ lexicographically ranked with respect to $(\hat{A}_i, \hat{B}_i)$. Hence we do not perform the longest common prefix operation for all possible suffix pairs, but only for the lexicographically adjacent ones within this group. This can be done in $O(1)$ time per pair after $O(N + M)$-time pre-processing over $T_{\hat{R},\hat{S}}$ [7]. $\mathrm{chop}_i$ is thus populated with the $k$-length prefixes of strings in $\hat{B}_i$ found in $\hat{A}_i$. The set $\mathrm{active}_{\hat{A}_i,\hat{B}_i}$ of active suffixes can be found by chopping the suffixes of the strings in $\hat{B}_i$ from their prefixes successfully queried in $T_{\hat{R},\hat{S}}$. This requires time $O(|\hat{A}_i| + |\hat{B}_i|)$ for processing $(\hat{A}_i, \hat{B}_i)$.

Let $\hat{R}$ and $\hat{S}$ be of length $r$ and $s$, respectively. Assume that $\hat{R}$ and $\hat{S}$ have no synchronized proper prefixes. Then Theorem 16 ensures that the total number of comparisons cannot exceed $r + s - 2$: this results in a time complexity of

$$O\Big(N + M + \sum_{i=0}^{r+s-2} (|\hat{A}_i| + |\hat{B}_i|)\Big) = O(N + M).$$

If $\hat{R}$ and $\hat{S}$ have synchronized proper prefixes, we perform the comparison up to the shortest synchronized prefixes (i.e. until the set of active suffixes becomes empty) and then restart the procedure from the immediately following pair of degenerate letters. Clearly the total number of comparisons also in this case cannot be more than $r + s - 2$. ∎

4 Computing Palindromes in GD Strings

Armed with the efficient GD string comparison tool, we shift our focus to our initial motivation, namely, computing palindromes in GD strings.

Definition 18. A GD string $\hat{S}$ is a GD palindrome if there exists a string in $L(\hat{S})$ that is a palindrome. A GD palindrome $\hat{S}[i] \ldots \hat{S}[j]$ in $\hat{S}$, whose total width is $w(\hat{S}[i] \ldots \hat{S}[j])$, can be encoded as a pair $(c, r)$, where its center is

$$c = \frac{w(\hat{S}[0] \ldots \hat{S}[i-1]) + w(\hat{S}[0] \ldots \hat{S}[j]) - 1}{2} \text{ when } i > 0, \qquad c = \frac{w(\hat{S}[0] \ldots \hat{S}[j]) - 1}{2} \text{ when } i = 0;$$

its radius is $r = \frac{w(\hat{S}[i] \ldots \hat{S}[j])}{2}$. $\hat{S}[i] \ldots \hat{S}[j]$ is called maximal if no other GD palindrome $(c, r')$ exists in $\hat{S}$ with $r' > r$.

Note that we only consider the GD palindromes $\hat{S}[i] \ldots \hat{S}[j]$ that start with the first letter of some string $X \in \hat{S}[i]$ and end with the last letter of some string $Y \in \hat{S}[j]$, while the center can be anywhere: in between or inside degenerate letters. That is, in $\hat{S}$ there are $2 \cdot w(\hat{S}) - 1 = 2W - 1$ possible centers.

Example 19. Consider the GD string $\hat{S}$ of Figure 1(c) where palindromes are underlined; one starts at $\hat{S}[0]$ and ends at $\hat{S}[2]$: it corresponds to $(c, r) = (2.5, 3)$. A second palindrome starts at $\hat{S}[4]$ and ends at $\hat{S}[5]$: it corresponds to $(c, r) = (9, 2.5)$.

In this section, we consider the following problem. Given a GD string $\hat{S}$ of length $n$, total size $N$, and total width $W$, find all GD strings $\hat{S}[i] \ldots \hat{S}[j]$, with $0 \le i \le j \le n - 1$, that are GD palindromes. We give two alternative algorithms: one finds all GD palindromes by checking all $(i, j)$ pairs; the other finds them starting from all possible centers. The two algorithms have different time complexities: which one is faster depends on $W$, $N$, and $n$. In fact, they compute all GD palindromes, but report only the maximal ones.

We first describe algorithm MaxPalPairs. For all positions $i, j$ within $\hat{S}$, in order to check whether $\hat{S}[i] \ldots \hat{S}[j]$ is a GD palindrome, we apply the GDSC algorithm to $\hat{S}[i] \ldots \hat{S}[j]$ and its reverse, denoted by $\mathrm{rev}(\hat{S}[i] \ldots \hat{S}[j])$; the reverse is defined by reversing the sequence of degenerate letters and also reversing the strings in every degenerate letter. GD palindromes are, finally, sorted per center, and the maximal GD palindromes are reported. Sorting the $(i, j)$ pairs by their centers can be done in $O(W)$ time using bucket sort, which is bounded by $O(N)$ since $N \ge W$. Since there are $O(n^2)$ pairs $(i, j)$, and since by Theorem 17 algorithm GDSC takes time proportional to the total size of $\hat{S}[i] \ldots \hat{S}[j]$ to check whether $\hat{S}[i] \ldots \hat{S}[j]$ is a GD palindrome, algorithm MaxPalPairs takes $O(n^2 N)$ time in total.

In algorithm MaxPalCenters, we consider all possible centers $c$ of $\hat{S}$. In the case when $c$ is in between two degenerate letters we simply try to extend to the left and to the right via applying GDSC. In the case when $c$ is inside a degenerate letter we intuitively split the letter vertically into two letters and try to extend to the left and to the right via applying GDSC. At each extension step of this procedure we maintain two GD strings $\hat{L}$ (left of the center) and $\hat{R}$ (right of the center) such that they are of the same total width. We consider the reverse of $\hat{L}$ (similar to algorithm MaxPalPairs) for the comparison.
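The enumeration over all $(i, j)$ pairs and the per-center reporting of maximal GD palindromes can be cross-checked on small inputs with a brute-force reference. The sketch below is not MaxPalPairs itself: instead of calling GDSC it enumerates $L(\hat{S}[i] \ldots \hat{S}[j])$ explicitly and tests Definition 18 directly, so it is exponential in general. The names `language`, `maximal_gd_palindromes` and `FIG1C` are hypothetical, centers are stored doubled ($2c$) to stay integral, and `FIG1C` is the GD string of Figure 1(c) as transcribed from the extracted text.

```python
from itertools import product

def language(gd):
    """All strings of L(gd), by Cartesian concatenation (exponential in general)."""
    return {"".join(parts) for parts in product(*gd)}

def maximal_gd_palindromes(S):
    """Map doubled center 2c -> (i, j) of the widest GD palindrome with that center."""
    prefix = [0]                          # prefix[t] = w(S[0] ... S[t-1])
    for letter in S:
        prefix.append(prefix[-1] + len(next(iter(letter))))
    best = {}
    for i in range(len(S)):
        for j in range(i, len(S)):
            # Definition 18: some string of the language must be a palindrome.
            if any(z == z[::-1] for z in language(S[i:j + 1])):
                c2 = prefix[i] + prefix[j + 1] - 1      # 2c, kept integral
                w2 = prefix[j + 1] - prefix[i]          # 2r, the total width
                if c2 not in best or w2 > best[c2][0]:
                    best[c2] = (w2, i, j)
    return {c2: (i, j) for c2, (w2, i, j) in sorted(best.items())}

# GD string of Figure 1(c), as transcribed (an assumption of this sketch).
FIG1C = [{"A"}, {"GC", "AG"}, {"TCT", "CGA", "TCA"}, {"A"},
         {"TCTC", "GCTC", "CGCA"}, {"G"}]
```

On `FIG1C`, the entries with doubled centers 5 and 18 correspond to the two palindromes of Example 19, $(c, r) = (2.5, 3)$ and $(9, 2.5)$.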
In the case where $c$ occurs inside a degenerate letter, to make sure we do not identify palindromes that do not exist, for all $j$ split strings of the degenerate letter we check that $\hat{L}^R[0][j][0 \ldots k-1] = \hat{R}[0][j][0 \ldots k-1]$, where $\hat{L}^R = \mathrm{rev}(\hat{L})$ and $k = \min(w(\hat{L}^R[0]), w(\hat{R}[0]))$. If no matches are found, we move on to the next center. Otherwise, when a match is found, we update $\mathrm{rev}(\hat{L})$ and $\hat{R}$ with the remainder of the split degenerate letter (if its length is greater than $k$), as well as the next degenerate letters. Algorithm GDSC is applied to compare $\mathrm{rev}(\hat{L})$ and $\hat{R}$. After a positive comparison, we overwrite $\hat{L}$ and $\hat{R}$ by adding the degenerate letters of the current extension until $w(\hat{L}) = w(\hat{R})$ (or until the end of the string is reached). This process is repeated as long as GDSC returns a positive comparison, that is, until the maximal GD palindrome with center $c$ is found. The radius reported is then the total sum of all values of $w(\hat{L})$. If GDSC returns a negative comparison at center $c$, we proceed with the next center, because a GD palindrome centered at $c$ clearly cannot be extended further if $\mathrm{rev}(\hat{L}) \cap \hat{R}$ is empty. By Theorem 17 and the fact that there are $2W - 1$ possible centers, algorithm MaxPalCenters takes $O(WN)$ time in total. We obtain the following result.

Theorem 20. Given a GD string of length $n$, total size $N$, and total width $W$, over an integer alphabet, all (maximal) GD palindromes can be computed in time $O(\min\{W, n^2\} N)$.

A problem that has gained significant attention recently is the factorization of a string $X$ of length $n$ into a sequence of palindromes [3, 13, 30, 9, 5, 2]. We say that $X_1, X_2, \ldots, X_\ell$ is a (maximal) palindromic factorization of string $X$ if every $X_i$ is a (maximal) palindrome, $X = X_1 X_2 \ldots X_\ell$, and $\ell$ is minimal. In biological applications we need to factorize a sequence into palindromes in order to identify hairpins, patterns that occur in single-stranded DNA or, more commonly, in RNA.
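To make the notion of a palindromic factorization of an ordinary string concrete, here is a minimal Python sketch. It finds a factorization with the fewest factors by breadth-first search over cut positions, using every palindromic substring as a candidate factor (not only maximal ones). It is a simple polynomial-time illustration and not one of the algorithms from the cited works; the function name is hypothetical.

```python
from collections import deque

def palindromic_factorization(x):
    """Split x into the fewest palindromes via BFS over positions 0..len(x)."""
    n = len(x)
    parent = {0: None}          # cut position -> previous cut position
    queue = deque([0])
    while queue:
        i = queue.popleft()
        if i == n:
            break               # BFS reaches n first along a shortest path
        for j in range(i + 1, n + 1):
            piece = x[i:j]
            if j not in parent and piece == piece[::-1]:
                parent[j] = i
                queue.append(j)
    # Walk back from n to 0 to recover the cut positions.
    cuts, pos = [], n
    while pos is not None:
        cuts.append(pos)
        pos = parent[pos]
    cuts.reverse()
    return [x[a:b] for a, b in zip(cuts, cuts[1:])]
```

Since every single letter is a palindrome, position `len(x)` is always reachable, so the walk back from `n` is well defined.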
Next, we define and solve the same problem for GD strings.

Definition 21. A (maximal) GD palindromic factorization of a GD string $\hat{S}$ is a sequence $\hat{P}_1, \ldots, \hat{P}_\ell$ of GD strings, such that: (i) every $\hat{P}_i$ is either a (maximal) GD palindrome or a degenerate letter of $\hat{S}$; (ii) $\hat{S} = \hat{P}_1 \ldots \hat{P}_\ell$; (iii) $\ell$ is minimal.

After locating all (maximal) GD palindromes in $\hat{S}$ using Theorem 20, we are in a position to amend the algorithm of Alatabbi et al. [3] to find a (maximal) GD palindromic factorization of $\hat{S}$. We define a directed graph $G_{\hat{S}} = (V, E)$, where $V = \{i \mid 0 \le i \le n\}$ and $E = \{(i, j + 1) \mid \hat{S}[i \ldots j] \text{ is a (maximal) GD palindrome of } \hat{S}\} \cup \{(i, i + 1) \mid 0 \le i < n\}$. Note that $V$ contains a node $n$, which is the sink of the edges representing (maximal) GD palindromes ending at $\hat{S}[n - 1]$. For maximal GD palindromes, $E$ contains no more than $3W$ edges, as the maximum number of maximal GD palindromes is $2W - 1$ and there are $n \le W$ edges of the form $(i, i + 1)$. For GD palindromes, $E$ contains $O(n^2)$ edges, as the maximum number of GD palindromes is $O(n^2)$. A shortest path in $G_{\hat{S}}$ from $0$ to $n$ gives a (maximal) GD palindromic factorization. For maximal GD palindromes, the size of $G_{\hat{S}}$ is $O(W)$, as $n \le W$, and so finding this shortest path requires $O(W)$ time using a standard algorithm. For GD palindromes, the size of $G_{\hat{S}}$, and thus the time, is $O(n^2)$.

Theorem 22. Given a GD string $\hat{S}$ of length $n$, total size $N$, and total width $W$, over an integer alphabet, a (maximal) GD palindromic factorization of $\hat{S}$ can be computed in time $O(\min\{W, n^2\} N)$.

5 A Conditional Lower Bound under SETH

In this section, we show a conditional lower bound for computing palindromes in degenerate strings. Let us first define the 2-Orthogonal Vectors problem. Given two sets $A = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ and $B = \{\beta_1, \beta_2, \ldots, \beta_n\}$ of $d$-bit vectors, where $d = \omega(\log n)$, the 2-Orthogonal Vectors problem asks the following question: is there any pair $\alpha_i, \beta_j$ of vectors that is orthogonal? Namely, is $\sum_{k=0}^{d-1} \alpha_i[k] \cdot \beta_j[k]$ equal to 0?
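The definition of the 2-Orthogonal Vectors problem can be restated as a direct check. The Python sketch below is the obvious quadratic-time test over all pairs, assuming vectors are given as equal-length sequences of 0/1 integers; the function name is hypothetical.

```python
def has_orthogonal_pair(a_vectors, b_vectors):
    """Return True if some alpha in A and beta in B have inner product 0."""
    for alpha in a_vectors:
        for beta in b_vectors:
            # Inner product over the d coordinates of the two bit vectors.
            if sum(x * y for x, y in zip(alpha, beta)) == 0:
                return True
    return False
```

This takes $O(n^2 d)$ time, which is the baseline that the conditional lower bound is measured against.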
For the moderate dimension of this problem, we follow [16], assuming $n^{2-\epsilon} d^{O(1)} \le n^2 d$. The following result is known.

Theorem 23 ([16, 21, 22, 33]). The 2-Orthogonal Vectors problem cannot be solved in $O(n^{2-\epsilon} \cdot d^{O(1)})$ time, for any $\epsilon > 0$, unless the Strong Exponential Time Hypothesis fails.

We next show that the 2-Orthogonal Vectors problem can be reduced to computing maximal palindromes in degenerate strings, thus obtaining a conditional lower bound similar to the upper bound obtained in Theorem 20 for computing all GD palindromes.

Theorem 24. Given a degenerate string of length $4n$ over an alphabet of size $\sigma = \omega(\log n)$, all maximal GD palindromes cannot be computed in $O(n^{2-\epsilon} \cdot \sigma^{O(1)})$ time, for any $\epsilon > 0$, unless the Strong Exponential Time Hypothesis fails.

Proof. Let $d = \sigma$ and consider the alphabet $\Sigma = \{0, 1, \ldots, \sigma - 1\}$. We say that two subsets of $\Sigma$ match if they have a common element. Given a $d$-bit vector $\alpha$, we define $\mu(\alpha)$ to be the following subset of $\Sigma$: $s \in \mu(\alpha)$ if and only if $\alpha[s] = 1$. Thus, two vectors $\alpha$ and $\beta$ are orthogonal if and only if the sets $\mu(\alpha)$ and $\mu(\beta)$ are disjoint. In the string comparison setting, two degenerate letters $\mu(\alpha)$ and $\mu(\beta)$ do not match if and only if $\alpha$ and $\beta$ are orthogonal. The reduction works as follows. Given $A = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ and $B = \{\beta_1, \beta_2, \ldots, \beta_n\}$, we construct the following simple degenerate string of length $4n$ in time $O(n\sigma)$:

$$S = \mu(\alpha_1)\mu(\beta_1)\mu(\alpha_2)\mu(\beta_2) \cdots \mu(\alpha_n)\mu(\beta_n)\, \mu(\alpha_1)\mu(\beta_1)\mu(\alpha_2)\mu(\beta_2) \cdots \mu(\alpha_n)\mu(\beta_n).$$

Then the 2-Orthogonal Vectors problem for the sets $A$ and $B$ has a positive answer if and only if at any position of $S$, from $0$ to $2n$, there does not occur a palindrome of length at least $2n$. All such occurrences can be easily verified from the respective palindrome centers in time $O(n)$.
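As an aside to the proof, the encoding $\mu$ and the construction of $S$ can be sketched in Python as follows. The sketch assumes vectors are sequences of 0/1 integers and represents each degenerate letter as a `frozenset` of alphabet symbols; the function names are hypothetical. It only builds the degenerate string and does not compute palindromes.

```python
def mu(vector):
    """Map a bit vector to the set of positions s with vector[s] == 1."""
    return frozenset(s for s, bit in enumerate(vector) if bit == 1)

def letters_match(x, y):
    """Two degenerate letters match if they share at least one symbol."""
    return not x.isdisjoint(y)

def build_degenerate_string(a_vectors, b_vectors):
    """Interleave mu(alpha_i), mu(beta_i) for i = 1..n, then repeat the block."""
    block = []
    for alpha, beta in zip(a_vectors, b_vectors):
        block.append(mu(alpha))
        block.append(mu(beta))
    return block + block        # length 4n when both inputs have n vectors
```

With this encoding, `letters_match(mu(alpha), mu(beta))` is false exactly when `alpha` and `beta` are orthogonal.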
In other words, if at any position of $S$ there does not occur a palindrome of length at least $2n$, this is because we have a mismatch between a pair $\mu(\alpha_i), \mu(\beta_j)$ of letters, which implies that there exists a pair $\alpha_i, \beta_j$ of orthogonal vectors. Also, by the construction, all such pairs are to be (implicitly) compared, and thus, if there exists any pair that is orthogonal, the corresponding mismatch will result in a palindrome of length less than $2n$. ∎

6 Experimental Results

We present here a proof-of-concept experiment, but we anticipate that the algorithmic tools developed in this paper are applicable in a wide range of biological applications.

We first obtained the amino acid sequences of 5 immunoglobulins within the human V regions [15] and converted these into mRNA sequences [31]. The letters X, S, T, Y, Z, R and H were replaced by degenerate letters according to IUPAC [23]. Each other letter, $c \in \{A, C, G, U\}$, was treated as a single degenerate letter $\{c\}$. On average, 47% of the positions within the 5 sequences consisted of one of the following: X, S, T, Y, Z, R and H. We then used algorithm MaxPalPairs to find all maximal palindromes in the 5 sequences. Table 1 shows the palindromes identified within hypervariable regions I and II. Our results are in accordance with Wuilmart et al. [34], who presented a statistical (fundamentally different) method to identify the location of palindromes within regions of immunoglobulin genes. The ranges we report are greater than or equal to those of [34] due to the maximality criterion.

References

1 Karl Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039–1051, 1987.
2 Michał Adamczyk, Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, and Jakub Radoszewski. Palindromic decompositions with gaps and errors. In CSR, volume 10304 of LNCS, pages 48–61. Springer International Publishing, 2017.
3 Ali Alatabbi, Costas S. Iliopoulos, and M. Sohel Rahman. Maximal palindromic factorization. In PSC, pages 70–77, 2013.
4 Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos. On avoided words, absent words, and their application to biological sequence analysis. Algorithms for Molecular Biology, 12(1):5, 2017.
5 Mai Alzamel, Jia Gao, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Efficient computation of palindromes in sequences with uncertainties. In EANN, volume 744 of CCIS, pages 620–629. Springer, 2017.
6 Alberto Apostolico, Dany Breslauer, and Zvi Galil. Parallel detection of all palindromes in a string. Theoretical Computer Science, 141(1):163–173, 1995.
7 Michael A. Bender and Martín Farach-Colton. The LCA problem revisited. In LATIN, volume 1776 of LNCS, pages 88–94. Springer, 2000.
8 Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Pattern matching on elastic-degenerate text with errors. In SPIRE, volume 10508 of LNCS, pages 74–90. Springer, 2017.
9 Kirill Borozdin, Dmitry Kosolobov, Mikhail Rubinchik, and Arseny M. Shur. Palindromic Length in Linear Time. In CPM, volume 78 of LIPIcs, pages 23:1–23:12. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
10 The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, pages 1–18, 2016.
11 Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Covering problems for partial words and for indeterminate strings. Theoretical Computer Science, 698:25–39, 2017.
12 Martin Farach. Optimal suffix tree construction with large alphabets. In FOCS, pages 137–143. IEEE, 1997.
13 Gabriele Fici, Travis Gagie, Juha Kärkkäinen, and Dominik Kempa. A subquadratic algorithm for minimum palindromic factorization. Journal of Discrete Algorithms, 28:41–48, 2014.
14 Martin C. Frith, Ulla Hansen, John L. Spouge, and Zhiping Weng. Finding functional sequence elements by multiple local alignment. Nucleic Acids Res., 32(1):189–200, 2004.
15 J. A. Gally and G. M. Edelman. The genetic control of immunoglobulin synthesis. Annual Review of Genetics, 6(1):1–46, 1972.
16 Jiawei Gao and Russell Impagliazzo. Orthogonal vectors is hard for first-order properties on sparse graphs. Electronic Colloquium on Computational Complexity (ECCC), 23:53, 2016.
17 Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-line pattern matching on a set of similar texts. In CPM, LIPIcs. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
18 Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA, 1997.
19 Costas S. Iliopoulos, Ritu Kundu, and Solon P. Pissis. Efficient pattern matching in elastic-degenerate texts. In LATA, volume 10168 of LNCS, pages 131–142. Springer International Publishing, 2017.
20 Costas S. Iliopoulos and Jakub Radoszewski. Truly Subquadratic-Time Extension Queries and Periodicity Detection in Strings with Uncertainties. In CPM, volume 54 of LIPIcs, pages 8:1–8:12, Dagstuhl, Germany, 2016. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
21 Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. J. Comput. Syst. Sci., 62(2):367–375, 2001.
22 Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? J. Comput. Syst. Sci., 63(4):512–530, 2001.
23 IUPAC-IUB Commission on Biochemical Nomenclature. Abbreviations and symbols for nucleic acids, polynucleotides, and their constituents. Biochemistry, 9(20):4022–4027, 1970.
24 Richard J. Lipton. On The Intersection of Finite Automata, pages 145–148. Springer US, Boston, MA, 2010.
25 Glenn Manacher. A new linear-time "on-line" algorithm for finding the smallest initial palindrome of a string. Journal of the ACM, 22(3):346–351, 1975.
26 Lee Ann McCue, William Thompson, Steven Carmack, Michael P. Ryan, Jun S. Liu, Victoria Derbyshire, and Charles E. Lawrence. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res., 29(3):774–782, 2001.
27 Brejnev Muhizi Muhire, Michael Golden, Ben Murrell, Pierre Lefeuvre, Jean-Michel Lett, Alistair Gray, Art YF Poon, Nobubelo Kwanele Ngandu, Yves Semegni, Emil Pavlov Tanov, et al. Evidence of pervasive biologically functional secondary structures within the genomes of eukaryotic single-stranded DNA viruses. Journal of Virology, 88(4):1972–1989, 2014.
28 Eugene W. Myers. Approximate matching of network expressions with spacers. Journal of Computational Biology, 3(1):33–51, 1996.
29 Nadia Pisanti, Henry Soldano, Mathilde Carpentier, and Joël Pothier. A relational extension of the notion of motifs: Application to the common 3d protein substructures searching problem. Journal of Computational Biology, 16(12):1635–1660, 2009.
30 Mikhail Rubinchik and Arseny M. Shur. Eertree: An efficient data structure for processing palindromes in strings. In IWOCA, volume 9538 of LNCS, pages 321–333. Springer International Publishing, 2016.
31 Randall T. Schuh. Major patterns in vertebrate evolution. Systematic Biology, 27(2):172, 1978.
32 Henry Soldano, Alain Viari, and Marc Champesme. Searching for flexible repeated patterns using a non-transitive similarity relation. Pattern Recognition Letters, 16(3):233–246, 1995.
33 Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci., 348(2-3):357–365, 2005.
34 C. Wuilmart, J. Urbain, and D. Givol. On the location of palindromes in immunoglobulin genes. Proceedings of the National Academy of Sciences of the United States of America, 74(6):2526–2530, 1977.

APPENDIX

A GD String Comparison Using Automata

Example 25. We illustrate here a simple automata-based approach. Say we want to compare the following two GD strings $\hat{R}$ and $\hat{S}$.

[Display of the two GD strings; surviving fragment: Rˆ = (AC) (ACAAC) CC · CACCC]

We construct the DFA for $\hat{R}$ and the DFA for $\hat{S}$.

[Figure: the DFA for $\hat{R}$ and the DFA for $\hat{S}$, followed by their product DFA.]

Their product DFA gives their intersection: ACACAAC and CCCACCC.



Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone. Degenerate String Comparison and Applications, LIPICS - Leibniz International Proceedings in Informatics, 2018, 21:1-21:14, DOI: 10.4230/LIPIcs.WABI.2018.21