The longest letter-duplicated subsequence and related problems
Acta Informatica
https://doi.org/10.1007/s00236-024-00459-7
ORIGINAL ARTICLE
Wenfeng Lai¹ · Adiesha Liyanage² · Binhai Zhu² · Peng Zou²
Received: 9 December 2023 / Accepted: 11 July 2024
© The Author(s) 2024
Abstract
Motivated by computing duplication patterns in sequences, a new problem called the longest
letter-duplicated subsequence (LLDS) is proposed. Given a sequence S of length n, a letter-duplicated subsequence is a subsequence of S of the form $x_1^{d_1} x_2^{d_2} \cdots x_k^{d_k}$ with $x_i \in \Sigma$,
$x_j \neq x_{j+1}$ and $d_i \geq 2$ for all i in [k] and j in [k − 1]. A linear time algorithm for computing
a longest letter-duplicated subsequence (LLDS) of S can be easily obtained. In this paper, we
focus on two variants of this problem: (1) the ‘all-appearance’ version, i.e., all letters in $\Sigma$ must
appear in the solution, and (2) the weighted version. For the former, we obtain dichotomous
results: We prove that, when each letter appears in S at least 4 times, the problem and a relaxed
version on feasibility testing (FT) are both NP-hard. The reduction is from $(3^+, 1, 2^-)$-SAT, where all 3-clauses (i.e., containing 3 literals) are monotone (i.e., containing only positive
literals) and all 2-clauses contain only negative literals. We then show that when each letter
appears in S at most 3 times, then the problem admits an O(n) time algorithm. Finally,
we consider the weighted version, where the weight of a block $x_i^{d_i}$ ($d_i \geq 2$) could be any
positive function which might not grow with $d_i$. We give a non-trivial $O(n^2)$ time dynamic
programming algorithm for this version, i.e., computing an LD-subsequence of S whose
weight is maximized.
Adiesha Liyanage, Binhai Zhu and Peng Zou have contributed equally.
✉ Binhai Zhu
Wenfeng Lai
Adiesha Liyanage
Peng Zou
¹ School of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China
² Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA
1 Introduction
In biology, duplication is an important part of evolution. There are two kinds of duplications:
arbitrary segmental duplications (i.e., select a segment and paste it somewhere else) and
tandem duplications (which are of the form X → XX, where X is any segment of the input
sequence). It is known that the former duplications occur frequently in cancer genomes [4,
16, 20]. On the other hand, the latter are common in different scenarios; for example, the
tandem duplication of the 3 nucleotides CAG is known to be closely related to Huntington's
disease [15]. In addition, tandem duplications can occur at the genome level (across
different genes) for certain types of cancer [17]. In fact, as early as 1980, Szostak and
Wu provided evidence that gene duplication is the main driving force behind evolution, and
the majority of duplications are tandem [21]. Consequently, it was not a surprise that in the
first sequenced human genome around 3% of the genetic contents are in the form of tandem
repeats [13].
Independently, tandem duplications were also studied in copying systems [7], as well as in
formal languages [2, 5, 22]. In 2004, Leupold et al. posed a fundamental question regarding
tandem duplications: what is the complexity of computing the minimum tandem duplication
distance between two sequences A and B (i.e., the minimum number of tandem duplications
needed to convert A to B)? In 2020, Lafond et al. [9] answered this open question by proving that this
problem is NP-hard for an unbounded alphabet. In fact, Lafond et al. proved later that the
problem is NP-hard even if $|\Sigma| \geq 4$ by encoding each letter in the unbounded alphabet proof
with a square-free string over a new alphabet of size 4 (modified from Leech’s construction
[14]), which covers the case most relevant to biology, i.e., when $\Sigma = \{A, C, G, T\}$ (for
DNA sequences) or $\Sigma = \{A, C, G, U\}$ (for RNA sequences) [11]. Independently, Cicalese
and Pilati showed that the problem is NP-hard for $|\Sigma| = 5$ using a different encoding method
[3].
Motivated by the above applications (especially when some mutations occur after the
duplications), some new problems related to duplications are proposed and studied in this
paper. Given a sequence S of length n, a letter-duplicated subsequence (LDS) of S is a
subsequence of S of the form $x_1^{d_1} x_2^{d_2} \cdots x_k^{d_k}$ with $x_i \in \Sigma$, where $x_j \neq x_{j+1}$ and $d_i \geq 2$
for all i in [k] and j in [k − 1]. (Each $x_i^{d_i}$ is called an LD-block.) Naturally, the problem of
computing a longest letter-duplicated subsequence (LLDS) of S can be defined, and a simple
linear time algorithm can be obtained. An example can show the idea behind this problem:
B = AACACAGATGAT, and due to local mutations, insertions and deletions it becomes
S = AACACGTCGAT, but a longest letter-duplicated subsequence $X_1$ = AACCGG or
$X_2$ = AACCTT would still give us the skeleton of the initial sequence B. (Recently, Lafond
et al. [10] have considered a slightly more complex version but the corresponding running
times are significantly higher. In the conclusion section, we will discuss that perspective a
little more.)
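To make the definition and the example concrete, the following Python sketch checks whether a string is an LDS of S and computes the length of a longest one. It is only one possible way to reach linear time and is not claimed to be the algorithm intended above; the function names is_lds and llds_length are hypothetical. The sketch assumes two observations: a block can always be taken over consecutive occurrences of its letter, and any block of length at least 2 can be cut into pieces of length 2 and 3 (adjacent pieces over the same letter simply merge back into one block, so the total length is unaffected). Hence, for each position it suffices to look at the previous two occurrences of the same letter.

```python
from itertools import groupby


def is_lds(x, s):
    """Return True if x is a letter-duplicated subsequence of s.

    Assumption of this sketch: the empty string (k = 0) counts as valid.
    """
    # Every maximal run of x is one LD-block and must have length >= 2.
    if any(len(list(run)) < 2 for _, run in groupby(x)):
        return False
    # Standard subsequence test: consume s left to right.
    it = iter(s)
    return all(ch in it for ch in x)


def llds_length(s):
    """Length of a longest letter-duplicated subsequence of s.

    best[i] is the optimum for the prefix s[:i]. A block ending at
    position i is assumed to consist of the last 2 or the last 3
    occurrences of s[i-1]; longer blocks arise by chaining such pieces.
    """
    n = len(s)
    best = [0] * (n + 1)
    # letter -> (latest position, position before that), 1-based, 0 = none
    last = {}
    for i in range(1, n + 1):
        c = s[i - 1]
        p1, p2 = last.get(c, (0, 0))
        best[i] = best[i - 1]              # position i is not used
        if p1:                             # piece of length 2 ending at i
            best[i] = max(best[i], best[p1 - 1] + 2)
        if p2:                             # piece of length 3 ending at i
            best[i] = max(best[i], best[p2 - 1] + 3)
        last[c] = (i, p1)
    return best[n]


if __name__ == "__main__":
    S = "AACACGTCGAT"
    print(is_lds("AACCGG", S), is_lds("AACCTT", S), llds_length(S))
```

On the example S = AACACGTCGAT, both AACCGG and AACCTT pass the check and the computed optimum is 6, matching their length. With a dictionary for the previous occurrences, the loop does constant expected work per position.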
We remark that recently a similar problem called longest run subsequence was studied by
Schrinner et al. [18, 19]. It differs from our problem in that each letter appears consecutively
at most once in the solution as a run (which is a substring containing one or more repetitions
of the same letter), while the goal is the same, i.e., the length of such a subsequence is to be
maximized. For this problem, additional results on FPT intractability can be found in [6] and
additional approximation results can be found in [1].
In this paper, we study some important variants of the LLDS problem, focusing
on the constrained and weighted cases. The constraint is to demand that all letters in $\Sigma$
appear in a resulting LDS, which models the question of how, in a genome with duplicated genes, to
compute the maximum duplicated pattern while including all the genes. Then we have two
problems: feasibility testing (FT for short, which decides whether an LDS of S containing
all letters in $\Sigma$ exists) and the problem of maximizing the length of a resulting LDS where all
letters in the alphabet appear, which we call LLDS+. It turns out that the status of these two
problems changes quite a bit when d, the maximum number of times a letter can appear in S, varies.
We denote the corresponding problems as FT(d) and LLDS+(d) respectively. Let |S| = n.
We summarize our main results in this paper as follows:
1. We show that when d ≥ 4, both FT(d) and (the decision version of) LLDS+(d) are NP-complete, which implies that LLDS+(d) does not have a polynomial-time approximation
algorithm when d ≥ 4.
2. We show that when d = 3, both FT(d) and LLDS+(d) admit an O(n) time algorithm,
by exploiting a new property of the problem.
3. When a weight of an LD-block is any positive function (i.e., it does not even have t (...truncated)