The longest letter-duplicated subsequence and related problems

Acta Informatica
https://doi.org/10.1007/s00236-024-00459-7

ORIGINAL ARTICLE

Wenfeng Lai¹ · Adiesha Liyanage² · Binhai Zhu² · Peng Zou²

Received: 9 December 2023 / Accepted: 11 July 2024
© The Author(s) 2024

Adiesha Liyanage, Binhai Zhu and Peng Zou have contributed equally.

¹ School of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China
² Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA

Abstract

Motivated by computing duplication patterns in sequences, a new problem called the longest letter-duplicated subsequence (LLDS) is proposed. Given a sequence $S$ of length $n$, a letter-duplicated subsequence is a subsequence of $S$ in the form of $x_1^{d_1} x_2^{d_2} \cdots x_k^{d_k}$ with $x_i \in \Sigma$, $x_j \neq x_{j+1}$ and $d_i \geq 2$ for all $i$ in $[k]$ and $j$ in $[k-1]$. A linear time algorithm for computing a longest letter-duplicated subsequence (LLDS) of $S$ can be easily obtained. In this paper, we focus on two variants of this problem: (1) the 'all-appearance' version, i.e., all letters in $\Sigma$ must appear in the solution, and (2) the weighted version. For the former, we obtain dichotomous results: we prove that, when each letter appears in $S$ at least 4 times, the problem and a relaxed version on feasibility testing (FT) are both NP-hard. The reduction is from $(3^+, 1, 2^-)$-SAT, where all 3-clauses (i.e., containing 3 literals) are monotone (i.e., containing only positive literals) and all 2-clauses contain only negative literals. We then show that when each letter appears in $S$ at most 3 times, the problem admits an $O(n)$ time algorithm. Finally, we consider the weighted version, where the weight of a block $x_i^{d_i}$ ($d_i \geq 2$) could be any positive function which might not grow with $d_i$. We give a non-trivial $O(n^2)$ time dynamic programming algorithm for this version, i.e., computing an LD-subsequence of $S$ whose weight is maximized.
1 Introduction

In biology, duplication is an important part of evolution. There are two kinds of duplications: arbitrary segmental duplications (i.e., select a segment and paste it somewhere else) and tandem duplications (which are of the form $X \to XX$, where $X$ is any segment of the input sequence). It is known that the former occur frequently in cancer genomes [4, 16, 20]. The latter are common under different scenarios; for example, it is known that the tandem duplication of the 3 nucleotides CAG is closely related to Huntington's disease [15]. In addition, tandem duplications can occur at the genome level (across different genes) for certain types of cancer [17]. In fact, as early as 1980, Szostak and Wu provided evidence that gene duplication is the main driving force behind evolution, and that the majority of duplications are tandem [21]. Consequently, it was not a surprise that in the first sequenced human genome around 3% of the genetic content is in the form of tandem repeats [13].

Independently, tandem duplications were also studied in copying systems [7], as well as in formal languages [2, 5, 22]. In 2004, Leupold et al. posed a fundamental question regarding tandem duplications: what is the complexity of computing the minimum tandem duplication distance between two sequences $A$ and $B$ (i.e., the minimum number of tandem duplications needed to convert $A$ to $B$)? In 2020, Lafond et al. [9] answered this open question by proving that this problem is NP-hard for an unbounded alphabet. In fact, Lafond et al. proved later that the problem is NP-hard even if $|\Sigma| \geq 4$, by encoding each letter in the unbounded alphabet proof with a square-free string over a new alphabet of size 4 (modified from Leech’s construction [14]), which covers the case most relevant to biology, i.e., when $\Sigma = \{A, C, G, T\}$ (for DNA sequences) or $\Sigma = \{A, C, G, U\}$ (for RNA sequences) [11].
Independently, Cicalese and Pilati showed that the problem is NP-hard for $|\Sigma| = 5$ using a different encoding method [3].

Motivated by the above applications (especially when some mutations occur after the duplications), some new problems related to duplications are proposed and studied in this paper. Given a sequence $S$ of length $n$, a letter-duplicated subsequence (LDS) of $S$ is a subsequence of $S$ in the form $x_1^{d_1} x_2^{d_2} \cdots x_k^{d_k}$ with $x_i \in \Sigma$, where $x_j \neq x_{j+1}$ and $d_i \geq 2$ for all $i$ in $[k]$ and $j$ in $[k-1]$. (Each $x_i^{d_i}$ is called an LD-block.) Naturally, the problem of computing a longest letter-duplicated subsequence (LLDS) of $S$ can be defined, and a simple linear time algorithm can be obtained. An example shows the idea behind this problem: let $B = AACACAGATGAT$; due to local mutations, insertions and deletions, it becomes $S = AACACGTCGAT$, but a longest letter-duplicated subsequence $X_1 = AACCGG$ or $X_2 = AACCTT$ would still give us the skeleton of the initial sequence $B$. (Recently, Lafond et al. [10] have considered a slightly more complex version, but the corresponding running times are significantly higher. In the conclusion section, we will discuss that perspective a little more.)

We remark that a similar problem called longest run subsequence was recently studied by Schrinner et al. [18, 19]. It differs from our problem in that each letter appears consecutively at most once in the solution as a run (which is a substring containing one or more repetitions of the same letter); the goal is the same, i.e., the length of such a subsequence is to be maximized. For this problem, additional results on FPT intractability can be found in [6] and additional approximation results can be found in [1].

In this paper, we study some important variants of the LLDS problem, focusing on the constrained and weighted cases.
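The LDS definition and the example above can be checked mechanically. The following Python sketch is illustrative only and is not taken from the paper: the function names `is_ld_subsequence` and `llds_length` are hypothetical, and the single-pass dynamic program is one possible way to compute the LLDS length in expected linear time (assuming constant-time dictionary operations), not necessarily the authors' algorithm. It relies on the observation that two adjacent blocks of the same letter can be merged into one longer block, so an LD-subsequence is exactly a subsequence whose maximal runs all have length at least 2.

```python
from itertools import groupby


def is_ld_subsequence(x: str, s: str) -> bool:
    """Check that x is a subsequence of s whose maximal runs all have length >= 2."""
    it = iter(s)
    if not all(ch in it for ch in x):  # standard subsequence test
        return False
    # Maximal runs play the role of LD-blocks; neighbouring runs differ by construction.
    return all(len(list(run)) >= 2 for _, run in groupby(x))


def llds_length(s: str) -> int:
    """Length of a longest letter-duplicated subsequence of s (assumed sketch)."""
    best = 0      # longest LD-subsequence found in the prefix read so far
    closed = {}   # closed[c]: longest LD-subsequence whose last block is c^d, d >= 2
    pending = {}  # pending[c]: longest LD-subsequence followed by one unmatched c
    for c in s:
        candidates = []
        if c in closed:
            candidates.append(closed[c] + 1)   # lengthen the last block
        if c in pending:
            candidates.append(pending[c] + 1)  # the unmatched c gets its partner
        # Tentatively append this c to the best solution of the prefix before it.
        # If that solution already ends in c, the block is simply lengthened later.
        pending[c] = max(pending.get(c, 0), best + 1)
        if candidates:
            closed[c] = max(candidates)
            best = max(best, closed[c])
    return best


s = "AACACGTCGAT"
print(llds_length(s))                   # 6
print(is_ld_subsequence("AACCGG", s))   # True
print(is_ld_subsequence("AACCTT", s))   # True
```

On the example sequence the sketch returns 6, matching the length of $X_1$ and $X_2$. It reports only the length; recovering an actual subsequence would require storing back-pointers alongside the three tables.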
The constraint is to demand that all letters in $\Sigma$ appear in a resulting LDS, which models the following question: in a genome with duplicated genes, how can one compute the maximum duplicated pattern while including all the genes? Then we have two problems: feasibility testing (FT for short, which decides whether an LDS of $S$ containing all letters in $\Sigma$ exists) and the problem of maximizing the length of a resulting LDS in which all letters in the alphabet appear, which we call LLDS+. It turns out that the status of these two problems changes quite a bit when $d$, the maximum number of times a letter can appear in $S$, varies. We denote the corresponding problems as FT($d$) and LLDS+($d$) respectively. Letting $|S| = n$, we summarize our main results in this paper as follows:

1. We show that when $d \geq 4$, both FT($d$) and (the decision version of) LLDS+($d$) are NP-complete, which implies that LLDS+($d$) does not have a polynomial-time approximation algorithm when $d \geq 4$ unless P = NP.
2. We show that when $d = 3$, both FT($d$) and LLDS+($d$) admit an $O(n)$ time algorithm, by exploiting a new property of the problem.
3. When a weight of an LD-block is any positive function (i.e., it does not even have t (...truncated)