Boundedness of Conjunctive Regular Path Queries

LIPICS - Leibniz International Proceedings in Informatics, Jul 2019

We study the boundedness problem for unions of conjunctive regular path queries with inverses (UC2RPQs). This is the problem of, given a UC2RPQ, checking whether it is equivalent to a union of conjunctive queries (UCQ). We show the problem to be ExpSpace-complete, thus coinciding with the complexity of containment for UC2RPQs. As a corollary, when a UC2RPQ is bounded, it is equivalent to a UCQ of at most triple-exponential size, and in fact we show that this bound is optimal. We also study better behaved classes of UC2RPQs, namely acyclic UC2RPQs of bounded thickness, and strongly connected UCRPQs, whose boundedness problem is, respectively, PSpace-complete and $\Pi_2^P$-complete. Most upper bounds exploit results on limitedness for distance automata, in particular extending the model with alternation and two-wayness, which may be of independent interest.

The full text can be downloaded as a PDF:

http://drops.dagstuhl.de/opus/volltexte/2019/10680/pdf/LIPIcs-ICALP-2019-104.pdf

ICALP. Category: Track B: Automata, Logic, Semantics, and Theory of Programming

Diego Figueira
CNRS, LaBRI, Talence, France

Pablo Barceló
Department of Computer Science, University of Chile, Santiago, Chile
IMFD, Santiago, Chile

Miguel Romero
Department of Computer Science, University of Oxford, Oxford, UK

2012 ACM Subject Classification: Theory of computation → Database query languages (principles); Theory of computation → Quantitative automata

Keywords and phrases: regular path queries; boundedness; limitedness; distance automata

Funding: Barceló is funded by the Millennium Inst. for Foundational Research on Data and Fondecyt 1170109, and Figueira by ANR project DELTA (grant ANR-16-CE40-0007) and ANR project QUID (grant ANR-18-CE40-0031). This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 714532). The paper reflects only the authors' views and not the views of the ERC or the European Commission.
The European Union is not liable for any use that may be made of the information contained therein.

Acknowledgements: We are grateful to Thomas Colcombet for helpful discussions and valuable ideas in relation to the results of Section 5.

1 Introduction

Boundedness is an important property of formulas in logics with fixed-point features. At the intuitive level, a formula $\varphi$ in any such logic is bounded if its fixed-point depth, i.e., the number of iterations that are needed to evaluate $\varphi$ on a structure $\mathfrak{A}$, is fixed (and thus it is independent of $\mathfrak{A}$). In databases and knowledge representation, boundedness is regarded as an interesting theoretical phenomenon with relevant practical implications [25, 8]. In fact, while several applications in these areas require the use of recursive features, actual real-world systems are either not designed or not optimized to cope with the computational demands that such features impose. Bounded formulas, in turn, can be reformulated in non-recursive logics, such as FO, or even as a union of conjunctive queries (UCQ) when $\varphi$ itself is positive. UCQs form the core of most systems for data management and ontological query answering, and, in addition, are the focus of advanced optimization techniques. It has also been experimentally verified in some contexts that recursive features encountered in practice are often used in a somewhat "harmless" way, and that many of such queries are in fact bounded [23]. Thus, checking if a recursive formula $\varphi$ is bounded, and building an equivalent non-recursive formula $\varphi'$ when the latter holds, are important optimization tasks.

The study of boundedness for Datalog programs, i.e., the least fixed-point extension of the class of UCQs, received a lot of attention during the late 80s and early 90s. Two seminal results established that checking boundedness is undecidable in general for Datalog [22], but becomes decidable for monadic Datalog, i.e., those programs in which each intensional predicate is monadic [19].

The past few years have seen a resurgence of interest in boundedness problems. This is due, in part, to the development of the theory of cost automata over trees (both finite and infinite) in a series of landmark results, in particular relating to its limitedness problem. In a few words, cost automata are generalizations of finite automata associating a cost from $\mathbb{N} \cup \{\infty\}$ to every input tree (instead of simply accepting or rejecting). The limitedness problem asks, given a cost automaton, whether there is a uniform bound on the cost over all (accepting) input trees. Some deep results establish that checking limitedness is decidable for well-behaved classes of cost automata over trees [18, 35, 36, 7]. Remarkably, for several logics of interest the boundedness problem can be reduced to the limitedness problem for cost automata in such well-behaved classes. Those reductions have enabled powerful decidability results for the boundedness problem. As an example, it has been shown in this way that boundedness is decidable for monadic second-order logic (MSO) over structures of bounded treewidth [11], which corresponds to an extension of Courcelle's Theorem, and also for the guarded negation fragment of least fixed-point logic (LFP), even in the presence of unguarded parameters [6]. Cost automata have also been used to study the complexity of boundedness for guarded Datalog programs [7, 3].

Graph databases are a prominent area of study within database theory, in which the use of recursive queries is crucial [2, 1]. A graph database is a finite edge-labeled directed graph. The most basic navigational querying mechanism for graph databases corresponds to the class of regular path queries (RPQs), which check whether two nodes of the graph are connected by a path whose label belongs to a given regular language. RPQs are often extended with the ability to traverse edges in both directions, giving rise to the class of two-way RPQs, or 2RPQs [15].
The core of the most popular recursive query languages for graph databases is defined by conjunctive 2RPQs, or C2RPQs, which are the closure of 2RPQs under conjunction and existential quantification [14]. We also consider unions of C2RPQs, or UC2RPQs. It can be shown that a UC2RPQ is bounded iff it is equivalent to some UCQ. In spite of the inherent recursive nature of UC2RPQs, their boundedness problem has not been studied in depth. Here we develop such a study by showing the following:

- The boundedness problem for UC2RPQs is ExpSpace-complete. The lower bound holds even for CRPQs. This implies that boundedness is not more difficult than containment for UC2RPQs, which was shown to be ExpSpace-complete in [14].
- From our upper bound construction it follows that if a UC2RPQ is bounded, then it is equivalent to a UCQ of triple-exponential size. We show that this bound is optimal.
- Finally, we obtain better complexity bounds for some subclasses of UC2RPQs; namely, for acyclic UC2RPQs of bounded thickness, in which case boundedness becomes PSpace-complete, and for strongly connected UCRPQs, for which it is $\Pi_2^P$-complete.

It is important to stress that UC2RPQs can be easily translated into guarded LFP with unguarded parameters, for which boundedness was shown to be decidable by applying sophisticated cost automata techniques as mentioned above. However, the complexity of the boundedness problem for such a logic is currently not well understood (it is at least 2ExpTime-hard [7]), and hence this translation does not yield, in principle, optimal complexity bounds for our problem.

To study boundedness for UC2RPQs, we develop instead techniques especially tailored to UC2RPQs. In fact, since the recursive structure of UC2RPQs is quite tame, their boundedness problem can be translated into the limitedness problem for a much simpler automata model than cost automata on trees; namely, distance automata on finite words.
Distance automata are nothing more than usual NFAs with two sorts of transitions: costly and non-costly. Such an automaton is limited if there is an integer $k \ge 1$ such that every word accepted by the NFA has an accepting run with at most $k$ costly transitions. A beautiful result in automata theory established the decidability of the limitedness problem for distance automata [24], which is actually in PSpace [29]. While this is a difficult result, by now we have quite transparent proofs of this fact (see, e.g., [26]). We exploit our translation to obtain tight complexity upper bounds for boundedness of UC2RPQs. Some of the proofs in the paper require extending the study of limitedness to alternating and two-way distance automata, while preserving the PSpace bound for the limitedness problem. We believe these results to be of independent interest.

Organization of the paper. Section 2 contains preliminaries. We present characterizations of boundedness for UC2RPQs in Section 3 and an application of those to pinpoint the complexity of Boundedness for RPQs in Section 4. Distance automata and results about them are given in Section 5. We analyze the complexity of Boundedness for general UC2RPQs in Section 6 and present some classes of UC2RPQs with better complexity of Boundedness in Section 7. We finish with a discussion in Section 8.

2 Preliminaries

We assume familiarity with non-deterministic finite automata (NFA), two-way NFA (2NFA), and alternating finite automata (AFA) over finite words. We often blur the distinction between an NFA $\mathcal{A}$ and the language $L(\mathcal{A})$ it defines; similarly for regular expressions.

Graph databases and conjunctive regular path queries. A graph database over a finite alphabet $\mathbb{A}$ is a finite edge-labelled graph $G = (V, E)$ over $\mathbb{A}$, where $V$ is a finite set of vertices and $E \subseteq V \times \mathbb{A} \times V$ is the set of labelled edges. We write $u \xrightarrow{a} v$ to denote an edge $(u, a, v) \in E$. We define the alphabet $\mathbb{A}^{\pm} := \mathbb{A} \mathbin{\dot\cup} \mathbb{A}^{-1}$ that extends $\mathbb{A}$ with the set $\mathbb{A}^{-1} := \{a^{-1} \mid a \in \mathbb{A}\}$ of "inverses" of symbols in $\mathbb{A}$.

An oriented path from $u$ to $v$ in a graph database $G = (V, E)$ over alphabet $\mathbb{A}$ is a pair $\pi = (\sigma, \ell)$ where $\sigma$ and $\ell$ are (possibly empty) sequences $\sigma = (v_0, a_1, v_1), (v_1, a_2, v_2), \dots, (v_{k-1}, a_k, v_k) \in V \times \mathbb{A} \times V$, and $\ell = \ell_1, \dots, \ell_k \in \{-1, 1\}$, for $k \ge 0$, such that $u = v_0$, $v = v_k$, and for each $1 \le i \le k$, we have that $\ell_i = 1$ implies $(v_{i-1}, a_i, v_i) \in E$; and $\ell_i = -1$ implies $(v_i, a_i, v_{i-1}) \in E$. The label of $\pi$ is the word $b_1 \cdots b_k \in (\mathbb{A}^{\pm})^*$, where $b_i = a_i$ if $\ell_i = 1$; otherwise $b_i = a_i^{-1}$. When $k = 0$ the label of $\pi$ is the empty word $\varepsilon$. If $\ell_i = 1$ for every $1 \le i \le k$, we say that $\pi$ is a directed path. Note that in this case, the label of $\pi$ belongs to $\mathbb{A}^*$.

A regular path query (RPQ) over $\mathbb{A}$ is a regular language $L \subseteq \mathbb{A}^*$, which we assume to be given as an NFA. The evaluation of $L$ on a graph database $G = (V, E)$ over $\mathbb{A}$, written $L(G)$, is the set of pairs $(u, v) \in V \times V$ such that there is a directed path from $u$ to $v$ in $G$ whose label belongs to $L$. 2RPQs extend RPQs with the ability to traverse edges in both directions. Formally, a 2RPQ $L$ over $\mathbb{A}$ is simply an RPQ over $\mathbb{A}^{\pm}$. The evaluation $L(G)$ of $L$ over a graph database $G = (V, E)$ over $\mathbb{A}$ is the set of pairs $(u, v) \in V \times V$ such that there is an oriented path from $u$ to $v$ in $G$ whose label belongs to $L$.

Conjunctive 2RPQs (C2RPQs) are obtained by taking the closure of 2RPQs under conjunction and existential quantification, i.e., a C2RPQ over $\mathbb{A}$ is an expression

$$\gamma := \exists \bar z \, \big( (x_1 \xrightarrow{L_1} y_1) \wedge \dots \wedge (x_m \xrightarrow{L_m} y_m) \big),$$

where each $L_i$ is a 2RPQ over $\mathbb{A}$ and $\bar z$ is a tuple of variables among those in $\{x_1, y_1, \dots, x_m, y_m\}$. We say that $\gamma$ is a CRPQ if each $L_i$ is an RPQ. If $\bar x = (x_1, \dots, x_n)$ is the tuple of free variables of $\gamma$, i.e., those that are not existentially quantified in $\bar z$, then the evaluation $\gamma(G)$ of the C2RPQ $\gamma$ over a graph database $G$ is the set of all tuples $h(\bar x) = (h(x_1), \dots, h(x_n))$, where $h$ ranges over all mappings $h : \{x_1, y_1, \dots, x_m, y_m\} \to V$ such that $(h(x_i), h(y_i)) \in L_i(G)$ for each $1 \le i \le m$.
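
The evaluation of a single 2RPQ defined above can be illustrated with a small sketch. It assumes a search over pairs (vertex, NFA state), which is one standard way to evaluate such queries and is not taken from the paper. The function names, the dictionary layout of the NFA, and the convention of writing the inverse of a letter `a` as the string `"a-"` are all assumptions made for this illustration.

```python
from collections import deque


def inverse(symbol):
    """Return the inverse letter, using the assumed convention 'a' <-> 'a-'."""
    return symbol[:-1] if symbol.endswith("-") else symbol + "-"


def eval_2rpq(edges, nfa, vertices):
    """Evaluate a 2RPQ (an NFA over letters and their inverses) on a graph database.

    edges: set of (u, a, v) triples.
    nfa: dict with keys 'init', 'final' and 'delta', where 'delta' maps
         (state, letter) to a set of successor states.
    Returns the set of pairs (u, v) joined by an oriented path whose label
    is accepted by the NFA.
    """
    # Each edge can be traversed forwards (reading a) or backwards (reading a-).
    steps = {}
    for (u, a, v) in edges:
        steps.setdefault(u, []).append((a, v))
        steps.setdefault(v, []).append((inverse(a), u))

    answers = set()
    for source in vertices:
        start = (source, nfa["init"])
        seen = {start}
        queue = deque([start])
        while queue:
            node, state = queue.popleft()
            if state in nfa["final"]:
                answers.add((source, node))
            for letter, target in steps.get(node, []):
                for next_state in nfa["delta"].get((state, letter), ()):
                    pair = (target, next_state)
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    return answers


# Hypothetical example: follow an a-edge forwards, then a b-edge backwards.
graph_vertices = {1, 2, 3}
graph_edges = {(1, "a", 2), (3, "b", 2)}
query = {"init": 0, "final": {2}, "delta": {(0, "a"): {1}, (1, "b-"): {2}}}
print(eval_2rpq(graph_edges, query, graph_vertices))
```

A C2RPQ could then be evaluated, under the same assumptions, by joining the binary relations returned for each atom; that step is omitted here.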

A union of C2RPQs (UC2RPQ) is an expression of the form $\Gamma := \bigvee_{1 \le i \le n} \gamma_i$, where the $\gamma_i$'s are C2RPQs, all of which have exactly the same free variables. The evaluation $\Gamma(G)$ of $\Gamma$ over a graph database $G$ is $\bigcup_{1 \le i \le n} \gamma_i(G)$. We often write $\Gamma(\bar x)$ to denote that $\bar x$ is the tuple of free variables of $\Gamma$. A UC2RPQ $\Gamma$ is Boolean if it contains no free variables. Given UC2RPQs $\Gamma$ and $\Gamma'$, we write $\Gamma \subseteq \Gamma'$ if $\Gamma(G) \subseteq \Gamma'(G)$ for each graph database $G$. Hence, $\Gamma$ and $\Gamma'$ are equivalent if $\Gamma \subseteq \Gamma'$ and $\Gamma' \subseteq \Gamma$, i.e., $\Gamma(G) = \Gamma'(G)$ for every $G$.

Boundedness of UC2RPQs. CRPQs, and even UC2RPQs, can easily be expressed in Datalog, the least fixed-point extension of the class of unions of conjunctive queries (UCQs). Hence, we can directly define the boundedness of a UC2RPQ in terms of the boundedness of its equivalent Datalog program, which is a well-studied problem [25]. The latter, however, coincides with being equivalent to some UCQ [31]. In the setting of graph databases, a conjunctive query (CQ) over $\mathbb{A}$ is simply a CRPQ over $\mathbb{A}$ of the form $\exists \bar z \, \bigwedge_{1 \le i \le m} (x_i \xrightarrow{a_i} y_i)$, where the $a_i$'s range over $\mathbb{A} \cup \{\varepsilon\}$. Notice that atoms of the form $x \xrightarrow{\varepsilon} y$ correspond to equality atoms $x = y$. Analogously, one can define unions of CQs (UCQs). Note that, modulo equality atoms, a CQ over $\mathbb{A}$ can be seen as a graph database over $\mathbb{A}$. Hence, we shall slightly abuse notation and, in the setting of CQs, use notions defined for graph databases (such as oriented paths).

A UC2RPQ $\Gamma$ is bounded if it is equivalent to some UCQ $\Phi$. In this article we study the complexity of the problem Boundedness, which takes as input a UC2RPQ $\Gamma$ and asks whether $\Gamma$ is bounded.

Example 1. Consider the Boolean UCRPQ $\Gamma = \gamma_1 \vee \gamma_2$ over the alphabet $\mathbb{A} = \{a, b, c, d\}$ such that $\gamma_1 = \exists x, y \, (x \xrightarrow{L_b} y \wedge x \xrightarrow{L_{b,d}} y)$ and $\gamma_2 = \exists x, y \, (x \xrightarrow{L_d} y \wedge x \xrightarrow{L_{b,d}} y)$, where $L_b := a^+ b^+ c$, $L_d := a d^+ c^+$, and $L_{b,d} := a^+ (b + d) c^+$. For $e \in \mathbb{A}$, recall that $e^+$ denotes the language $e(e^*)$. As we shall explain in Example 4, we have that $\gamma_1$ and $\gamma_2$ are unbounded. However, $\Gamma$
is bounded, and in particular, it is equivalent to the UCQ $\Phi = \varphi_1 \vee \varphi_2$, where $\varphi_1$ and $\varphi_2$ correspond to $\exists x, y \, (x \xrightarrow{abc} y)$ and $\exists x, y \, (x \xrightarrow{adc} y)$, respectively.

3 Characterizations of Boundedness for UC2RPQs

In this section we provide two simple characterizations of when a UC2RPQ is bounded that will be useful to analyze the complexity of Boundedness.

Let $\varphi(\bar x)$ and $\varphi'(\bar x)$ be CQs over $\mathbb{A}$ with variable sets $V$ and $V'$, respectively. Let $=_{\varphi}$ and $=_{\varphi'}$ be the binary relations induced on $V$ and $V'$ by the equality atoms of $\varphi$ and $\varphi'$, respectively, and $=^*_{\varphi}$ and $=^*_{\varphi'}$ be their reflexive-transitive closure. A homomorphism from $\varphi$ to $\varphi'$ is a mapping $h : V \to V'$ such that: (i) $x =^*_{\varphi} y$ implies $h(x) =^*_{\varphi'} h(y)$; (ii) $h(\bar x) = \bar x$; and (iii) for each atom $x \xrightarrow{a} y$ in $\varphi$ with $a \in \mathbb{A}$, there is an atom $x' \xrightarrow{a} y'$ in $\varphi'$ such that $h(x) =^*_{\varphi'} x'$ and $h(y) =^*_{\varphi'} y'$. We write $\varphi \to \varphi'$ if such a homomorphism exists. It is known that $\varphi \to \varphi'$ iff $\varphi' \subseteq \varphi$ [16].

An expansion of a C2RPQ $\gamma(\bar x)$ over $\mathbb{A}$ is a CQ $\lambda(\bar x)$ over $\mathbb{A}$ with minimal number of variables and atoms such that (i) $\lambda$ contains each variable of $\gamma$, (ii) for each atom $A = x \xrightarrow{L} y$ of $\gamma$, there is an oriented path $\pi_A$ in $\lambda$ from $x$ to $y$ with label $w_A \in L$ whose intermediate variables (i.e., those not in $\{x, y\}$) are distinct from one another, and (iii) intermediate variables of different oriented paths $\pi_A$ and $\pi_{A'}$ are disjoint. Note that the free variables of $\gamma$ and $\lambda$ coincide. Intuitively, the expansion $\lambda$ is obtained from $\gamma$ by choosing for each atom $A = x \xrightarrow{L} y$ a word $w_A \in L$, and "expanding" $x \xrightarrow{L} y$ into the "fresh oriented path" $\pi_A$ from $x$ to $y$ with label $w_A$. When $w_A = \varepsilon$ then $\lambda$ contains the equality atom $x = y$. An expansion of a UC2RPQ $\Gamma$ is an expansion of some C2RPQ in $\Gamma$.

Observe that a (U)C2RPQ is always equivalent to the (potentially infinite) UCQ given by its set of expansions. Even more, it is equivalent to the UCQ defined by its minimal expansions, as introduced below. If $\lambda$
is an expansion of a UC2RPQ $\Gamma$, we define the size of $\lambda$, denoted by $\|\lambda\|$, to be the number of (non-equality) atoms in $\lambda$. We say that $\lambda$ is minimal if there is no expansion $\lambda'$ such that $\lambda' \to \lambda$ and $\|\lambda'\| < \|\lambda\|$. Intuitively, an expansion is minimal if its answers cannot be covered by a smaller expansion. We can then establish the following.

Lemma 2. Every UC2RPQ $\Gamma$ is equivalent to the (potentially infinite) UCQ given by its set of minimal expansions.

We can now provide our basic characterizations of boundedness.

Proposition 3. The following conditions are equivalent for each UC2RPQ $\Gamma$.
1. $\Gamma$ is bounded.
2. There is $k \ge 1$ such that for every expansion $\lambda$ of $\Gamma$ there exists an expansion $\lambda'$ of $\Gamma$ with $\|\lambda'\| \le k$ such that $\lambda \subseteq \lambda'$ (i.e., such that $\lambda' \to \lambda$).
3. $\Gamma$ has finitely many minimal expansions.

Example 4. Consider the Boolean UCRPQ $\Gamma = \gamma_1 \vee \gamma_2$ over $\mathbb{A} = \{a, b, c, d\}$ from Example 1. To see that $\gamma_1$ is unbounded (the case of $\gamma_2$ is similar) we can apply Proposition 3. Indeed, the expansions of $\gamma_1$ corresponding to $\{\exists x, y \, (x \xrightarrow{a b^n c} y \wedge x \xrightarrow{adc} y) : n \ge 1\}$ are all minimal. On the other hand, $\Gamma$ is bounded as its minimal expansions correspond to $\exists x, y \, (x \xrightarrow{abc} y \wedge x \xrightarrow{abc} y)$ and $\exists x, y \, (x \xrightarrow{adc} y \wedge x \xrightarrow{adc} y)$.

4 Boundedness for Existentially Quantified RPQs

As a first application of Proposition 3, we study Boundedness for CRPQs consisting of a single RPQ; that is, RPQs or existentially quantified RPQs. Let $v, w$ be words over $\mathbb{A}$. Recall that $v$ is a prefix [resp. suffix and factor] of $w$ if $w \in v \cdot \mathbb{A}^*$ [resp. $w \in \mathbb{A}^* \cdot v$ and $w \in \mathbb{A}^* \cdot v \cdot \mathbb{A}^*$]. If in addition we have $v \ne w$, then we say that $v$ is a proper prefix [resp. suffix and factor] of $w$. For a language $L \subseteq \mathbb{A}^*$, we define its prefix-free sub-language $L_{\mathrm{pf}}$ to be the set of words $w \in L$ such that $w$ has no proper prefix in $L$. Similarly, we define $L_{\mathrm{sf}}$ and $L_{\mathrm{ff}}$ with respect to the suffix and factor relation. We have the following:

Proposition 5. The following statements hold.
1. An RPQ $L$ is bounded iff $L$ is finite.
2. A CRPQ $\exists y \, (x \xrightarrow{L} y)$ [resp. $\exists x \, (x \xrightarrow{L} y)$] with $x \ne y$ is bounded iff $L_{\mathrm{pf}}$ [resp. $L_{\mathrm{sf}}$] is finite.
3. A Boolean CRPQ $\exists x, y \, (x \xrightarrow{L} y)$ with $x \ne y$ is bounded iff $L_{\mathrm{ff}}$ is finite.

Theorem 6. The problem of, given an NFA accepting the language $L$, checking whether $L_{\mathrm{pf}}$ is finite is PSpace-complete. The same holds if we replace $L_{\mathrm{pf}}$ by $L_{\mathrm{sf}}$ or $L_{\mathrm{ff}}$.

Proof. We focus on upper bounds; the lower bounds are in the appendix. Given an NFA $\mathcal{A}$ accepting the language $L$, we can construct an NFA $\mathcal{B}$ of polynomial size in $\mathcal{A}$ that accepts precisely those words that have a proper prefix in $L$. By complementing and intersecting with $\mathcal{A}$, we obtain an NFA $\mathcal{B}'$ of exponential size in $\mathcal{A}$ that accepts the language $L_{\mathrm{pf}}$. Hence, we only need to check whether the language accepted by $\mathcal{B}'$ is finite, which can be done on-the-fly in NL w.r.t. $\mathcal{B}'$, and hence in PSpace. The other two cases are analogous.

By applying Theorem 6 and Proposition 5, we can now pinpoint the complexity of Boundedness for CRPQs with a single RPQ.

Corollary 7. The following statements hold.
1. Boundedness for RPQs is NL-complete.
2. Boundedness for CRPQs of the form $\exists y \, (x \xrightarrow{L} y)$, with $x \ne y$, is PSpace-complete. The same holds for CRPQs $\exists x \, (x \xrightarrow{L} y)$ and Boolean CRPQs $\exists x, y \, (x \xrightarrow{L} y)$, where $x \ne y$.

It is not clear, though, how usual automata techniques, such as the ones applied in the proof of Theorem 6, can be used to solve Boundedness for more complex CRPQs. To solve this problem we develop an approach based on distance automata, as introduced next. Our approach also handles inverses and unions, thus dealing with arbitrary UC2RPQs.

5 Distance Automata

Distance automata [24] (equivalent to weighted automata over the $(\min, +)$-semiring [21], min-automata [12], or $\{\varepsilon, \mathrm{ic}\}$-B-automata [17]) are an extension of finite automata which associate to each word in the language a natural number or "cost". They can be represented as non-deterministic finite automata with two sorts of transitions: costly and non-costly.
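
As an informal illustration of this model, the following sketch computes the cost of a single word, reading the cost as the number of costly transitions on the cheapest accepting run (the usual $(\min, +)$ reading, made precise in the formal definition). The function name, the representation of transitions as 4-tuples `(p, a, c, r)`, and the choice of returning `None` for rejected words are assumptions of this sketch.

```python
def word_cost(word, init, finals, transitions):
    """Minimum number of costly transitions over all accepting runs on `word`.

    transitions: iterable of (p, a, c, r), meaning state p reads letter a,
                 pays c (0 or 1) and moves to state r.
    Returns None when the word is not accepted (a convention of this sketch).
    """
    # best[q] = cheapest cost of reaching q after reading the prefix seen so far
    best = {init: 0}
    for letter in word:
        reached = {}
        for (p, a, c, r) in transitions:
            if a == letter and p in best:
                candidate = best[p] + c
                if r not in reached or candidate < reached[r]:
                    reached[r] = candidate
        best = reached
    accepting = [cost for state, cost in best.items() if state in finals]
    return min(accepting) if accepting else None


# Hypothetical automaton: every 'a' is costly, every 'b' is free.
count_a = [("q", "a", 1, "q"), ("q", "b", 0, "q")]
print(word_cost("aab", "q", {"q"}, count_a))

# Hypothetical nondeterministic automaton: the run may move to the free state 't'.
escape = [("s", "a", 1, "s"), ("s", "a", 0, "t"), ("t", "a", 0, "t")]
print(word_cost("aaa", "s", {"s", "t"}, escape))
```

In the first automaton the cost grows with the number of occurrences of `a`, so no uniform bound exists; in the second, nondeterminism lets every word avoid the costly loop.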
For a given distance automaton, the cost of a run on a word is the number of costly transitions, and the cost of a word $w \in \mathbb{A}^*$ is the minimum cost of an accepting run on $w$. We will use this automaton model to encode boundedness as the problem of whether there is a uniform bound on the cost of words, known as the limitedness problem.

Formally, a distance automaton (henceforth DA) is a tuple $\mathcal{A} = (\mathbb{A}, Q, q_0, F, \delta)$, where $\mathbb{A}$ is a finite alphabet, $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $F \subseteq Q$ is the set of final states and $\delta \subseteq Q \times \mathbb{A} \times \{0, 1\} \times Q$ is the transition relation. A word $w \in \mathbb{A}^*$ is accepted by $\mathcal{A}$ if there is an accepting run of $\mathcal{A}$ on $w$, i.e., a (possibly empty) sequence of transitions $\rho = (p_1, a_1, c_1, r_1) \cdots (p_n, a_n, c_n, r_n) \in \delta^*$ with the usual properties: (1) if $\rho = \varepsilon$ then $q_0 \in F$ and $w = \varepsilon$, (2) $p_1 = q_0$ and $r_n \in F$, (3) for every $1 \le i < n$ we have $r_i = p_{i+1}$, and (4) $w = a_1 \cdots a_n$. The cost of the run $\rho$ is $\mathrm{cost}(\rho) = c_1 + \dots + c_n$ (or 0 if $\rho = \varepsilon$); and the cost $\mathrm{cost}_{\mathcal{A}}(w)$ of a word $w$ accepted by $\mathcal{A}$ is the minimum cost of an accepting run of $\mathcal{A}$ on $w$. For convenience, we assume the cost of words not accepted by $\mathcal{A}$ to be 0. The limitedness problem for DA is defined as follows: given a DA $\mathcal{A}$, determine whether $\sup_{w \in \mathbb{A}^*} \mathrm{cost}_{\mathcal{A}}(w) < \infty$. This problem is known to be PSpace-complete.

Theorem 8 ([28, 29]). The following statements hold:
1. The limitedness problem for DA is PSpace-complete.
2. If a DA with $n$ states is limited, then $\sup_{w \in \mathbb{A}^*} \mathrm{cost}_{\mathcal{A}}(w) \le 2^{O(n^3)}$.

We use two extensions of DA: alternating and two-way. Two-way DA is defined as for NFA, extending the cost function accordingly. The cost of a word is still the minimum over the cost of all (potentially infinitely many) runs. Alternating DA is defined as usual by having two sorts of states: universal and existential.
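
For plain DA, Theorem 8 (item 2) means that limitedness can be phrased as a threshold question: is there an accepted word whose cost exceeds a given bound? The sketch below answers that question for an explicit `bound` by exploring vectors of per-state costs capped just above the bound. It is a naive exhaustive search written for illustration, not the polynomial-space procedure, and since the constant hidden in $2^{O(n^3)}$ is not given here, the bound is left as a parameter. The function name and the tuple encoding of transitions are assumptions.

```python
from collections import deque


def exists_word_with_cost_above(bound, alphabet, init, finals, transitions):
    """Return True iff some accepted word has cost strictly greater than `bound`.

    transitions: iterable of (p, a, c, r) with c in {0, 1}.
    Costs larger than `bound` are all stored as `bound + 1`, so the set of
    cost vectors explored is finite.
    """
    cap = bound + 1
    start = frozenset({(init, 0)})
    seen = {start}
    queue = deque([start])
    while queue:
        vector = queue.popleft()
        costs = dict(vector)  # state -> cheapest (capped) cost to reach it
        final_costs = [c for q, c in costs.items() if q in finals]
        # The word read so far is accepted, and every accepting run is too costly.
        if final_costs and min(final_costs) > bound:
            return True
        for letter in alphabet:
            nxt = {}
            for (p, a, c, r) in transitions:
                if a == letter and p in costs:
                    candidate = min(cap, costs[p] + c)
                    if r not in nxt or candidate < nxt[r]:
                        nxt[r] = candidate
            successor = frozenset(nxt.items())
            if successor and successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False


# Hypothetical unlimited automaton: each 'a' costs 1 and there is no way around it.
unlimited = [("q", "a", 1, "q")]
print(exists_word_with_cost_above(5, ["a"], "q", {"q"}, unlimited))

# Hypothetical limited automaton: only the first 'a' is costly.
limited = [("s", "a", 1, "t"), ("t", "a", 0, "t")]
print(exists_word_with_cost_above(1, ["a"], "s", {"t"}, limited))
```

Capping is sound here because taking the minimum over runs commutes with capping at `bound + 1`. The number of vectors explored can be exponential in the number of states, which is why this is only a sketch of the idea.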
Existential states can be seen as computing the minimum among the cost of all possible continuations of the run, and universal states as computing the maximum (or supremum if the automaton is also two-way). As we will see, these extensions preserve the above PSpace upper bound for the limitedness problem.

Formally, an alternating two-way DA with epsilon transitions (A2DA$_\varepsilon$) over $\mathbb{A}$ is a tuple $\mathcal{A} = (\mathbb{A}, Q_\exists, Q_\forall, q_0, F, \delta)$ where $q_0 \in Q_\exists$, $F \subseteq Q_\exists$ and

$$\delta \subseteq (Q_\exists \cup Q_\forall) \times (\mathbb{A}^{\pm} \cup \{\varepsilon\}) \times \{\mathrm{end}, \overline{\mathrm{end}}\} \times \{0, 1\} \times (Q_\exists \cup Q_\forall);$$

where $\mathrm{end}$ indicates that after reading the letter we arrive at the end of the word (i.e., either the leftmost or the rightmost end) and $\overline{\mathrm{end}}$ indicates that we do not. When the automaton $\mathcal{A}$ is two-way, it is convenient to think of its head as being between the letter positions of the word, so an end-flagged transition can be applied only if it moves the head to be right before the first letter of the word, or right after the last one.

For any given word $w \in \mathbb{A}^*$, consider the edge-labelled graph $G_{\mathcal{A},w} = (V, E)$ over $\delta$, where $V = Q \times \{0, \dots, |w|\}$, with $Q = Q_\exists \cup Q_\forall$, and $E \subseteq V \times \delta \times V$ consists of all edges $(q, i) \xrightarrow{(q, a, e, c, p)} (p, j)$ such that $e = \mathrm{end}$ iff $j = 0$ or $j = |w|$, and either (a) $i < |w|$, $a = w[i + 1]$, and $j = i + 1$; (b) $i > 0$, $a = (w[i])^{-1}$, and $j = i - 1$; or (c) $a = \varepsilon$ and $j = i$.

An accepting run of $\mathcal{A}$ on $w$ from $(q, i) \in Q \times \{0, \dots, |w|\}$ is a finite (possibly empty) edge-labelled directed rooted tree $t$ over $\delta$ (that is, a tree-shaped finite edge-labelled graph over $\delta$ with edges directed in the root-to-leaf sense) and a labelling $h$ from the nodes of $t$ to the nodes of $G_{\mathcal{A},w}$, such that if $t$ is empty then $q \in F$, and otherwise $h$ maps the root of $t$ to $(q, i)$, every leaf of $t$ to $F \times \{0, \dots, |w|\}$, and for every node $x$ of $t$:
- if $(x, \tau, y)$ is a (labelled) edge in $t$ for some $y$, then $(h(x), \tau, h(y))$ is an edge in $G_{\mathcal{A},w}$;
- if $h(x) \in Q_\forall \times \{0, \dots, |w|\}$, then for every edge $(h(x), \tau, c)$ in $G_{\mathcal{A},w}$, there is an edge $(x, \tau, y)$ in $t$ so that $h(y) = c$;
- if $h(x) \in Q_\exists \times \{0, \dots, |w|\}$, then $x$ has at most one child.

Each branch of $t$ with label $(q_1, a_1, e_1, c_1, p_1), \dots, (q_n, a_n, e_n, c_n, p_n)$ has an associated cost of $c_1 + \dots + c_n$; and the cost associated with $t$ is the maximum among the costs of its branches, or 0 if $t$ is empty. The cost $\mathrm{cost}_{\mathcal{A}}(w, q, i)$ is the minimum cost of an accepting run on $w$ from $(q, i)$, or 0 if none exists; $\mathrm{cost}_{\mathcal{A}}(w)$ is defined as $\mathrm{cost}_{\mathcal{A}}(w, q_0, 0)$.

An A2DA$_\varepsilon$ with $\delta \subseteq Q \times (\mathbb{A} \cup \{\varepsilon\}) \times \{\mathrm{end}, \overline{\mathrm{end}}\} \times \{0, 1\} \times Q$ is an alternating DA with $\varepsilon$-transitions (ADA$_\varepsilon$). An A2DA$_\varepsilon$ with $Q_\forall = \emptyset$ is a two-way DA with $\varepsilon$-transitions (2DA$_\varepsilon$). An A2DA$_\varepsilon$ with both the aforementioned conditions is (equivalent to) a DA with $\varepsilon$-transitions (DA$_\varepsilon$). Notice that in the last two cases, accepting runs can be represented as words from $\delta^*$ rather than trees. By A2DA (resp., ADA, 2DA, DA) we denote an A2DA$_\varepsilon$ (resp., ADA$_\varepsilon$, 2DA$_\varepsilon$, DA$_\varepsilon$) with no $\varepsilon$-transitions. Note that DA as just defined is in every sense equivalent to the distance automata model we have defined at the beginning of this section; this is why we overload the same "DA" name.

We first observe that 2DA can be transformed into DA while preserving both the language and limitedness by adapting the standard "crossing sequence" construction for translating 2NFA into NFA [34]. This fact will be useful for proving the ExpSpace upper bound for Boundedness of general UC2RPQs in Section 6.

Proposition 9. There is an exponential time procedure which for every 2DA $\mathcal{A}$ over $\mathbb{A}$ produces a DA $\mathcal{B}$ over $\mathbb{A}$ such that the languages accepted by $\mathcal{A}$ and $\mathcal{B}$ are the same, and $\mathrm{cost}_{\mathcal{B}}(w) \le \mathrm{cost}_{\mathcal{A}}(w) \le f(\mathrm{cost}_{\mathcal{B}}(w))$ for every $w \in \mathbb{A}^*$, where $f$ is a polynomial function that depends on the number of states of $\mathcal{A}$.

Recall that the universality problem for NFAs is known to be PSpace-complete [27]; and that this bound actually extends to two-way and even alternating automata. We show that, likewise, the limitedness problem remains in PSpace for A2DA$_\varepsilon$.
This result will be useful to show in Section 7 that Boundedness for the class of acyclic UC2RPQs of bounded thickness is in PSpace.

Theorem 10. The limitedness problem for A2DA$_\varepsilon$ is PSpace-complete.

The novelty of this result is the PSpace upper bound. In fact, decidability follows from known results, and in particular [7, Theorem 14] claims ExpTime-membership in the more challenging setup of infinite trees. However, this is obtained via an involved construction spanning several papers. The proof of Theorem 10, instead, is obtained by the composition of the following reductions:

$$\text{lim. A2DA}_\varepsilon \xrightarrow{(1)} \text{lim. A2DA} \xrightarrow{(2)} \text{lim. 2DA} \xrightarrow{(3)} \text{lim. ADA}_\varepsilon \xrightarrow{(4)} \text{lim. ADA} \xrightarrow{(5)} \text{lim. DA}.$$

Reductions (1), (3) and (4) are in polynomial time, while reductions (2) and (5), which are basically the same, are in exponential time. Specifically, reductions (2) and (5) preserve the statespace but the size of the alphabet grows exponentially in the number of states and linearly in the size of the source alphabet. However, the alphabet and transition set resulting from these reductions can be succinctly described: letters are encoded in polynomial space, and checking for membership in the transition set is polynomial time computable. In summary, the composition (1)+(2)+(3)+(4)+(5) yields a DA with the following characteristics: (i) it has a polynomial number of states $Q$; (ii) it runs on an exponential alphabet $\mathbb{A}$ (and every letter is encoded in polynomial space); and (iii) one can check in polynomial time whether a tuple $t \in Q \times \mathbb{A} \times \{\mathrm{end}, \overline{\mathrm{end}}\} \times \{0, 1\} \times Q$ is in its transition relation.
This, coupled with Theorem 8, item (2) (which offers a bound depending only on the number of states), provides a polynomial space algorithm for the limitedness of A2DA$_\varepsilon$: We can non-deterministically check the existence of a word with cost greater than the single exponential bound $N$ using only polynomial space, by guessing one letter at a time and keeping the set of reachable states together with the associated costs, where each cost is encoded in binary using polynomial space if it is smaller than $N$, or with a "$\infty$" flag otherwise. The algorithm accepts if at least one final state is reached and the costs of all reachable final states are marked $\infty$. Since NPSpace = PSpace (Savitch's Theorem), Theorem 10 follows.

We now provide a brief description of the reductions used in the proof of Theorem 10.

(1) From A2DA$_\varepsilon$ to A2DA. This is a trivial reduction obtained by simulating $\varepsilon$-transitions by reading $a \cdot a^{-1}$ for some $a \in \mathbb{A}$.

(2) From A2DA to 2DA. Given an A2DA $\mathcal{A} = (\mathbb{A}, Q_\exists, Q_\forall, q_0, F, \delta)$, we build a 2DA $\mathcal{B}$ over a larger alphabet $\mathbb{B}$, where we trade alternation for extra alphabet letters. The alphabet $\mathbb{B}$ consists of triples $(f^{\leftarrow}, a, f^{\rightarrow})$, where $a \in \mathbb{A}$ and $f^{\leftarrow}, f^{\rightarrow} : Q_\forall \to \delta$. The idea is that $f^{\leftarrow}, f^{\rightarrow}$ are "choice functions" for the alternation: whenever we are to the left (resp., right) of a position of the word labelled $(f^{\leftarrow}, a, f^{\rightarrow})$ in state $q \in Q_\forall$, instead of exploring all transitions departing from $q$ and taking the maximum cost over all such runs (this is what alternation does in $\mathcal{A}$), $\mathcal{B}$ chooses to just take the transition $f^{\leftarrow}(q)$ (resp., $f^{\rightarrow}(q)$). Note that $\mathbb{B}$ is exponential in the number of states but not in the size of $\mathbb{A}$. In this way, we build a 2DA $\mathcal{B}$ having the same set of states as $\mathcal{A}$ but with a transition function which is essentially deterministic on the states of $Q_\forall$. In the end we obtain that for every $w \in \mathbb{B}^*$, $\mathrm{cost}_{\mathcal{B}}(w) \le \mathrm{cost}_{\mathcal{A}}(w_{\mathbb{A}})$; and for every $w \in \mathbb{A}^*$ there is $\tilde w \in \mathbb{B}^*$ so that $\tilde w_{\mathbb{A}} = w$ and $\mathrm{cost}_{\mathcal{A}}(w) = \mathrm{cost}_{\mathcal{B}}(\tilde w)$, where $w_{\mathbb{A}}$ and $\tilde w_{\mathbb{A}}$ denote the projections onto the alphabet $\mathbb{A}$.
This implies that the limitedness problem is preserved.

(3) From 2DA to ADA$_\varepsilon$. We show a polynomial-time translation from 2DA to ADA$_\varepsilon$ which preserves limitedness. In the case of finite automata, there are language-preserving reductions from 2NFA to AFA with a quadratic blowup in the statespace [9, 32]. However, these translations, when applied blindly to reduce from 2DA to ADA$_\varepsilon$, preserve neither the cost semantics nor the limitedness of languages. On the other hand, [10] shows an involved construction that results in a reduction from 2DA to ADA$_\varepsilon$ on infinite trees, which preserves limitedness but is not polynomial in the number of states. We show a translation from 2DA to ADA$_\varepsilon$ which serves our purpose: it preserves limitedness and it is polynomial time computable. The translation is close to the language-preserving reduction from 2NFA to AFA of [32], upgraded to take into account the cost of different alternation branches, somewhat in the same spirit as the history summaries from [10].

(4) From ADA$_\varepsilon$ to ADA. This is a straightforward polynomial time reduction which preserves limitedness but (as opposed to (1)) does not preserve the language: we need to add an extra letter to the alphabet in order to make the reduction work in polynomial time.

(5) From ADA to DA. This is exactly the same reduction as (2), noticing that the alphabet will still be single exponential in the original A2DA$_\varepsilon$.

6 Complexity of Boundedness for UC2RPQs

Here we show that Boundedness for UC2RPQs is ExpSpace-complete. We do so by applying distance automata results presented in the previous section on top of the semantic characterizations presented in Section 3. The lower bound applies even for CRPQs. We further show that there is a triple exponential tight bound for the size of the equivalent UCQ of a UC2RPQ (and even CRPQ), whenever this exists. This is summarized in the following theorem. If $\Gamma$ is a UC2RPQ, we write $\|\Gamma\|$ for the length of an arbitrary reasonable encoding of $\Gamma$;
in particular, encodings in which regular languages are described through NFA or regular expressions.

Theorem 11. The following statements hold.
1. Boundedness for UC2RPQs is ExpSpace-complete. The problem remains ExpSpace-hard even for Boolean CRPQs.
2. If a UC2RPQ $\Gamma$ is bounded, there is a UCQ $\Phi$ that is equivalent to $\Gamma$ and such that $\Phi$ has at most triple-exponentially many CQs, each one of which is at most of double exponential size with respect to $\|\Gamma\|$.
3. There is a family $\{\Gamma_n\}_{n \ge 1}$ of Boolean CRPQs such that for each $n \ge 1$ it is the case that: (1) $\|\Gamma_n\| = O(n)$, (2) $\Gamma_n$ is bounded, and (3) every UCQ that is equivalent to $\Gamma_n$ has at least triple-exponentially many CQs with respect to $n$.

6.1 Upper bounds

Our upper bound proof builds on top of techniques developed by Calvanese et al. [14] for studying the containment problem for UC2RPQs: given UC2RPQs $\Gamma, \Gamma'$, is it the case that $\Gamma \subseteq \Gamma'$? It is shown in [14] that from $\Gamma, \Gamma'$ it is possible to construct exponentially sized NFAs $\mathcal{A}_{\Gamma,\Gamma'}$ and $\mathcal{A}'_{\Gamma,\Gamma'}$, such that $\Gamma \not\subseteq \Gamma'$ iff there is a word in $\mathcal{A}_{\Gamma,\Gamma'} \cap \mathcal{A}'_{\Gamma,\Gamma'}$. It is a well-known result that the latter is solvable in NL in the combined size of $(\mathcal{A}_{\Gamma,\Gamma'}, \mathcal{A}'_{\Gamma,\Gamma'})$, i.e., in ExpSpace. We modify this construction to study the boundedness of a given UC2RPQ $\Gamma$. In particular, we construct from $\Gamma$ in exponential time a DA $\mathcal{D}_\Gamma$ such that $\Gamma$ is bounded iff $\mathcal{D}_\Gamma$ is limited. The result then follows from Theorem 8, which establishes that limitedness for $\mathcal{D}_\Gamma$ can be solved in space polynomial in the number of its states, and thus in ExpSpace.

Proposition 12. There is a single exponential time procedure that takes as input a UC2RPQ $\Gamma$ and constructs a DA $\mathcal{D}_\Gamma$ such that $\Gamma$ is bounded iff $\mathcal{D}_\Gamma$ is limited.

Proof. Similarly as done in [14], the DA $\mathcal{D}_\Gamma$ will run over encodings of expansions of the UC2RPQ $\Gamma$, i.e., words over the alphabet $\mathbb{A}_1 := \mathbb{A}^{\pm} \cup \mathcal{V} \cup \{\$\}$, where $\mathbb{A}$ is the alphabet of $\Gamma$, $\mathcal{V}$ is the set of variables of $\Gamma$, and $\$$ is a fresh symbol. If $\gamma = \exists \bar{z} \bigwedge_{1 \le i \le m} (x_i \xrightarrow{L_i} y_i)$ is a C2RPQ in $\Gamma$ and $\lambda$ is the expansion of $\gamma$
obtained by expanding each $x_i \xrightarrow{L_i} y_i$ into an oriented path $\pi_i$ from $x_i$ to $y_i$ with label $w_i \in L_i$, then we encode $\lambda$ as the word $w_\lambda = \$ x_1 w_1 y_1 \$ x_2 w_2 y_2 \$ \cdots \$ x_m w_m y_m \$ \in \mathbb{A}_1^*$. Note how the subword $x_i w_i y_i$ encodes the oriented path $\pi_i$. Every position $j \in \{1, \dots, |w_\lambda|\}$ with $w_\lambda[j] \neq \$$ represents a variable in $\lambda$: either $x_i$ or $y_i$ if $w_\lambda[j] = x_i$ or $w_\lambda[j] = y_i$, respectively; or the $(\ell+1)$-th variable in the oriented path $\pi_i$ if $w_\lambda[j]$ is the $\ell$-th symbol in the subword $w_i$. Hence different positions in $w_\lambda$ could represent the same variable in $\lambda$; e.g., in the encoding $\$xabcy\$$, the 5th position, containing a "$c$", and the 6th position, containing a "$y$", represent the same variable, namely, the last vertex $y$ of the oriented path. It is then easy to build, in polynomial time, an NFA $\mathcal{A}_1$ over $\mathbb{A}_1$ recognizing the language of all such encodings of expansions of $\Gamma$. Our automaton $\mathcal{D}_\Gamma$ is the product of $\mathcal{A}_1$ and the DA $\mathcal{C}_\Gamma$ defined below. In particular, $\mathcal{D}_\Gamma$ is limited iff $\mathcal{C}_\Gamma$ is limited over words of the form $w_\lambda$, for $\lambda$ an expansion of $\Gamma$.

Fix a disjunct $\gamma$ of $\Gamma$. As in [14], we consider words over the alphabet $\mathbb{A}_2 := \mathbb{A}_1 \times (2^{\mathcal{V}} \cup \{\#\})$ of the form $(\ell_1, \alpha_1) \cdots (\ell_n, \alpha_n)$, such that $w_\lambda = \ell_1 \cdots \ell_n$, for some expansion $\lambda$ of $\Gamma$, and the $\alpha_i$'s are valid $\gamma$-annotations, i.e., (1) $\alpha_i = \#$ if $\ell_i = \$$, (2) $\alpha_1, \dots, \alpha_n \in 2^{\mathcal{V}}$ induce a partition of the variable set $\mathcal{V}_\gamma$ of $\gamma$, and (3) for each free variable $x \in \mathcal{V}_\gamma$ there is some $(\ell_i, \alpha_i)$ such that $\ell_i = x$ and $x \in \alpha_i$. It is easy to construct an NFA $\mathcal{B}_1^\gamma$ of exponential size that, given $w = (\ell_1, \alpha_1) \cdots (\ell_n, \alpha_n)$ with $w_\lambda = \ell_1 \cdots \ell_n$, checks if the $\alpha_i$'s are valid $\gamma$-annotations. Note that if the latter holds, then the annotations encode a mapping $h_w$ from $\mathcal{V}_\gamma$ to the variables of $\lambda$ such that $h_w(\bar{x}) = \bar{x}$, where $\bar{x}$ are the free variables of $\gamma$.

Now, given $w = (\ell_1, \alpha_1)(\ell_2, \alpha_2) \cdots (\ell_n, \alpha_n)$ with $w_\lambda = \ell_1 \cdots \ell_n$ and the $\alpha_i$'s being valid $\gamma$-annotations, it is shown in [14] that one can construct in polynomial time a 2NFA $\mathcal{B}_2^\gamma$ that checks the existence of an expansion $\lambda'$ of $\gamma$
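and a homomorphism $h$ from $\lambda'$ to $\lambda$ consistent with $h_w$. For each atom $x \xrightarrow{L} y$ of $\gamma$, the automaton $\mathcal{B}_2^\gamma$ guesses an oriented path $\pi$ in $\lambda$ from $h_w(x)$ to $h_w(y)$ with label $w' \in L$, directly over the encoding $w_\lambda$, starting at a position $j_x$ and ending at a position $j_y$ in $\{0, \dots, n\}$ (recall that the head moves in $\{0, \dots, n\}$) with $j_x, j_y > 0$, $w[j_x] = (\ell, \alpha)$, $w[j_y] = (\ell', \alpha')$, $x \in \alpha$ and $y \in \alpha'$. Note that we have two types of transitions: (1) transitions that consume $a \in \mathbb{A}^{\pm}$ and actually guess an atom of $\lambda$, and (2) transitions to "jump" from position $j$ to $j'$ in $\{0, \dots, n\}$ representing equivalent variables of $\lambda$. The latter means that $j, j' > 0$ and either $w_\lambda[j]$ and $w_\lambda[j']$ represent exactly the same variable of $\lambda$, or $w_\lambda[j]$ and $w_\lambda[j']$ represent variables $z, z'$ of $\lambda$ such that $z =^*_\lambda z'$, where $=^*_\lambda$ is the reflexive-transitive closure of the relation induced by the equality atoms in $\lambda$.

Let $\mathcal{D}_2^\gamma$ be the 2DA obtained from the 2NFA $\mathcal{B}_2^\gamma$ by setting to 0 and 1 the cost of transitions of type (2) and (1), respectively. Hence, for a word $w$ such that the projection of $w$ to $\mathbb{A}_1$ is $w_\lambda$, and the one to $(2^{\mathcal{V}} \cup \{\#\})$ is a valid $\gamma$-annotation, we have that $\mathrm{cost}_{\mathcal{D}_2^\gamma}(w)$ is precisely the minimum size of an expansion $\lambda'$ that can be mapped to $\lambda$ via a homomorphism compatible with $h_w$. By Proposition 9, we can construct, in time exponential in the size of $\mathcal{D}_2^\gamma$, a DA $\mathcal{C}_2^\gamma$ accepting the same language as $\mathcal{D}_2^\gamma$ and having an exponential number of states, so that for every word $w'$, we have $\mathrm{cost}_{\mathcal{C}_2^\gamma}(w') \le \mathrm{cost}_{\mathcal{D}_2^\gamma}(w') \le f(\mathrm{cost}_{\mathcal{C}_2^\gamma}(w'))$ for some polynomial function $f$. Let $\tilde{\mathcal{C}}_\gamma$ be the result of taking the product of $\mathcal{B}_1^\gamma$ and $\mathcal{C}_2^\gamma$ and then projecting over the alphabet $\mathbb{A}_1$. For every expansion $\lambda$ of $\Gamma$, if $\lambda'$ is a minimal size expansion of $\gamma$ such that $\lambda \subseteq \lambda'$, then we obtain that $\mathrm{cost}_{\tilde{\mathcal{C}}_\gamma}(w_\lambda) \le \|\lambda'\| \le f(\mathrm{cost}_{\tilde{\mathcal{C}}_\gamma}(w_\lambda))$.

We define our desired $\mathcal{C}_\Gamma$ to be the union of $\tilde{\mathcal{C}}_\gamma$ over all $\gamma$ in $\Gamma$. We have that for every expansion $\lambda$, if $\lambda_{\min}$ is a minimal size expansion of $\Gamma$ such that $\lambda \subseteq \lambda_{\min}$, then $\mathrm{cost}_{\mathcal{C}_\Gamma}(w_\lambda) \le \|\lambda_{\min}\| \le f(\mathrm{cost}_{\mathcal{C}_\Gamma}(w_\lambda))$. By Proposition 3, item (2), $\Gamma$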
is bounded iff $\|\lambda_{\min}\|$ is bounded over all $\lambda$. The latter condition holds iff $\mathcal{C}_\Gamma$ is limited over words $w_\lambda$, for all expansions $\lambda$. By definition, the latter is equivalent to $\mathcal{D}_\Gamma$ being limited. Summing up, we obtain that $\Gamma$ is bounded iff $\mathcal{D}_\Gamma$ is limited, as required. Note that the whole construction can be done in exponential time.

As a corollary to Proposition 12 and Theorem 8 we obtain the desired upper bound for part (1) of Theorem 11.

Corollary 13. Boundedness for UC2RPQs is in ExpSpace.

Size of equivalent UCQs. Here we prove part (2) of Theorem 11. Since $\Gamma$ is bounded we have from Proposition 12 that $\mathcal{D}_\Gamma$ is limited. Then, from Theorem 8 we obtain that the maximum cost of $\mathcal{D}_\Gamma$ over a word is at most $N$, where $N$ is exponential in the number of states of $\mathcal{D}_\Gamma$, and thus double exponential in $\|\Gamma\|$ by construction. Therefore, for every expansion $\lambda$ of $\Gamma$, if $\lambda_{\min}$ is a minimal size expansion of $\Gamma$ such that $\lambda \subseteq \lambda_{\min}$, then $\|\lambda_{\min}\| \le f(N)$, where $f$ is the polynomial function of the proof of Proposition 12. In particular, all minimal expansions of $\Gamma$ are of size at most $f(N)$. By Lemma 2, the UC2RPQ $\Gamma$ is equivalent to the union of all its minimal expansions. The number of such minimal expansions is thus at most exponential in $f(N)$, and hence triple exponential in $\|\Gamma\|$.

6.2 Lower bounds

We reduce from the $2^n$-tiling problem, that is, a tiling problem restricted to $2^n$ many columns, which is ExpSpace-complete (see, e.g., [14]). We show that for every $2^n$-tiling problem $T$ there is a CRPQ $\gamma$, computable in polynomial time from $T$, whose number of minimal expansions is essentially the number of solutions to $T$ in the following sense.

Lemma 14. For every $2^n$-tiling problem $T$ with $m$ solutions there is a Boolean CRPQ $\gamma$, computable in polynomial time from $T$, such that the number of minimal expansions of $\gamma$ is $O((g(|T|) + m)^{n+1})$ and $\Omega(m)$, for some double exponential function $g$. Further, $\gamma$ is a Boolean CRPQ of the form $\exists x, y \bigwedge_{0 \le i \le n} (x \xrightarrow{L_i} y)$, where each $L_i$ is given as a regular expression.
As a corollary, this yields an ExpSpace lower bound for the boundedness problem (part (1) of Theorem 11), as well as a triple exponential lower bound for the size of the UCQ equivalent to any bounded CRPQ (part (3) of Theorem 11), since one can produce $2^n$-tiling problems having triple-exponentially many solutions.

7 Better-behaved Classes of UC2RPQs

Here we present two restrictions of UC2RPQs that exhibit a better behavior in terms of the complexity of Boundedness than the general case, namely, acyclic UC2RPQs of bounded thickness and strongly connected UCRPQs. The improved bounds are PSpace and $\Pi_2^P$, respectively, which turn out to be optimal.

Acyclic UC2RPQs of Bounded Thickness. For any two distinct variables $x, y$ of a C2RPQ $\gamma$, we denote by $\mathrm{Atoms}_\gamma(x, y)$ the set of atoms in $\gamma$ of the form $x \xrightarrow{L} y$ or $y \xrightarrow{L} x$. The thickness of a C2RPQ $\gamma$ is the maximum cardinality of a set of the form $\mathrm{Atoms}_\gamma(x, y)$, for $x, y$ variables of $\gamma$ with $x \neq y$. The thickness of a UC2RPQ $\Gamma$ is the maximum thickness over all the C2RPQs in $\Gamma$. The underlying undirected graph of $\gamma$ has as vertex set the set of variables of $\gamma$ and contains an edge $\{x, y\}$ iff $x \neq y$ and $\mathrm{Atoms}_\gamma(x, y) \neq \emptyset$. A C2RPQ $\gamma$ is acyclic if its underlying undirected graph is an acyclic graph (i.e., a forest). A UC2RPQ $\Gamma$ is acyclic if each C2RPQ in $\Gamma$ is. We show next that Boundedness for acyclic UC2RPQs of bounded thickness is PSpace-complete. These classes of UC2RPQs have been previously studied in the literature [4, 5]. In particular, it follows from [5, Theorem 4.2] that the containment problem for the acyclic UC2RPQs of bounded thickness is PSpace-complete, and hence Theorem 15 below shows that Boundedness is not more costly than containment for these classes.

Theorem 15. Fix $k \ge 1$. The problem Boundedness is PSpace-complete for acyclic UC2RPQs of thickness at most $k$.

Proof (sketch). The lower bound follows directly from PSpace-hardness of Boundedness for RPQs (see Corollary 7).
For the PSpace upper bound, we follow a similar strategy as in the case of arbitrary UC2RPQs (Section 6.1), i.e., we reduce boundedness of $\Gamma$ to DA limitedness. The main difference is that, since $\Gamma$ is acyclic, we can exploit the power of alternation and construct an A2DA$^\varepsilon$ $\mathcal{B}$ (instead of a 2DA, as in the proof of Proposition 12), such that $\Gamma$ is bounded iff $\mathcal{B}$ is limited. The constant upper bound on the thickness of $\Gamma$ implies that $\mathcal{B}$ is actually of polynomial size. The result then follows, as limitedness of an A2DA$^\varepsilon$ can be decided in PSpace by virtue of Theorem 10.

Both conditions in Theorem 15, i.e., acyclicity and bounded thickness, are necessary. Indeed, it follows from Lemma 14 that Boundedness is ExpSpace-hard even for:
- Boolean acyclic CRPQs.
- Boolean CRPQs of thickness one, whose underlying undirected graph is of treewidth two. Recall that the treewidth is a measure of how much a graph resembles a tree (cf. [20]); acyclic graphs are precisely the graphs of treewidth one.

Indeed, the CRPQs of the form $\exists x, y \bigwedge_i (x \xrightarrow{L_i} y)$ used in Lemma 14 are Boolean and acyclic (but have unbounded thickness). Replacing each $(x \xrightarrow{L_i} y)$ with $(x \xrightarrow{\varepsilon} z_i) \wedge (z_i \xrightarrow{L_i} y)$ yields an equivalent CRPQ of thickness one whose underlying undirected graph has treewidth two.

Strongly Connected UCRPQs. We conclude this section with an even better behaved class of CRPQs in terms of Boundedness. Unlike the previous case, the definition of this class depends on the underlying directed graph of a CRPQ $\gamma$. This contains a directed edge from variable $x$ to $y$ iff there is an atom in $\gamma$ of the form $x \xrightarrow{L} y$. A CRPQ $\gamma$ is strongly connected if its underlying directed graph is strongly connected, i.e., every pair of variables is connected by some directed path. A UCRPQ $\Gamma$ is strongly connected if every CRPQ in $\Gamma$ is. We can then establish the following.

Theorem 16. Boundedness is $\Pi_2^P$-complete for strongly connected UCRPQs.
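The structural notions of this section (thickness, acyclicity of the underlying undirected graph, strong connectedness of the underlying directed graph) can be illustrated with a minimal sketch. The encoding is an assumption made only for this illustration and is not taken from the paper: a CRPQ is a list of atoms `(x, label, y)`, labels are opaque placeholder strings, and every variable is assumed to occur in some atom. The function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical encoding (not from the paper): a CRPQ is a list of atoms
# (x, label, y), each standing for an atom from x to y over language `label`.


def thickness(atoms):
    """Largest number of atoms linking the same two distinct variables,
    counted in either direction."""
    count = defaultdict(int)
    for x, _label, y in atoms:
        if x != y:
            count[frozenset((x, y))] += 1
    return max(count.values(), default=0)


def is_acyclic(atoms):
    """True iff the underlying undirected graph (parallel atoms merged,
    self-loops ignored) is a forest; checked with union-find."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    edges = {frozenset((x, y)) for x, _label, y in atoms if x != y}
    for edge in edges:
        x, y = tuple(edge)
        root_x, root_y = find(x), find(y)
        if root_x == root_y:  # this edge would close a cycle
            return False
        parent[root_x] = root_y
    return True


def is_strongly_connected(atoms):
    """True iff every variable reaches every other one along directed atoms."""
    succ, pred = defaultdict(set), defaultdict(set)
    variables = set()
    for x, _label, y in atoms:
        succ[x].add(y)
        pred[y].add(x)
        variables.update((x, y))

    def reachable(start, adjacency):
        seen, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for u in adjacency[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        return seen

    if not variables:
        return True
    root = next(iter(variables))
    # Strongly connected iff the root reaches everything forwards and backwards.
    return (reachable(root, succ) == variables
            and reachable(root, pred) == variables)


# Shape of the queries of Lemma 14 with three atoms between x and y ...
parallel = [("x", f"L{i}", "y") for i in range(3)]
# ... and the variant with one extra variable z_i per atom.
split = []
for i in range(3):
    split += [("x", "eps", f"z{i}"), (f"z{i}", f"L{i}", "y")]

print(thickness(parallel), is_acyclic(parallel))  # thick but acyclic
print(thickness(split), is_acyclic(split))        # thin but cyclic
```

On these two inputs the sketch reflects the trade-off discussed above: the parallel form is acyclic with thickness equal to its number of atoms, while the split form has thickness one but contains an undirected cycle. Neither is strongly connected, since no atom leads back from `y` to `x`.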
8 Discussion and Future Work

The main conclusion of our work is that techniques previously used in the study of containment of UC2RPQs can be naturally leveraged to pinpoint the complexity of Boundedness by using DA instead of NFA. This, however, requires extending results on limitedness to alternating and two-way DA. For all the classes of UC2RPQs studied in the paper we show in fact that the complexity of Boundedness coincides with that of the containment problem. We leave open the exact size of UCQ rewritings for the classes of acyclic UC2RPQs of bounded thickness and strongly connected UCRPQs that are bounded.

The most natural next step is to study Boundedness for the class of regular queries (RQs), which are the closure of UC2RPQs under binary transitive closure. RQs are one of the most powerful recursive languages for which containment is decidable in elementary time. In fact, containment of RQs has been proved to be 2ExpSpace-complete by applying sophisticated techniques based on NFA [33]. We will study if it is possible to settle the complexity of Boundedness for RQs with the help of DA techniques.

Another interesting future line of work is the study of Boundedness for UC2RPQs based on the restricted classes of regular expressions often found in practical applications [13]. As has been shown lately, the complexity of some query evaluation problems is alleviated under this restriction [30], and it would be nice to see if the same holds for the boundedness problem. This would be good news for the applicability of boundedness techniques in practice. In fact, it would be an indication that the high complexity lower bounds obtained in this paper are mostly witnessed by complicated interactions between regular expressions not commonly arising in practice.

References

Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoč. Foundations of Modern Query Languages for Graph Databases. ACM Computing Surveys, 50(5):68:1-68:40, 2017.
Pablo Barceló. Querying graph databases. In ACM Symposium on Principles of Database Systems (PODS), pages 175-188, 2013.
Pablo Barceló, Gerald Berger, Carsten Lutz, and Andreas Pieris. First-Order Rewritability of Frontier-Guarded Ontology-Mediated Queries. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1707-1713, 2018.
Pablo Barceló, Miguel Romero, and Moshe Y. Vardi. Does Query Evaluation Tractability Help Query Containment? In ACM Symposium on Principles of Database Systems (PODS), pages 188-199, 2014.
SIAM Journal on Computing, 45(4):1339-1376, 2016.
Michael Benedikt, Pierre Bourhis, and Michael Vanden Boom. A Step Up in Expressiveness of Decidable Fixpoint Logics. In Annual IEEE Symposium on Logic in Computer Science (LICS), pages 817-826, 2016.
Michael Benedikt, Balder ten Cate, Thomas Colcombet, and Michael Vanden Boom. The Complexity of Boundedness for Guarded Logics. In Annual IEEE Symposium on Logic in Computer Science (LICS), pages 293-304. IEEE Computer Society Press, 2015. doi:10.1109/LICS.2015.36.
Meghyn Bienvenu, Peter Hansen, Carsten Lutz, and Frank Wolter. First Order-Rewritability and Containment of Conjunctive Queries in Horn Description Logics. In International Joint Conference on Artificial Intelligence (IJCAI), pages 965-971, 2016.
Jean-Camille Birget. State-complexity of finite-state devices, state compressibility and incompressibility. Mathematical Systems Theory, 26(3):237-269, 1993.
Hing Leung. Limitedness Theorem on Finite Automata with Distance Functions: An Algebraic Proof. Theoretical Computer Science (TCS), 81(1):137-145, 1991. doi:10.1016/0304-3975(91)90321-R.
Hing Leung and Viktor Podolskiy. The limitedness problem on distance automata: Hashiguchi's method revisited. Theoretical Computer Science (TCS), 310(1-3):147-158, 2004. doi:10.1016/S0304-3975(03)00377-3.
Wim Martens and Tina Trautner. Evaluation and Enumeration Problems for Regular Path Queries. In International Conference on Database Theory (ICDT), pages 19:1-19:21, 2018.
Jeffrey F. Naughton. Data Independent Recursion in Deductive Databases. Journal of Computer and System Sciences (JCSS), 38(2):259-289, 1989.
Nir Piterman and Moshe Y. Vardi. From bidirectionality to alternation. Theoretical Computer Science (TCS), 295:295-321, 2003. doi:10.1016/S0304-3975(02)00410-3.
Theoretical Computer Science (TCS), 61(1):31-83, 2017.
John C. Shepherdson. The reduction of two-way automata to one-way automata. IBM Journal of Research and Development, 3(2):198-200, 1959.
Michael Vanden Boom. Weak Cost Monadic Logic over Infinite Trees. In International Symposium on Mathematical Foundations of Computer Science (MFCS), pages 580-591, 2011.
Michael Vanden Boom. Weak cost automata over infinite trees. PhD thesis, University of Oxford, UK, 2012.



Pablo Barceló, Diego Figueira, Miguel Romero. Boundedness of Conjunctive Regular Path Queries, LIPIcs - Leibniz International Proceedings in Informatics, 2019, 104:1-104:15, DOI: 10.4230/LIPIcs.ICALP.2019.104