JAL: an algebra for JSON query optimization

World Wide Web (2025) 28:28
https://doi.org/10.1007/s11280-025-01336-0

Anne Jasmijn Langerak¹ · Flavius Frasincar¹ · Jasmijn Klinkhamer¹

Received: 20 September 2024 / Revised: 20 February 2025 / Accepted: 26 February 2025
© The Author(s) 2025

Abstract

As databases become larger and less structured, the JavaScript Object Notation (JSON) data format has risen in usage compared to other data formats like XML. At the same time, while extracting data from these large datasets efficiently is of obvious importance, there has been far less research regarding the optimization of JSON queries than there has relating to the querying of XML data. Thus a JSON Data Model and JSON Algebra (JAL) are proposed, as well as a heuristic optimization algorithm, for the purpose of improving the efficiency of queries of JSON data. We implement the proposed algorithm and compare the efficiency gain that it provides in terms of both the theoretical and physical cost of executing queries. We find that the algorithm significantly reduces query costs compared to an unoptimized baseline. Additionally, we find that the efficiency gain is considerably larger when querying databases with many documents than those with relatively fewer documents.

Keywords: JSON · Query optimization · JSONiq · Databases

1 Introduction

As Big Data increasingly finds its way into the practices of companies across a widening spectrum of industries, the interest in efficient data processing has increased substantially [1]. Of particular importance is the extraction of desired information from large datasets, where the use of queries can be especially costly when not done efficiently. At present, there are two data transfer types in common usage, XML [2] and JavaScript Object Notation (JSON) [3]. They share many similarities, as JSON was proposed after XML.
However, in recent years JSON has become more widespread compared to XML, a situation heightened by the increase in usage of Representational State Transfer (REST) APIs. Such APIs depend on easy and fast data interchanges, and as JSON is a lightweight, easy to read, and easy to parse data format, it is particularly suited to these APIs.

Authors: Flavius Frasincar (corresponding author), Anne Jasmijn Langerak, Jasmijn Klinkhamer
¹ Department of Econometrics, Erasmus University Rotterdam, Burgemeester Oudlaan 50, Rotterdam 3062 PA, the Netherlands

Although JSON is becoming increasingly more popular than XML, current literature still lacks the same depth of research as is available for XML. XML is known for having powerful validation and schema features, with an established set of query languages and resources. The querying of JSON is comparatively less consolidated, and there is still room for efficiency gain regarding the execution of JSON queries.

The focus of this paper is the optimization of JSON queries through the algebraic manipulation of queries, i.e., the logical optimization of queries. To that end, a JSON data model, algebraic operators, equivalence rules, and a heuristic optimization algorithm are defined. Rewriting a query into algebraic operators allows for the utilization of equivalence rules following a heuristic algorithm, reducing the execution cost of the query. In order to reduce the number of computations required for query execution, appropriate equivalence expressions must be defined that allow for the removal of redundant computations in the query tree. Using these rules, a query algorithm may then be proposed to decrease the number of required computations for query execution, making the query more efficient. Similar objectives and strategies have been employed in relational database contexts [4], and XML databases [5].
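To make the idea of logical optimization concrete, the following minimal Python sketch shows how one equivalence rule can shrink intermediate results in a query tree. It is an illustration only: the operators `Scan`, `Select`, and `Product`, the helper names, the sample collections, and the cost measure (the summed sizes of all intermediate results) are assumptions made for this example and are not the JAL operators or the cost model of this paper. The rule used is the classical relational heuristic of pushing a selection below a product when the selection only refers to fields of one input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """Read every document of a named collection (hypothetical operator)."""
    name: str
    fields: frozenset  # keys that documents of this collection carry


@dataclass(frozen=True)
class Select:
    """Keep documents whose `field` equals `value` (hypothetical operator)."""
    field: str
    value: object
    child: object


@dataclass(frozen=True)
class Product:
    """Pair every left document with every right document (hypothetical operator)."""
    left: object
    right: object


def fields(plan):
    """Return the set of keys that the output documents of `plan` carry."""
    if isinstance(plan, Scan):
        return plan.fields
    if isinstance(plan, Select):
        return fields(plan.child)
    return fields(plan.left) | fields(plan.right)


def push_down(plan):
    """Apply Select(Product(l, r)) == Product(Select(l), r) wherever it is legal."""
    if isinstance(plan, Product):
        return Product(push_down(plan.left), push_down(plan.right))
    if isinstance(plan, Select):
        child = push_down(plan.child)
        if isinstance(child, Product):
            in_left = plan.field in fields(child.left)
            in_right = plan.field in fields(child.right)
            # The rule is only applied when the field belongs to exactly one side.
            if in_left and not in_right:
                pushed = push_down(Select(plan.field, plan.value, child.left))
                return Product(pushed, child.right)
            if in_right and not in_left:
                pushed = push_down(Select(plan.field, plan.value, child.right))
                return Product(child.left, pushed)
        return Select(plan.field, plan.value, child)
    return plan


def evaluate(plan, db):
    """Return (documents, cost); cost sums the sizes of all intermediate results."""
    if isinstance(plan, Scan):
        docs, cost = list(db[plan.name]), 0
    elif isinstance(plan, Select):
        child_docs, cost = evaluate(plan.child, db)
        docs = [d for d in child_docs if d.get(plan.field) == plan.value]
    else:
        left, left_cost = evaluate(plan.left, db)
        right, right_cost = evaluate(plan.right, db)
        docs = [{**l, **r} for l in left for r in right]
        cost = left_cost + right_cost
    return docs, cost + len(docs)


# Made-up sample data for the illustration.
db = {
    "students": [
        {"student": "Ann", "city": "Rotterdam"},
        {"student": "Bob", "city": "Delft"},
        {"student": "Cy", "city": "Rotterdam"},
    ],
    "courses": [{"course": "DB"}, {"course": "ML"}],
}

students = Scan("students", frozenset({"student", "city"}))
courses = Scan("courses", frozenset({"course"}))

naive = Select("city", "Rotterdam", Product(students, courses))
optimized = push_down(naive)

print(evaluate(naive, db)[1])      # 15
print(evaluate(optimized, db)[1])  # 11
```

Both plans return the same four documents, but the rewritten plan filters the three student documents before forming the product, so fewer intermediate documents are materialized. The gap widens as the collections grow, since the product size scales with the number of documents on both sides.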
Our contribution to the current state of the literature is extending the propositions made in both of these contexts, in particular those related to query optimization heuristics, into a JSON context. We use a broad variety of JSON databases, which allows us to make a fair comparison of both the theoretical (quantitative dimension) and the physical (running time) computational costs of a query, with and without the proposed optimization algorithm. We express our queries in JSONiq [6], which is a well-known and extensive JSON query language.

Based on our results, we conclude that our proposed algorithm, making use of our proposed data model, algebra, and equivalence rules, improves the execution of JSON queries in terms of both theoretical cost and physical cost. Especially when dealing with databases that contain a large number of documents, a substantial difference can be observed in costs between executing a query with and without our optimization algorithm.

The remainder of this paper is structured as follows. Section 2 covers a literature review of related research, discussing current research progress regarding JSON data models, JSON query languages (QL), and JSON algebras. In Section 3 we describe the five databases that we use to test our optimization algorithm. In Section 4 we describe our methodology: we propose a JSON data model, define JSON algebraic operators, define equivalence expressions, and derive an optimization algorithm. In Section 5 we present our main findings. Last, in Section 6, we draw our conclusions and make suggestions for future research.

2 Related work

This section begins with a discussion of JSON data models. Then, the current state of the literature regarding JSON query languages and JSON algebra is evaluated. Last, we discuss existing optimization approaches and their connections to our work.
2.1 JSON data model

Despite the prevalence of JSON in practical applications, there is no official standard for the modelling of JSON documents in the current state of the literature. In most JSON-related research, JSON documents are modelled by trees, i.e., JSON trees, where an important characteristic of the tree is that it is edge-labelled [7–9]. In [8], the structure of a simple JSON tree is described. This structure is especially useful when combined with the JSONPath query language, in which queries select the nodes of a tree where specific path conditions are met.

In [9], the authors describe a JSON document through an object description that contains path-value pairs instead of the key-value pairs that form the main structure of JSON. A path is defined as the sequence of keys that leads to a specific value separated by dots, hence a path-value pair is similar to a key-value pair. Converting a JSON document into an object description results in a document like the one presented in Figure 1. In the object description, nesting is no longer present due to the path-value pairs. For example, the first name of a student represented in object description in Figure 1 is given by the path name.first. As keys assure that a JSON document is determinis (...truncated)