JAL: an algebra for JSON query optimization
World Wide Web
(2025) 28:28
https://doi.org/10.1007/s11280-025-01336-0
Anne Jasmijn Langerak1 · Flavius Frasincar1 · Jasmijn Klinkhamer1
Received: 20 September 2024 / Revised: 20 February 2025 / Accepted: 26 February 2025
© The Author(s) 2025
Abstract
As databases become larger and less structured, the JavaScript Object Notation (JSON) data
format has risen in usage compared to other data formats like XML. At the same time, while
extracting data from these large datasets efficiently is of obvious importance, there has been
far less research on the optimization of JSON queries than on the
querying of XML data. Thus a JSON Data Model and JSON Algebra (JAL) are proposed,
as well as a heuristic optimization algorithm, for the purpose of improving the efficiency
of queries of JSON data. We implement the proposed algorithm and compare the efficiency
gain that it provides in terms of both the theoretical and physical cost of executing queries.
We find that the algorithm significantly reduces query costs compared to an unoptimized
baseline. Additionally, we find that the efficiency gain is considerably larger when querying
databases with many documents than when querying those with relatively few documents.
Keywords JSON · Query optimization · JSONiq · Databases
1 Introduction
As Big Data increasingly finds its way into the practices of companies across a widening
spectrum of industries, the interest in efficient data processing has increased substantially
[1]. Of particular importance is the extraction of desired information from large datasets,
where the use of queries can be especially costly when not done efficiently.
At present, two data interchange formats are in common use: XML [2] and JavaScript
Object Notation (JSON) [3]. They share many similarities, as JSON was proposed after
XML. However, in recent years JSON has become more widespread compared to XML,
a situation heightened by the increase in usage of Representational State Transfer (REST)
APIs. Such APIs depend on easy and fast data interchanges, and as JSON is a lightweight,
easy to read, and easy to parse data format, it is particularly suited to these APIs. Although
JSON is becoming increasingly popular relative to XML, the literature on JSON still lacks the
depth of research that is available for XML. XML is known for having powerful validation and
schema features, with an established set of query languages and resources. The querying of
JSON is comparatively less consolidated, and there is still room for efficiency gains in
the execution of JSON queries.
1 Department of Econometrics, Erasmus University Rotterdam, Burgemeester Oudlaan 50, Rotterdam 3062 PA, the Netherlands
The focus of this paper is the optimization of JSON queries through the algebraic
manipulation of queries, i.e., the logical optimization of queries. To that end, a JSON data
model, algebraic operators, equivalence rules, and a heuristic optimization algorithm are
defined. Rewriting a query into algebraic operators allows for the utilization of equivalence
rules following a heuristic algorithm, reducing the execution cost of the query.
In order to reduce the number of computations required for query execution, appropriate
equivalence expressions must be defined that allow for the removal of redundant computations
in the query tree. Using these rules, a query algorithm may then be proposed to decrease the
number of required computations for query execution, making the query more efficient.
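To make the idea of rewriting with equivalence rules concrete, the following minimal Python sketch applies one classic heuristic from relational query optimization, pushing a selection below a projection, to a list of JSON-like documents. It is an illustration under stated assumptions, not the algebra or algorithm defined in this paper: the operator classes `Scan`, `Select`, and `Project`, the function `push_selection_down`, the cost measure (one unit per document an operator touches), and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical logical operators; the names are illustrative only.
@dataclass
class Scan:
    docs: list

@dataclass
class Select:
    pred: Callable[[dict], bool]
    reads: frozenset  # keys the predicate inspects
    child: Any

@dataclass
class Project:
    keys: frozenset  # keys kept in each document
    child: Any

def push_selection_down(plan):
    """Rewrite Select(Project(x)) into Project(Select(x)).

    The rewrite preserves the result when the predicate only reads keys
    that the projection keeps.
    """
    if isinstance(plan, Scan):
        return plan
    child = push_selection_down(plan.child)
    if isinstance(plan, Select):
        if isinstance(child, Project) and plan.reads <= child.keys:
            inner = Select(plan.pred, plan.reads, child.child)
            return Project(child.keys, push_selection_down(inner))
        return Select(plan.pred, plan.reads, child)
    return Project(plan.keys, child)

def execute(plan, cost):
    """Evaluate a plan; cost[0] counts documents touched by operators."""
    if isinstance(plan, Scan):
        return list(plan.docs)
    rows = execute(plan.child, cost)
    cost[0] += len(rows)
    if isinstance(plan, Select):
        return [d for d in rows if plan.pred(d)]
    return [{k: d[k] for k in plan.keys if k in d} for d in rows]

# Made-up sample data: 100 documents, 20 of which satisfy the predicate.
docs = [{"name": f"s{i}", "age": 18 + i % 10, "city": "R"} for i in range(100)]
naive_plan = Select(lambda d: d.get("age", 0) >= 26, frozenset({"age"}),
                    Project(frozenset({"name", "age"}), Scan(docs)))
optimized_plan = push_selection_down(naive_plan)

naive_cost, optimized_cost = [0], [0]
naive_result = execute(naive_plan, naive_cost)
optimized_result = execute(optimized_plan, optimized_cost)
print(naive_cost[0], optimized_cost[0], naive_result == optimized_result)
```

Both plans return the same documents, but in the rewritten plan the projection only processes the documents that survive the selection, so fewer operator applications are needed. The cost counter here is a deliberately crude stand-in for a theoretical cost measure.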
Similar objectives and strategies have been employed in relational database contexts [4],
and XML databases [5]. Our contribution to the current state of the literature is extending the
propositions made in both of these contexts, in particular those related to query optimization
heuristics, into a JSON context. A broad variety of JSON databases is used, which allows
us to make a fair comparison of both the theoretical (quantitative dimension) and the
physical (running time) computational costs of a query, with and without the proposed
optimization algorithm. We express our queries in JSONiq [6], which is a well-known and
extensive JSON query language.
Based on our results, we conclude that our proposed algorithm, making use of our proposed
data model, algebra, and equivalence rules, improves the execution of JSON queries both
in terms of theoretical cost as well as physical cost. Especially when dealing with databases
that contain a large number of documents, a substantial difference can be observed in costs
between executing a query with and without our optimization algorithm.
The remainder of this paper is structured as follows. Section 2 covers a literature review
of related research, discussing current research progress regarding JSON data models, JSON
query languages (QL), and JSON algebras. In Section 3 we describe the five databases that
we use to test our optimization algorithm. In Section 4 we describe our methodology: we
propose a JSON data model, define JSON algebraic operators, define equivalence expressions,
and derive an optimization algorithm. In Section 5 we present our main findings. Last, in
Section 6, we draw our conclusions and make suggestions for future research.
2 Related work
This section begins with a discussion of JSON data models. Then, the current state of the
literature regarding JSON query languages and JSON algebra is evaluated. Last, we discuss
existing optimization approaches and their connections to our work.
2.1 JSON data model
Despite the prevalence of JSON in practical applications, there is no official standard for the
modelling of JSON documents in the current state of the literature. In most JSON-related
research, JSON documents are modelled by trees, i.e., JSON trees, where an important
characteristic of the tree is that it is edge-labelled [7–9]. In [8], the structure of a simple JSON
tree is described. This structure is especially useful when combined with the JSONPath query
language, in which queries select the nodes of a tree where specific path conditions are met.
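As a rough illustration of the edge-labelled tree view, the following Python sketch treats object keys and array indices as edge labels and selects the nodes reached by a given sequence of labels. It is a simplified sketch rather than an implementation of JSONPath or of the models in [7–9]: the helper names `edges` and `select`, the `"*"` wildcard convention, and the sample student document are assumptions made for this example.

```python
import json

def edges(node):
    """Yield (label, child) pairs for a node of the JSON tree.

    Object keys and array indices act as edge labels; atomic values
    are leaves and yield nothing.
    """
    if isinstance(node, dict):
        yield from node.items()
    elif isinstance(node, list):
        yield from enumerate(node)

def select(node, path):
    """Return all nodes reached from `node` by following `path`.

    `path` is a list of edge labels; the string "*" matches any label
    (a convention of this sketch only).
    """
    if not path:
        return [node]
    head, rest = path[0], path[1:]
    found = []
    for label, child in edges(node):
        if head == "*" or head == label:
            found.extend(select(child, rest))
    return found

# Hypothetical student document used only for illustration.
doc = json.loads("""
{
  "name": {"first": "Ann", "last": "Lee"},
  "courses": [
    {"title": "DB", "grade": 8},
    {"title": "Web", "grade": 7}
  ]
}
""")

first_name = select(doc, ["name", "first"])
titles = select(doc, ["courses", "*", "title"])
print(first_name, titles)
```

Because the labels sit on the edges, a path condition is simply a sequence of labels to follow from the root, which is why this representation pairs naturally with path-based query languages.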
In [9], the authors describe a JSON document through an object description that contains
path-value pairs instead of the key-value pairs that form the main structure of JSON. A path
is defined as the sequence of keys that leads to a specific value separated by dots, hence a
path-value pair is similar to a key-value pair. Converting a JSON document into an object
description results in a document like the one presented in Figure 1. In the object description,
nesting is no longer present due to the path-value pairs. For example, the first name of a
student represented in object description in Figure 1 is given by the path name.first. As
keys assure that a JSON document is determinis (...truncated)