Expressivity and Complexity of MongoDB Queries
Elena Botoeva
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy
Diego Calvanese
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy
Benjamin Cogrel
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy
Guohui Xiao¹
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy
Abstract
In this paper, we consider MongoDB, a widely adopted but not formally understood database
system managing JSON documents and equipped with a powerful query mechanism, called the
aggregation framework. We provide a clean formal abstraction of this query language, which we
call MQuery. We study the expressivity of MQuery, showing the equivalence of its well-typed
fragment with nested relational algebra. We further investigate the computational complexity of
significant fragments of it, obtaining several (tight) bounds in combined complexity, which range
from LogSpace to alternating exponential-time with a polynomial number of alternations.
2012 ACM Subject Classification Information systems → Semi-structured data, Theory of computation → Data modeling, Theory of computation → Database query languages (principles)
Keywords and phrases MongoDB, NoSQL, aggregation framework, expressivity
Digital Object Identifier 10.4230/LIPIcs.ICDT.2018.9
Related Version A full version of this paper with more details and selected proofs is available
as a technical report [3].
Acknowledgements We thank Christoph Koch, Dan Suciu, Henrik Ingo, and Martin Rezk for
helpful discussions. This research has been partially supported by the project “Ontology-based
Data Access for NoSQL Databases” (OBDAM), funded through the 2016 call issued by the
Research Committee of the Free University of Bozen-Bolzano.
1 Introduction
JavaScript Object Notation (JSON) is currently adopted extensively as the de-facto standard
format for representing nested data. JSON organizes data as semi-structured tree-shaped
documents, with a minimalistic set of node types, and as such is commonly considered
a lightweight alternative to XML. JSON documents can also be seen as complex values
[11, 1, 9, 7], in particular due to the presence of nested arrays. Consider, e.g., the document
¹ Corresponding author
© Elena Botoeva, Diego Calvanese, Benjamin Cogrel, and Guohui Xiao;
licensed under Creative Commons License CC-BY
21st International Conference on Database Theory (ICDT 2018).
Editors: Benny Kimelfeld and Yael Amsterdamer; Article No. 9; pp. 9:1–9:23
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
Listing 1: A sample JSON document in the bios collection.
{ "_id": 4,
  "awards": [
    { "award": "Rosing Prize", "year": 1999, "by": "Norwegian Data Association" },
    { "award": "Turing Award", "year": 2001, "by": "ACM" },
    { "award": "IEEE John von Neumann Medal", "year": 2001, "by": "IEEE" } ],
  "birth": "1926-08-27",
  "contribs": [ "OOP", "Simula" ],
  "death": "2002-08-10",
  "name": { "first": "Kristen", "last": "Nygaard" } }
in Listing 1, containing personal information (such as name and birth-date) about Kristen
Nygaard, and information about the awards he received, the latter stored inside an array.
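The complex-value reading of such a document can be illustrated with a minimal Python sketch. It assumes the document of Listing 1 is parsed with the standard `json` module; the variable names (`doc_text`, `last_name`, `awards_2001`) are hypothetical and serve only this illustration.

```python
import json

# The document of Listing 1, written as a JSON string.
doc_text = """
{ "_id": 4,
  "awards": [
    {"award": "Rosing Prize", "year": 1999, "by": "Norwegian Data Association"},
    {"award": "Turing Award", "year": 2001, "by": "ACM"},
    {"award": "IEEE John von Neumann Medal", "year": 2001, "by": "IEEE"}],
  "birth": "1926-08-27",
  "contribs": ["OOP", "Simula"],
  "death": "2002-08-10",
  "name": {"first": "Kristen", "last": "Nygaard"} }
"""

# Parsing yields nested dictionaries (objects) and lists (arrays).
doc = json.loads(doc_text)

# Nested object: the path name.last walks through two dictionaries.
last_name = doc["name"]["last"]

# Nested array of objects: each award is itself a small record,
# so the array behaves like a relation nested inside the document.
awards_2001 = [a["award"] for a in doc["awards"] if a["year"] == 2001]

print(last_name)
print(awards_2001)
```

The `awards` array is what makes the document a nested value rather than a flat tuple: selecting the awards of a given year requires looking inside a collection stored within a single document.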
Following its massive adoption by practitioners, JSON has recently also received attention
in the database theory community. A powerful (Turing-complete, in its full generality)
Datalog-like query language for JSON named JLogic is introduced in [12], where the expressive
power and complexity of the full language and of significant fragments are studied. In [4],
both JSON and its main schema language JSON Schema² are formalized, and their expressive
power and the computational complexity of basic computational tasks, such as satisfiability
and evaluation of expressions, are studied. Although some of the latter results apply to
the simple find query language³ of the widespread JSON-based document database system
MongoDB, still little is known about the precise formal properties of the query languages
for JSON with rich capabilities popular among practitioners, such as JSONiq [10] and
SQL++ [16].
Unlike for XML, where XQuery is the official standard query language and is also embraced
by the developer community, there is so far no standard query language for JSON.
However, in terms of adoption, the MongoDB aggregation framework⁴ is currently the most
prominent language providing rich querying capabilities over collections of JSON documents,
and hence has become the de-facto standard language for JSON. This language is modeled
on the flexible notion of a data processing pipeline, where a query consists of multiple stages,
each defining a transformation using a specific operator, applied to the set of documents
produced by the previous stage. As such, the language is very expressive and rich in features,
but it has been developed in an ad-hoc manner, resulting in some counter-intuitive behavior.
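The pipeline idea can be sketched in a few lines of Python. This is a simplified illustration and not MongoDB's implementation or its exact semantics: the helpers `match`, `unwind`, `project`, and `run_pipeline` are hypothetical names that only loosely mimic the `$match`, `$unwind`, and `$project` stages, and lists of dictionaries stand in for collections of documents.

```python
from copy import deepcopy
from functools import reduce

def match(predicate):
    """Stage that keeps only the documents satisfying `predicate`."""
    return lambda docs: [d for d in docs if predicate(d)]

def unwind(field):
    """Stage that emits one copy of each document per element of the
    array stored under `field`. Simplification: documents where `field`
    is missing, not a list, or an empty list produce no output."""
    def stage(docs):
        out = []
        for d in docs:
            values = d.get(field)
            if not isinstance(values, list):
                continue
            for v in values:
                copy = deepcopy(d)
                copy[field] = v  # replace the array by one of its elements
                out.append(copy)
        return out
    return stage

def project(*fields):
    """Stage that keeps only the listed top-level fields."""
    return lambda docs: [{f: d[f] for f in fields if f in d} for d in docs]

def run_pipeline(docs, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda current, stage: stage(current), stages, docs)

# A one-document collection based on Listing 1 (fields abridged).
bios = [{
    "_id": 4,
    "name": {"first": "Kristen", "last": "Nygaard"},
    "awards": [
        {"award": "Rosing Prize", "year": 1999, "by": "Norwegian Data Association"},
        {"award": "Turing Award", "year": 2001, "by": "ACM"},
        {"award": "IEEE John von Neumann Medal", "year": 2001, "by": "IEEE"},
    ],
}]

result = run_pipeline(bios, [
    unwind("awards"),                              # one document per award
    match(lambda d: d["awards"]["year"] == 2001),  # keep the 2001 awards
    project("_id", "awards"),                      # drop the name field
])
print(result)
```

Each stage is a function from a list of documents to a list of documents, so a query is just their composition. The order of stages matters: placing the `match` before the `unwind` in this sketch would compare against the whole array rather than a single award.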
Here, we propose a first study on the formal foundations and computational properties
of the MongoDB aggregation framework. Since JSON documents can be seen as complex
values and are closely related to XML documents, we expect the aggregation framework to
have many similarities with well-known query languages for complex values, such as monad
algebra [5, 15], nested relational algebra (NRA) [19, 8] and Core XQuery [15].
Our first contribution is a formalization of the JSON data model and of the aggregation
framework query language. We aim at achieving a good balance between the contrasting
requirements of capturing all aspects of MongoDB, and of keeping the formalization sufficiently
simple and streamlined so as to allow for a formal study of the language properties. To do
so, we deliberately abstract away some low-level features of MongoDB, which appear to be
motivated by implementation aspects and possibly by ad-hoc choices, and we make some
simplifying assumptions, commonly considered in database theory. Specifically, we adopt set
semantics (as opposed to bag or list semantics), and we abstract away from order within
² http://json-schema.org/
³ https://docs.mongodb.com/manual/crud/
⁴ https://docs.mongodb.com/manual/core/aggregation-pipeline/
documents. Our formal language, which we call MQuery, includes the match, unwind, project,
group, and lookup operators, roughly corresponding to the NRA operators select, unnest,
project, nest, and left join, respectively. In our investigation, we consider various fragments
of MQuery, which we denote by Mα, where α consists of the initials of the stages allowed
in the fragment. As a useful side-effect of our formalization effort, we point out different
“features” exhibited by MongoDB’s query language that are somewhat counter-intui (...truncated)