Expressivity and Complexity of MongoDB Queries (pdf)

http://drops.dagstuhl.de/opus/volltexte/2018/8607/pdf/LIPIcs-ICDT-2018-9.pdf

Expressivity and Complexity of MongoDB Queries

Elena Botoeva
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy

Diego Calvanese
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy

Benjamin Cogrel
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy

Guohui Xiao (corresponding author)
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy

Abstract

In this paper, we consider MongoDB, a widely adopted but not formally understood database system managing JSON documents and equipped with a powerful query mechanism, called the aggregation framework. We provide a clean formal abstraction of this query language, which we call MQuery. We study the expressivity of MQuery, showing the equivalence of its well-typed fragment with nested relational algebra. We further investigate the computational complexity of significant fragments of it, obtaining several (tight) bounds in combined complexity, which range from LogSpace to alternating exponential-time with a polynomial number of alternations.

2012 ACM Subject Classification: Information systems → Semi-structured data, Theory of computation → Data modeling, Theory of computation → Database query languages (principles)

Keywords and phrases: MongoDB, NoSQL, aggregation framework, expressivity

Digital Object Identifier: 10.4230/LIPIcs.ICDT.2018.9

Related Version: A full version of this paper with more details and selected proofs is available as a technical report [3].

Acknowledgements: We thank Christoph Koch, Dan Suciu, Henrik Ingo, and Martin Rezk for helpful discussions. This research has been partially supported by the project “Ontology-based Data Access for NoSQL Databases” (OBDAM), funded through the 2016 call issued by the Research Committee of the Free University of Bozen-Bolzano.

1 Introduction

JavaScript Object Notation (JSON) is currently adopted extensively as the de-facto standard format for representing nested data.
JSON organizes data as semi-structured tree-shaped documents, with a minimalistic set of node types, and as such is commonly considered a lightweight alternative to XML. JSON documents can also be seen as complex values [11, 1, 9, 7], in particular due to the presence of nested arrays. Consider, e.g., the document in Listing 1, containing personal information (such as name and birth-date) about Kristen Nygaard, and information about the awards he received, the latter stored inside an array.

Listing 1 A sample JSON document in the bios collection.

    { "_id": 4,
      "awards": [
        { "award": "Rosing Prize", "year": 1999, "by": "Norwegian Data Association" },
        { "award": "Turing Award", "year": 2001, "by": "ACM" },
        { "award": "IEEE John von Neumann Medal", "year": 2001, "by": "IEEE" }
      ],
      "birth": "1926-08-27",
      "contribs": [ "OOP", "Simula" ],
      "death": "2002-08-10",
      "name": { "first": "Kristen", "last": "Nygaard" }
    }

Following its massive adoption by practitioners, recently JSON has also received attention in the database theory community. A powerful (Turing-complete, in its full generality) Datalog-like query language for JSON named JLogic is introduced in [12], where the expressive power and complexity of the full language and of significant fragments are studied.

Footnote 1: Corresponding author.

© Elena Botoeva, Diego Calvanese, Benjamin Cogrel, and Guohui Xiao; licensed under Creative Commons License CC-BY.
21st International Conference on Database Theory (ICDT 2018). Editors: Benny Kimelfeld and Yael Amsterdamer; Article No. 9; pp. 9:1–9:23.
Leibniz International Proceedings in Informatics. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.
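To make the complex-value reading of the sample document in Listing 1 concrete, the following Python sketch loads it and collects the values reachable along a dotted path, descending into every element of any array met on the way. The helper name values_at and its traversal rule are assumptions introduced here for illustration; they are neither the paper's formal path semantics nor MongoDB's exact behaviour.

```python
import json

# The document of Listing 1, transcribed as a JSON string.
BIO_JSON = """
{
  "_id": 4,
  "awards": [
    {"award": "Rosing Prize", "year": 1999, "by": "Norwegian Data Association"},
    {"award": "Turing Award", "year": 2001, "by": "ACM"},
    {"award": "IEEE John von Neumann Medal", "year": 2001, "by": "IEEE"}
  ],
  "birth": "1926-08-27",
  "contribs": ["OOP", "Simula"],
  "death": "2002-08-10",
  "name": {"first": "Kristen", "last": "Nygaard"}
}
"""


def values_at(value, path):
    """Hypothetical helper: return all values reachable via `path` (a list of keys).

    Arrays are traversed element by element; a missing key yields no values.
    """
    if not path:
        return [value]
    if isinstance(value, list):
        # Apply the remaining path to every array element and flatten the results.
        return [v for item in value for v in values_at(item, path)]
    if isinstance(value, dict) and path[0] in value:
        return values_at(value[path[0]], path[1:])
    return []


bio = json.loads(BIO_JSON)
award_years = values_at(bio, "awards.year".split("."))  # one value per award
last_name = values_at(bio, ["name", "last"])            # nested object access
```

Here award_years holds one year per element of the awards array, which is the sense in which a nested array makes the document a complex value rather than a flat record.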
In [4], both JSON and its main schema language JSON Schema (footnote 2) are formalized, and their expressive power and the computational complexity of basic computational tasks, such as satisfiability and evaluation of expressions, are studied. Although some of the latter results apply to the simple find query language (footnote 3) of the widespread JSON-based document database system MongoDB, still little is known about the precise formal properties of the query languages for JSON with rich capabilities popular among practitioners, such as JSONiq [10] and SQL++ [16].

Differently from XML, where XQuery is the official standard query language, embraced also by the developer community, so far there is no standard query language for JSON. However, in terms of adoption, the MongoDB aggregation framework (footnote 4) is currently the most prominent language providing rich querying capabilities over collections of JSON documents, and hence has become the de-facto standard language for JSON. This language is modeled on the flexible notion of a data processing pipeline, where a query consists of multiple stages, each defining a transformation using a specific operator, applied to the set of documents produced by the previous stage. As such, the language is very expressive and rich in features, but it has been developed in an ad-hoc manner, resulting in some counter-intuitive behavior.

Here, we propose a first study on the formal foundations and computational properties of the MongoDB aggregation framework. Since JSON documents can be seen as complex values and are closely related to XML documents, we expect the aggregation framework to have many similarities with well-known query languages for complex values, such as monad algebra [5, 15], nested relational algebra (NRA) [19, 8] and Core XQuery [15].

Our first contribution is a formalization of the JSON data model and of the aggregation framework query language.
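The pipeline idea described above, where each stage transforms the documents produced by the previous one, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stage constructors match and unwind are named after MongoDB's $match and $unwind operators but are simplified stand-ins written for this sketch, run_pipeline is a hypothetical driver, and documents are kept in Python lists rather than sets.

```python
from functools import reduce


def match(predicate):
    """Stage constructor: keep only the documents satisfying `predicate`."""
    return lambda docs: [d for d in docs if predicate(d)]


def unwind(field):
    """Stage constructor: emit one copy of a document per element of d[field].

    Simplifying assumption: documents whose `field` is missing, empty,
    or not an array produce no output.
    """
    def stage(docs):
        out = []
        for d in docs:
            items = d.get(field)
            if isinstance(items, list):
                for item in items:
                    # Copy the document, replacing the array by a single element.
                    out.append({**d, field: item})
        return out
    return stage


def run_pipeline(docs, stages):
    """Hypothetical driver: feed the output of each stage into the next one."""
    return reduce(lambda current, stage: stage(current), stages, docs)


docs = [
    {"_id": 4, "awards": [{"award": "Rosing Prize", "year": 1999},
                          {"award": "Turing Award", "year": 2001}]},
    {"_id": 5, "awards": []},
]

# Two-stage query: flatten the awards array, then keep awards from 2001.
stages = [unwind("awards"), match(lambda d: d["awards"]["year"] == 2001)]
result = run_pipeline(docs, stages)
```

Because every stage is a function from a collection of documents to a collection of documents, a query is just function composition, and an empty list of stages returns the input collection unchanged.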
We aim at achieving a good balance between the contrasting requirements of capturing all aspects of MongoDB, and of keeping the formalization sufficiently simple and streamlined so as to allow for a formal study of the language properties. To do so, we deliberately abstract away some low-level features of MongoDB, which appear to be motivated by implementation aspects and possibly by ad-hoc choices, and we make some simplifying assumptions, commonly considered in database theory. Specifically, we adopt set semantics (as opposed to bag or list semantics), and we abstract away from order within documents.

Footnote 2: http://json-schema.org/
Footnote 3: https://docs.mongodb.com/manual/crud/
Footnote 4: https://docs.mongodb.com/manual/core/aggregation-pipeline/

Our formal language, which we call MQuery, includes the match, unwind, project, group, and lookup operators, roughly corresponding to the NRA operators select, unnest, project, nest, and left join, respectively. In our investigation, we consider various fragments of MQuery, which we denote by Mα, where α consists of the initials of the stages allowed in the fragment. As a useful side-effect of our formalization effort, we point out different “features” exhibited by MongoDB’s query language that are somewhat counter-intui (...truncated)