Non-LR(1) Precedence Cascade Grammars (Short Paper)


http://drops.dagstuhl.de/opus/volltexte/2018/9269/pdf/OASIcs-SLATE-2018-11.pdf

José-Luis Sierra
Fac. Informática. Universidad Complutense de Madrid
C/ Prof. José García Santesmases 9. 28040 Madrid, Spain
https://orcid.org/0000-0002-0317-0510

Abstract
Precedence cascade is a well-known pattern for writing context-free grammars (CFGs) that model the syntax of expression languages. According to this method, precedence levels are represented by non-terminals, and the operators' attributes determine the shape of the syntax rules. In most cases, the resulting precedence cascade grammar (PCG) has neat properties that facilitate its implementation. In particular, many PCGs are LR(1) grammars, which serve as input for conventional bottom-up parser generators. However, for some cumbersome operator tables the method does not produce such neat grammars. This paper focuses on these cumbersome operator tables by identifying several conditions leading to non-LR(1) PCGs.

2012 ACM Subject Classification Software and its engineering → Syntax
Keywords and phrases grammarware, expression grammars, grammar patterns, grammar ambiguity, LR grammars
Digital Object Identifier 10.4230/OASIcs.SLATE.2018.11
Category Short Paper
Funding This work is supported by the project grants TIN2014-52010-R and TIN2017-88092-R.

1 Introduction

Most computer languages include an expression sub-language as their most distinctive feature. This sub-language allows users to begin with a repertoire of primitive expressions and create more complex expressions by combining simpler ones. Such a combination is carried out by operators [13]. In this paper we will focus only on the most common classes of operators: binary infix, and unary prefix and postfix operators. In addition, we will adopt the conventions of the Prolog language to describe the attributes for these operators [5]:

- Each operator will have a name (e.g., +, −, ∗, ...). It will be possible to overload this name, allowing different operator definitions to share such a name.
- Each operator will belong to a precedence level. Each precedence level will be represented by a positive natural number. Operators in lower precedence levels will take priority over (i.e., will bind tighter than) operators in higher ones¹. In addition, when an operator is used to build an expression, this expression will take the precedence level of that operator. Precedence levels for basic expressions will be 0.

¹ That is, following Prolog conventions, in this paper precedence and priority of operators will be contravariant properties.

© José-Luis Sierra; licensed under Creative Commons License CC-BY
7th Symposium on Languages, Applications and Technologies (SLATE 2018).
Editors: Pedro Rangel Henriques, José Paulo Leal, António Leitão, and Xavier Gómez Guinovart
Article No. 11; pp. 11:1–11:8
OpenAccess Series in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany

(a) Operator table for a sample expression language.

Name   Precedence   Type
⊗      3            fy
⊕      3            xfy
       2            yfx
⊗      2            xfx
⊗      1            yf

(b) PCG for the descriptions in Table 1a; it is an LR(1) grammar.

E3 → ⊗ E3 | E2 ⊕ E3 | E2
E2 → E2 E1 | E1 ⊗ E1 | E1
E1 → E1 ⊗ | E0
E0 → a | (E3)

Figure 1 An operator table and its associated PCG.

- Operators will constrain the precedence levels of their arguments to be: (i) lower than their own precedence level (denoted by x in the description of the operator's argument), or (ii) lower than or equal to that precedence level (denoted by y). The fixity and the arguments' allowed precedences will together form the operator's syntactic type. Following Prolog convention, this type will take one of the following forms: (i) for infix operators, yfx, xfy, xfx; (ii) for prefix operators, fy, fx; and (iii) for postfix operators, yf, xf. This way, yfx operators are left-associative, xfy right-associative, and xfx non-associative.
  In turn, fy and yf are associative, while xf and fx are non-associative unary (prefix and postfix) operators.

All this information can be condensed into an operator table for the language. Table 1a gives an example of an operator table².

² Notice that, according to this operator table, an expression like “⊗a ⊕ a ⊕ a ⊗ a⊗” will mean “⊗(a ⊕ (a ⊕ (a ⊗ (a⊗))))”, while another one like “a ⊕ ⊗a” will be ill-formed (it should be written “a ⊕ (⊗a)”).

To model the syntax of this kind of expression language, it is possible to use a precedence cascade pattern, which is described to a greater or lesser extent in any typical textbook on compiler construction (e.g., [3, 8]). In order to describe the pattern, we will introduce the following notation:

- By ↓(i) we will denote the precedence level immediately smaller than i, or 0 if i is the smallest precedence level.
- By ⊤ we will denote the greatest precedence level.

The pattern itself is based on the following conventions (Figure 1b shows the CFG that results from applying these conventions to Table 1a):

- Each precedence level i has a non-terminal Ei associated with it that represents expressions built with operators at that level.
- Each operator in level i has a rule associated with it that characterizes the syntax of the expressions formed with that operator. This rule depends on the operator's type (below, op stands for the operator's name): (i) Ei → Ei op E↓(i) if the type is yfx; (ii) Ei → E↓(i) op Ei if it is xfy; (iii) Ei → E↓(i) op E↓(i) if xfx; (iv) Ei → op Ei if fy; (v) Ei → op E↓(i) if fx; (vi) Ei → Ei op if yf; and (vii) Ei → E↓(i) op if the type is xf.
- There is an additional rule Ei → E↓(i) for each level i.
- Finally, there is a non-terminal symbol E0 that models the basic expressions (i.e., literals, variables, function calls, etc.) and the parenthesized ones. In the sequel we will abstract all the basic expressions with a single symbol a. Thus, there will be an additional pair of rules E0 → a | (E⊤).
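The conventions above are mechanical enough to be sketched in code. The following Python fragment is only a minimal illustration under stated assumptions, not the author's implementation: the names `Op`, `pcg` and `show`, the representation of rules as lists of symbols, and the two-operator sample table (with `+` and `^` as stand-in operator names) are all introduced here for the example.

```python
from typing import NamedTuple


class Op(NamedTuple):
    name: str  # operator symbol, e.g. "+" (hypothetical sample names below)
    prec: int  # precedence level: positive, lower binds tighter
    kind: str  # Prolog-style type: yfx, xfy, xfx, fy, fx, yf or xf


KINDS = {"yfx", "xfy", "xfx", "fy", "fx", "yf", "xf"}


def pcg(table):
    """Apply the precedence cascade pattern; return {non-terminal: alternatives}."""
    for op in table:
        if op.kind not in KINDS or op.prec <= 0:
            raise ValueError(f"bad operator definition: {op}")
    levels = sorted({op.prec for op in table})

    def down(i):
        # Level immediately below i, or 0 if i is the smallest level.
        k = levels.index(i)
        return levels[k - 1] if k > 0 else 0

    rules = {}
    for i in reversed(levels):
        same, lower = f"E{i}", f"E{down(i)}"
        arg = {"y": same, "x": lower}  # y: same level allowed, x: strictly lower
        alts = []
        for op in table:
            if op.prec != i:
                continue
            if len(op.kind) == 3:        # infix: yfx, xfy, xfx
                alts.append([arg[op.kind[0]], op.name, arg[op.kind[2]]])
            elif op.kind[0] == "f":      # prefix: fy, fx
                alts.append([op.name, arg[op.kind[1]]])
            else:                        # postfix: yf, xf
                alts.append([arg[op.kind[0]], op.name])
        alts.append([lower])             # additional rule Ei -> E(down(i))
        rules[same] = alts
    top = levels[-1] if levels else 0    # greatest precedence level
    rules["E0"] = [["a"], ["(", f"E{top}", ")"]]
    return rules


def show(rules):
    """Render the rules one non-terminal per line."""
    return "\n".join(
        f"{lhs} → " + " | ".join(" ".join(alt) for alt in alts)
        for lhs, alts in rules.items()
    )


# Hypothetical sample table: a left-associative "+" above a right-associative "^".
sample = [Op("+", 2, "yfx"), Op("^", 1, "xfy")]
print(show(pcg(sample)))
```

For this sample table the sketch prints the rules E2 → E2 + E1 | E1, then E1 → E0 ^ E1 | E0, and finally E0 → a | ( E2 ). It only generates the grammar; it makes no attempt to detect whether the result is ambiguous or outside LR(1).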
(a) Operator table with multiple definitions of the infix operator .

Name   Prec.   Type
       2       yfx
       1       xfx

(b) PCG resulting from the operator table presented in Table 2a.

E2 → E2 E1 | E1
E1 → E0 E0 | E0
E0 → a | (E2)

(c) Two different parse trees for “a a”: one applies the level-2 rule E2 → E2 E1 at the root, the other derives the sentence through E2 → E1 and the level-1 rule E1 → E0 E0.

Figure 2 Example regarding multiple operator definitions with the same name and fixity.

We will refer to the CFGs produced by this pattern as precedence cascade grammars (PCGs). A well-known example of using this pattern for a real programming language is Jeff Lee's YACC grammar for ANSI C³.

For most operator tables, the PCGs are LR(1) grammars [6] suitable for typical bottom-up, YACC-like parser generators (this is the case, for instance, of the PCG in Figure 1b)⁴. However, there are also operator tables that lead to non-LR(1) grammars. Most of the time, this is due to contradictory operator definitions, which in turn produce ambiguous PCGs. Other times, no such contradictions exist, but the resulting PCGs still require more than one look-ahead symbol. In this paper we address t (...truncated)