Once every few years I end up, for one reason or another, having to implement an operator precedence parser. Getting it right is a bit fiddly. I once found a great article about how you do it and tend to just follow that but a while ago that article disappeared.
Now, again, I find myself having to implement an operator precedence parser, in yet another language, and so I dug up the old article on archive.org and decided: since I find it super useful and it’s completely gone from the web, maybe I should just host a copy here. So here it is. It’s a design doc from the sugar parser library written by Douglas Gregor. So, to be clear, I didn’t write this doc I’m just reposting it here (with a few minimal formatting changes) because it’s good and that way it’s at least available somewhere.
If anyone, particularly Douglas Gregor, has an opinion or an objection please leave a comment.
Shift/Reduce Expression Parsing
Introduction
The Sugar expression parser is similar to the class of operator-precedence parsers, but has been extended to support common requirements when parsing expressions, such as function application, confix (grouping) operators, and operator name disambiguation. Additionally, Sugar is intended to be usable without any precompiling phase, making it ideal for rapid or on-the-fly construction of expression parsers.
Expressions
For the purposes of this document, an expression is a sequence of operators and operands, where operators fall into one of the following categories:
Type | Arity | Placement | Examples |
---|---|---|---|
Prefix | Unary | Prior to operand | Unary minus |
Postfix | Unary | After operand | Factorial |
Infix | Binary | Between operands | Addition, multiplication, and division |
Confix | Unary | Surrounding operand | Parentheses, half-open ranges |
Function application | Binary | After first operand and surrounding second operand | Mathemetical functions (sin(x)), array indexes(a[5]) |
The confix and function application operators are essentially split into their component parts, an open symbol and a close symbol, during the parsing phase. The “open” symbol will occur on the left-hand side and the “close” symbol will occur on the right-hand side.
Constructing the parser
The expression parser is a shift/reduce parser with zero lookahead that utilizes two separate stacks: one for operators and one for operands. Any operands in the input stream are immediately shifted onto the operator stack; operators are immediately shifted onto the operator stack only if the operator stack is empty. Otherwise, the following table determines the action of the parser depending on the type of the operator on top of the operator stack and on the type of the current operator token.
Current operator | ||||||||
---|---|---|---|---|---|---|---|---|
Prefix | Postfix | Infix | Confix Open | Confix/Function Close | Function Open | End of Input | ||
Top of Stack |
Prefix | shift | precedence | precedence | shift | reduce | precedence | reduce |
Postfix | – | reduce | reduce | – | reduce | reduce | reduce | |
Infix | shift | precedence | precedence/associativity | shift | reduce | precedence | reduce | |
Confix Open | shift | shift | shift | shift | shift | shift | reduce | |
Confix/Function Close | reduce | reduce | reduce | reduce | reduce | reduce | reduce | |
Function Open | shift | shift | shift | shift | shift | shift | reduce |
Description of parsing actions
- A shift operation pushes the current operator token onto the operator stack.
- A reduce operation pops the operator token off the top of the operator stack, and then pops the appropriate number of operands from the operand stack. Then the operator is applied to the operand(s) and the result is pushed back on the operand stack. Reduction of confix operators and of function application requires popping two operators off the operator stack.
- A precedence operation compares determines the relative precedence of the operator on top of the operator stack (top) and the current operator (current).
- If top has a lower precedence than current, shift.
- If top has a higher precedence than current, reduce.
- A precedence/associativity operation first compares the precedence according to the precedence operation: if the precedence is equivalent, associativity is considered:
- If top associates left of current, reduce.
- If top associates right of current, shift.
Rejecting Invalid Expresions
Operator-precedence parsers are often not used because they accept invalid strings. The shift-reduce parser as specified above will consider the expressions x + x, + x x, and x x + equivalent, even though only the first form is correct. This weakness is easily remedied with the use of the following state machine to track what type of operator or operand is expected at any given point in time.
Disambiguation of Operator Names
Within many domains, certain operators are reused in different contexts. Several obvious examples are the unary and binary minus operators that use the same symbol ‘-‘, the absolute-value confix operator that uses the symbol ‘|’ as both its open and close symbol, and the ‘+’ operator for regular expressions that is both a postfix positive closure operator and an infix operator for specifying alternatives.
Disambiguation of operator names is in many cases directly related to the state machine used to identify invalid sequences. Given any operator name, we determine the set of operator types that it may belong to. We then intersect this with the set of operator types that are valid at our current state within the state machine to determine role(s) this operator may play in this context. Several cases are left ambiguous by this intersection. These cases are considered below with either a specific resolution or are considered impossible by this class of parser.
Disambiguation at this phase requires lookahead of one additional token, and is also based on the state machine. Disambiguation is possible when the possible meanings of the operator differ in the states that will result from their interpretation. For instance, if a given operator is both postfix and infix, the postfix interpretation would remain in the post-operand state whereas the infix interpretation would transfer to the pre-operand state. Looking ahead one symbol, we can determine if the next symbol would be valid in either state: if it is valid in only one of the resulting states, we can disambiguate the prior (non-lookahead) symbol to ensure that the appropriate state is reached so that the lookahead symbol will not result in a parse error.
- {prefix, confix open}: ambiguous (requires arbitrary lookahead).
- {postfix, infix}: single lookahead disambiguation based on state.
- {confix/function close, infix}: single lookahead disambiguation based on state.
- {function open, infix}: ambiguous (requires arbitrary lookahead).
- {postfix, confix/function close}: ambiguous (requires arbitrary lookahead).
- {postfix, function open}: single lookahead disambiguation based on state.
- {function open, function/confix close}: single lookahead disambiguation based on state.
Parsing Examples
Mathematical Expressions
Parse the expression x * |y+z| + -3^x^y using the standard mathematical rules for precedence and associativity.
State | Operand Stack | Operator Stack | Token | Token type | Action |
---|---|---|---|---|---|
Pre | x | operand | shift | ||
Post | x | * | infix operator | shift | |
Pre | x | * | | | confix open or confix close | disambiguate as confix open, shift |
Pre | x | * (confix open |) | y | operand | shift |
Post | x y | * (confix open |) | + | infix or prefix operator | disambiguate as infix, shift |
Pre | x y | * (confix open |) + | z | operand | shift |
Post | x y z | * (confix open |) (infix +) | | | confix open or confix close | disambiguate as close, reduce |
Post | x (y+z) | * (confix open |) | | | confix open or confix close | disambiguate as close, reduce |
Post | x (|y+z|) | * | + | infix or prefix | disambiguate as infix, compare precedence, reduce |
Post | (x * (|y+z|)) | + | infix or prefix | disambiguate as infix, shift | |
Pre | (x * (|y+z|)) | (infix +) | – | infix or prefix | disambiguate as prefix, shift |
Pre | (x * (|y+z|)) | (infix +) (prefix -) | 3 | operand | shift |
Post | (x * (|y+z|)) 3 | (infix +) (prefix -) | ^ | infix | compare precedence, shift |
Pre | (x * (|y+z|)) 3 | (infix +) (prefix -) ^ | x | operand | shift |
Post | (x * (|y+z|)) 3 x | (infix +) (prefix -) ^ | ^ | infix | compare precedence, compare associativity, shift |
Pre | (x * (|y+z|)) 3 x | (infix +) (prefix -) ^ ^ | y | operand | shift |
Post | (x * (|y+z|)) 3 x y | (infix +) (prefix -) ^ ^ | end | end | reduce |
Post | (x * (|y+z|)) 3 (x^y) | (infix +) (prefix -) ^ | end | end | reduce |
Post | (x * (|y+z|)) (3^(x^y)) | (infix +) (prefix -) | end | end | reduce |
Post | (x * (|y+z|)) (-(3^(x^y))) | (infix +) | end | end | reduce |
Post | ((x * (|y+z|)) + (-(3^(x^y)))) | empty | end | end | accept |
Douglas Gregor
Last modified: Sat Aug 18 12:46:13 EDT 2001
2 Responses to Shift/Reduce Expression Parsing (by Douglas Gregor)