Precedence Parser
Introduction
The script parse.k is a tool for implementing languages with precedence relations. (a k4 version is here).
Example Language
Assignment v:e
Logic p&q, p and q, p|q, p or q, ~p, not p
relations a in b, a~b, a=b, ab, a<=b, a>=b
arithmetic a+b, a-b, a*b, a/b, a^b, -a, neg a
lists a,b, ,a, list a
comparison a min b, a max b
aggregation sum a, avg a
conditional if(a;b;c), integer = type(a), type(b)=type(c)
function f(a), g(a;b), h(a;b;c)
types 12 (int), 12.2 (float), "12.2" (symbol)
Variable API
The behavior of the parser is largely data-driven. Change the language to be parsed -> change a set of
variables. Apart from a few functions which are earmarked to be modified by the host application, the behavior of the parser is controlled by 26 read-only global variables:
O
The operator table, containing the valid operators of the language L to be parsed, grouped into classes. to add an operator of an existing class (i.e. one which has the same syntactic behavior), just append it to the appropriate sublist of O, taking care to adjust other variables accordingly. to add a new syntax class, add a new sublist containing the operators of that class, and adjust other variables accordingly.
W
A list of the 2-character symbolic operators.
S
The assignment operator.
R
Maps operator classes to types.
The type of an operator o of valence n is a list of type-vectors. Each type-vector t specifies the type of the result t[0] as a function of the types of its arguments 1_ t. The parser recognizes three primitive types: 1 (integer), 2 (float), and 4 (symbol). All character vectors are treated as symbolic atoms. When you add a new operator, you need to add type information for it to R.
V
A vector of valences, corresponding to the columns of the shift-reduce table P.
P
The rows of P correspond to operator classes.
Note that four of the syntax classes have special valence: ',' (argument separator) has valence -1, '(' has 0, ')' has -2, and Z (end of string) has 0. When one of these is detected (because valence is <= 0), a routine is invoked which reduces the operator-stack (see below) without affecting the value-stack. when ',' is detected, we throw it away; when '(', we throw nothing away; when ')' is detected, we throw it and '(' away; and when Z is detected (we're done) we throw nothing away. ',' and Z are no-ops. '(' and ')' have no semantic effect. c (constant) also has valence 0, but that's just a place-holder to keep V in sync with the columns of P. It's never used.
The basic parsing operation is to find and execute the appropriate operation given an operator of class c and an input token of class d. That operation is given by P[c;d] -- a simple lookup given the classes. In the course of parsing, there are two lists: the input parsed so far j and the input yet to be parsed i, and two stacks: the operator-stack o and the value-stack v. The parser moves tokens from i to j, and uses the two stacks to hold the current state. Parsing is therefore programmed in K as a 'converge': keep parsing until the result matches the input -- then the result tree is in v:
**|act/(i;j;o;v) /first of the last of final step
There are four atomic operations:
c shift the constant to the value-stack
s shift the operator to the operator-stack>
r construct a tree from the top of the operator-stack and n values
from the value-stack and shift the result to the value-stack
a accept the result (you're done).
There are also several error conditions, including a general "syntax error" if the result is not a tree.
There are one or two oddities:
2+3
+(2;3)
+2 3
+(2 3)
are equivalent, but
a+b
+(a;b)
while equivalent to each other, are not in general equivalent to
+a b
+(a b)
K
A translation table from the language L to K. e.g. '/' -> '%'. When you add a new operator, you may need to add a translation to K. if the translation is to a non-primitive K function, then you also need to define that function. e.g. 'sum' -> 'SUM', which is defined as +/.
J
K verbs.
W
List of multi-token operators.
U
List of ambivalent operators and their unary translations. e.g. unary minus '-' and its translation 'NEG'.
X
Numeric negation tokens; e.g. ("-";"NEG")
E
Lazily evaluated functions; e.g. ,"IFTE"
G
Grouping, usually left and right parens ("()")
L
List separator, usually either "," or ";"
M
L-expressions are tokenized using a finite state machine M and pointer-chasing: 0 M\_ic.
C
Token classes are kept in B (blank), D (dot), N (nums), A (alphas), Q (quote), F (operators) and Z (end-of-string). C is a list of all the token-classes. the matrix you see in the script is #states x #token-classes. I'll document it later, since adding new token-classes or modifying existing ones requires that you redefine the machine -- a tedious but conceptually simple task.
I
Gives the end-points of the token-classes C. E.g. it takes three machine states to define tokenization for a number: digits before the decimal point, the decimal point, and digits after the decimal point. I says how to group these three states into a single super-state "number".
Y
Currently not used.
Functional API
There are nine functions in the API:
token
Takes an L-expression and breaks it into tokens according to M
parse
Takes a tokenized string and returns a parse-tree
type
Takes a tokenized string and returns the type of the result
eval
Takes a parse-tree and executes it recursively
gets
Takes a parse-tree and returns a unique'd list of used variables
sets
Takes a parse-tree and returns a unique'd list of assigned variables
rep
Takes a parse tree and constructs an infix representation of the corresponding k-expression
str
Takes a parse-tree and constructs an executable k-expression
run
Tokenizes, parses, and type-checks/evaluates an L-expression
int
Interpreter: read-run-write loop. \\ to exit, \ to escape to k, e.g. \T:15 to trace.
Functional API (External)
There are five "external" functions. An application can redefine one or all of these. The script ltable.k illustrates how the parser can be used to implement simple table operations.
ERR
When an error is detected, the first argument is the input so far and the second argument is the error type. by default, the string and error are signalled to the application
GET
Takes a symbol as argument and by default returns the value of that symbol in the application
RET
Takes a symbol as argument and by default returns that symbol
SET
Takes the result of RET and a value, and by default assigns the value. note that SET:{y} is "comment" and SET:{`0:,$x;y} is "trace" (!!)
VAR
Takes a symbol as an argument and by default returns the type of that symbol in the application
LIT
Takes a literal as an argument and by default returns the type of that literal
Thanks to Thomas Niemann and Wayne Miraglia for comments and corrections.