0. Overview.

What is a language parser and what does it do? A language parser is a function which takes as input an expression in some language and produces as output a representation of the syntactic structure of that expression.

In this talk I will describe a simplified model of the 1010 expression-language parser (called "the 1010 parser" for short). Our model parser takes as input an expression in a small functional language L and produces as output a representation of the syntactic structure of that expression in the form of a nested list -- a parse-tree.

In what follows, we'll examine a model of our parser. Specifically,

 (t) the tokenizing of a simple functional expression
 (p) the production of a parse-tree from (t)

Tokenizing takes a string of undifferentiated characters and produces a flat list of tokens:

 "aa*bb-cc*dd" -> ("aa";"*";"bb";"-";"cc";"*";"dd")

Parsing takes a flat list of tokens and adds "structure" to produce a parse-tree:

 ("aa";"*";"bb";"-";"cc";"*";"dd") -> ("-"
                                       ("*";"aa";"bb")
                                       ("*";"cc";"dd"))

In addition, we'll look at how parse-trees can be validated and then evaluated:

 (v) the validation of the tree produced by (p)
 (e) the evaluation of the tree produced by (p)

Validation takes a parse-tree and computes the type of the result based on the types of the elements and the type-signatures of the functions. Evaluation takes a parse-tree and computes the result.

As noted, the L-parser is a simplified model of the 1010 parser. Appended to each section is a brief description of the differences between model and reality.

Note on my use/mention conventions: In what follows I will occasionally (and inconsistently) use single-quotes when referring to functions, automata, states, and other code- and thought-objects. Hence, the function 'foo'. I will sometimes use double-quotes when naming K strings. Hence, "a+b". But when the meaning is clear I will not: the character _.
My coding conventions in l.k and k.k are more or less these (not observed consistently):

 foo, xy   global non-entry-point lambdas
 f, g      global entry-point lambdas
 A, XYZ    global non-lambdas
 a, b      local variables
 x, y, z   implicit arguments

The implementation of the model parser/tokenizer is quite dense. I recommend exploring the code interactively with liberal use of breakpoints. For example, take the function 'tokens'

 tokens:{(*:'p _1_ k;(p:&(1_ k _lin a@&1=s)|(~=)':k:(a:!A)[&s:#:'A[]](!M)?/:y)_ x)}

and modify it to capture the result, break, and then return:

 tokens:{r:(*:'p _1_ k;(p:&(1_ k _lin a@&1=s)|(~=)':k:(a:!A)[&s:#:'A[]](!M)?/:y)_ x);`0;r}

then while in the suspended state explore the values of the variables in the console:

  t"abc+def"
 type error
 {r:(*:'p _1_ k;(p:&(1_ k _lin a@&1=s)|(~=)':k:(a:!A)[&s:#:'A[]](!M)?/:y)_ x);`0;r}
 ^
 > x
 " abc+def"
 > y
 `S `bla `nam `nam_ `nam_ `vrb `nam `nam_ `nam_
 > s
 1 2 3 1 1 1 1 1 2
 > a
 `S `nam `num `vrb `neg `bla `rpa `lpa `err
 > k
 `S `bla `nam `nam `nam `vrb `nam `nam `nam
 > r        <------- result
 (`bla `nam `vrb `nam
  (," "
   "abc"
   ,"+"
   "def"))
 > :        <------ return
 ("abc"
  ,"+"
  "def")

1. Language L.

Our model language L has names, numbers (integers and floats), strings, and three operator classes:

 Dyadic operators:                  < > = & | + - * / ^
 A single monadic operator:         ~
 And a single ambivalent operator:  -

Unlike K, it also has operator-precedence:

 | & = < > + - * / ^ ~ _

where _ is the *internal* symbol for monadic negation - (see below for an explanation of how the L parser handles ambivalent operators). The only punctuation symbols are ( and ), which override operator-precedence.

Names begin with a-zA-Z, and continue with a-zA-Z0-9.

Integers and floats are numbers. Integers are sequences of 0-9, and a float is an integer expression followed by a single . followed by an integer expression. That is, both .12 and 12. are invalid, and 0.12 and 12.0 are both valid float tokens.

A string is a sequence of characters enclosed in single quotes.
The escape character \ can be used to quote a quote, as in 'abc\'def'. Two strings must be separated by at least one character: 'abc''def' is a token error. 'abc' 'def' is valid.

2. Automata.

In this paper, my use of the term 'automaton' is somewhat idiosyncratic. For those of you who've studied formal compiler theory, it will not match up perfectly with what you're used to. My goal in this paper is to make the ideas behind tokenizing and parsing as concrete as possible, and to focus on the actual engineering techniques used in our software. With that caveat in place, let's begin.

The task of the tokenizer is to take an expression in the form of a string and bust it up into a list of strings, each of whose elements represents a token.

(Attached: l.k  k.k  t.k  l.txt)

An expression can be faulty in one of two ways: it may contain a bad token, or it may contain a bad character. (In fact, an expression may contain bad tokens AND bad characters, but the tokenizer will stop and report the first error it encounters.) A bad character is a character not in the "approved" list. A bad token is one which contains a sequence of approved characters XY where the transition from X to Y has been defined as invalid.

Our K tokenizer is based on the idea of deterministic finite-state machines, which I shall call simply "automata". Let's look at the automaton which recognizes names. A name is a sequence whose first character is in a-zA-Z and whose remaining characters (if any) are in a-zA-Z0-9.

Assume three character-classes ("syntactic types") NAM, NUM, and BLA:

 NAM:"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
 NUM:"0123456789"
 BLA:" "

Let's use the following notation in an informal "automaton description language":

 A,..,B : C -> S

for: in state A or ... or B, if you see a C then go to state S and read the character following C. And

 : C -> S

to mean: if the initial character is a C then go to state S and read the character following C.

The (minimal) name automaton has exactly two states.
An entry-point state which recognizes the first character of the name, which must be in a-zA-Z, and a second state which recognizes subsequent characters in the name, which are drawn from a-zA-Z0-9. Then our automaton for recognizing names looks like this:

         : NAM     -> nam
 nam,nam_: NAM,NUM -> nam_
 nam,nam_: BLA     -> bla

Our convention is that the entry-point state of an automaton has the same name as the automaton. It follows that each automaton has just one entry-point state. An automaton with more than a single entry-point state can always be factored into as many automata as there are entry-point states, each new automaton being simpler than the parent.

Remember that our informal descriptions of tokenizing automata are purely heuristic. We will use them to approximate the behavior we want from our tokenizer, then throw them away (but see appendix A below). In practice, for automata with complex behavior (e.g. 1010 numbers) I will sometimes use diagrams with nodes and labeled arrows, but often I will simply fill in the cells of the transition matrix described below.

Thus, in the automaton for names we assume that if the automaton encounters an invalid character, it will transition to the "bad character" state, and if it encounters any valid character other than one in BLA,NAM,NUM it will transition to the "bad token" state.

Now suppose that our name automaton is given the string "ab20 ". Initially, the automaton is in the start state. On encountering the first character "a" it transitions to the nam state. The next character is "b", so the automaton will transition to the nam_ state. Next "2", so it will remain in the nam_ state. Similarly for "0". Finally, it will see a blank, and transition to the bla state. The sequence of states can be represented as (i):

 (i) "a" -> nam
     "b" -> nam_
     "2" -> nam_
     "0" -> nam_
     " " -> bla.

Start, see "a", go to nam, see "b", ...
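As a cross-check on the informal description, here is a small Python sketch of the name automaton as an explicit state-transition table. It is not part of l.k; the helper names (char_class, states, T) are mine, and the table covers only the NAM/NUM/BLA fragment discussed above.

```python
# Sketch of the minimal name automaton: state 'nam' for the first
# character of a name, 'nam_' for the rest, 'bla' for blanks.
NAM = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
NUM = set("0123456789")
BLA = set(" ")

def char_class(c):
    if c in NAM: return "NAM"
    if c in NUM: return "NUM"
    if c in BLA: return "BLA"
    return "CHR"                                 # bad character

# transitions: (state, class) -> next state; "S" is the start state
T = {("S",    "NAM"): "nam",
     ("nam",  "NAM"): "nam_", ("nam",  "NUM"): "nam_", ("nam",  "BLA"): "bla",
     ("nam_", "NAM"): "nam_", ("nam_", "NUM"): "nam_", ("nam_", "BLA"): "bla",
     ("bla",  "BLA"): "bla",  ("bla",  "NAM"): "nam"}

def states(s):
    """Scan s, returning the state after each character."""
    st, out = "S", []
    for c in s:
        st = T.get((st, char_class(c)), "tok")   # undefined transition -> token error
        out.append(st)
    return out

print(states("ab20 "))   # ['nam', 'nam_', 'nam_', 'nam_', 'bla']
```

This is exactly the walk (i) above: start, see "a", go to nam, and so on.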
Now let's add a simple automaton for recognizing blanks:

    : BLA -> bla
 bla: BLA -> bla
 bla: NAM -> nam

and see what happens when we feed the resulting two-automaton system the string "ab20 c4 ". The sequence of states traversed by the system as it consumes this string is (i) followed by (ii):

 (ii) "c" -> nam
      "4" -> nam_
      " " -> bla.

Now, if we list just the states from the concatenation of (i) and (ii) we have (iii):

 (iii) nam nam_ nam_ nam_ bla nam nam_ bla.

Grouped by automaton:

 automaton state character
 --------- ----- ---------
 name      nam   a
 name      nam_  b
 name      nam_  2
 name      nam_  0
 blank     bla
 name      nam   c
 name      nam_  4
 blank     bla

Finally, to turn the input string into a list of token-strings, partition the expression-string by the automata, prefixing the list with ` to indicate the start automaton:

  (&(~=)':``name`name`name`name`blank`name`name`blank)_"ab20 c4 "
 ("ab20"
  ," "
  "c4"
  ," ")

3. The L-Tokenizer.

In this section we'll implement the design outlined above. First, we'll need to define the relevant token-classes:

 BLA:" "
 RPA:")"
 LPA:"("
 DOT:"."
 NOT:"~"
 NEG:"-"
 QUO:"'"
 ESC:"\\"
 VRB:"<>=&|+*/^"
 NAM:"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
 NUM:"0123456789"
 CHR:_ci[!256]_dvl BLA,RPA,LPA,DOT,VRB,NOT,NEG,NAM,NUM,QUO,ESC
 NEG_:"_"

Our implementation will assume that the set of token-classes is mutually exclusive and jointly exhaustive. Hence the definition of CHR as a bucket for all the characters which are not members of the valid token-classes. (We will use NEG_ in the tokenizing of negation.)

Next, we'll represent the system of automata as a state x class matrix:

 TR:" .    .    NAM  NUM  DOT  VRB  NOT  NEG  BLA  RPA  LPA  QUO  ESC  CHR
  S    S    nam  num  tok  tok  vrb  neg  bla  tok  lpa  quo  tok  chr
  nam  nam  nam_ nam_ tok  vrb  vrb  vrb  bla  rpa  tok  quo  tok  chr
  nam  nam_ nam_ nam_ tok  vrb  vrb  vrb  bla  rpa  tok  quo  tok  chr
  num  num  nam  num  dot  vrb  vrb  vrb  bla  rpa  tok  quo  tok  chr
  num  dot  tok  num_ tok  tok  tok  tok  tok  tok  tok  tok  tok  chr
  num  num_ nam  num_ tok  vrb  vrb  vrb  bla  rpa  tok  quo  tok  chr
  vrb  vrb  nam  num  tok  tok  tok  neg  bla  tok  lpa  quo  tok  chr
  neg  neg  nam  num  tok  tok  tok  neg  bla  tok  lpa  quo  tok  chr
  bla  bla  nam  num  tok  vrb  vrb  neg  bla  rpa  lpa  quo  tok  chr
  rpa  rpa  nam  num  tok  vrb  vrb  vrb  bla  rpa  tok  quo  tok  chr
  lpa  lpa  nam  num  tok  tok  vrb  neg  bla  tok  lpa  quo  tok  chr
  quo  quo  quo  quo  quo  quo  quo  quo  quo  quo  quo  quo_ esc  chr
  quo  esc  quo  quo  quo  quo  quo  quo  quo  quo  quo  quo  quo  chr
  quo  quo_ nam  num  tok  vrb  vrb  vrb  bla  rpa  lpa  tok  tok  chr
  err  tok  .    .    .    .    .    .    .    .    .    .    .    .
  err  chr  .    .    .    .    .    .    .    .    .    .    .    ."

Note that TR is a character-vector with embedded new-lines. We'll use the function 'tr' to extract the required information. The first column lists the automata in the system. The second column lists the states of each automaton. Thus, the num automaton has three states, num, dot, and num_. S is the start automaton, and it has just one state S. Ideally, we'd like the start automaton and start state to be nameless, but our implementation will rely on a two-dimensional dictionary having the structure automaton-state x token-class, and K disallows dictionary keys which aren't valid variable names.

Note in particular the error automaton err, which has two states: tok, for token errors, and chr, for character errors. Once the tokenizer has entered one of the error states it has nowhere to go, so we just use . to indicate "no continuing state".
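Before deriving the K maps from TR, the cutting step sketched in section 2 -- split the input string wherever the per-character automaton changes -- can be illustrated in Python. This is a toy stand-in, not the 'tokens' function; the helper name cut_tokens is mine.

```python
# Sketch: given the automaton name for each character, cut the input
# string at every point where the automaton changes.
def cut_tokens(s, autos):
    """s: input string; autos: automaton name per character of s."""
    toks, start = [], 0
    for i in range(1, len(s)):
        if autos[i] != autos[i - 1]:   # automaton boundary -> token boundary
            toks.append(s[start:i])
            start = i
    toks.append(s[start:])
    return toks

print(cut_tokens("ab20 c4 ",
                 ["name"] * 4 + ["blank"] + ["name"] * 2 + ["blank"]))
# ['ab20', ' ', 'c4', ' ']
```

Note that this captures only the first of the two cutting rules used by 'tokens' below: it does not split runs within single-state automata (so "++" would wrongly stay one token here).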
Our implementation strategy will be to derive a set of global maps from TR:

 tr:{a:`$cut[" "]'sqz cut["\n"]deb x;s:1_ a[;1];k:1_ a[;0]
  D:{.+(x;("",.:)'x)}2_*a;M:.+(s;(!D){.+(x;y)}/:1_2_'a);A:.+(?k;s@=k)
  (A;M;D)}

 @[_d;`A`M`D;:;tr TR];

A is a dictionary which maps automata to states. For example:

  A`num
 `num `dot `num_

The transition matrix M is a two-dimensional dictionary of states. For example:

  M[`num;`DOT]
 `dot
  M[`dot;`NUM]
 `num_

M maps pairs of automaton-states and token-classes to states. If the tokenizer is in the num state and sees a DOT, then transition to the dot state; then if the tokenizer sees a NUM, transition to the num_ state.

The dictionary D maps token-classes to vectors of characters:

  !D
 `NAM `NUM `DOT `VRB `NOT `NEG `BLA `RPA `LPA `QUO `ESC `CHR

D[x] is a list of characters in class x. For example:

  D`VRB
 "<>=&|+*/^"

So we have:

 A : automata -> states
 D : token-classes -> character-vectors
 M : states x token-classes -> states

The tokenizer consists of five functions and one static variable:

 t:{neg_[1_ neg . tokens[x]`S token\x:BLA,x]_dv,BLA}                                / tokenizer
 token:{:[(r:M[x](!D)(y _in/:D[])?1)_in`tok`chr;err[y]r;r]}                         / character -> state
 tokens:{(*:'p _1_ k;(p:&(1_ k _lin a@&1=s)|(~=)':k:(a:!A)[&s:#:'A[]](!M)?/:y)_ x)} / states -> tokens
 N:`vrb`neg`bla`lpa,\:`neg`num                                                      / negation patterns
 neg:{{@[x;y 1 2;:;x[,0],,,/x 1_ y]}/[y;i@&|/N~/:\:x i:(!3)+/:!-2+#x]}              / N neg num -> N bla num
 neg_:{@[x;&{(&/y _lin VRB,NOT,LPA,NEG)&x~,NEG}':LPA,x;:[;,NEG_]]}                  / - x -> _ x

't' is the entry-point. We will adopt this convention in what follows: lower-case single letters for entry-point functions. The 'token' function computes the sequence of automaton-states of the input string x, and 'tokens' uses that result to chop the string into token substrings.
We can see the result of scanning 'token' over an input string x:

  `S token\"abc+def"
 `S `nam `nam_ `nam_ `vrb `nam `nam_ `nam_

The function takes two arguments: x, a symbol of the current automaton-state, and y, the next character to be scanned. Since M is a two-dimensional dictionary of automaton-states x syntax-classes, 'token' must find the class of character y, then use x and y as indices into M to compute the next state r. If r is `tok or `chr then signal an error, else return r.

Once `S token\x has computed the automaton-state for each character in x, it passes it and x to 'tokens'. The function 'tokens' computes two boolean vectors which it uses to cut x into tokens. The two vectors correspond to the two different cases. I've expanded two of the intermediate assignments to make the logic more evident:

 A = (~=)':k:(!A)[&#:'A[]](!M)?/:y  = mark the points at which we transition from one automaton to another
 B = 1_ k _lin (!A)@&1=#:'A[]       = mark the single-state automata (blanks, verbs, and parentheses)

The disjunction p:&A|B is then used to cut the string into tokens.

  x:"abc+-20.2"
  y:`S token\x
  y
 `S `nam `nam_ `nam_ `vrb `neg `num `num `dot `num_
  tokens[x]y
 (`nam `vrb `neg `num
  ("abc"
   ,"+"
   ,"-"
   "20.2"))

The result is a list of (automata;tokens).

In this language (and in the 1010 expression language) the only ambivalent verb is -. We can have x-y as well as -x, and in the monadic case we can have a pair of tokens (-;E) where E is an expression or a single token -n where n is a number. In the first case - is a function and in the second it's part of the number:

 x - y    ->  x - y
 - a + b  ->  neg a + b
 - 12     ->  -12

where the spaces delimit the tokens in each case. Before parsing proper occurs, we want to identify and resolve cases of ambivalence, and to do that we use two functions called sequentially. The 'neg' function takes the automata-vector and the token-list and detects and transforms instances of numeric negation.
For example, the input string a+-3 is cut into

 (," "
  ,"a"
  ,"+"
  ,"-"   <--
  ,"3")

and then matched against triples of automata which find occurrences of "-" adjacent to number tokens:

  a
 `bla `nam `vrb `neg `num
           ^^^^^^^^^^^^^^

These triples are then transformed:

 (," "
  ," "
  ,"a"
  ,"+"
  "-3")  <--

The 'neg_' function identifies monadic uses of - and replaces them with _.

 ("a"
  ,"-"
  ,"-"   <--
  "b"
  ,"*"
  "-3")

is transformed to:

 ("a"
  ,"-"
  ,"_"   <--
  "b"
  ,"*"
  "-3")

Finally, singleton blanks are deleted from the result. All three cases in a single expression:

  t"a---3"
 (,"a"
  ,"-"   <-- subtraction
  ,"_"   <-- negation
  "-3")  <-- negative

Note: The 1010 parser uses a different, and more efficient, technique for tokenizing. Consider just the VRB column of the TR variable. The characters in the VRB category are < > = & | ~ + * / ^. The ASCII values for these symbols are:

  _ic VRB
 60 62 61 38 124 126 43 42 47 94

Similarly, the states of the transition matrix are:

  !M
 `S `nam `nam_ `num `dot `num_ `vrb `neg `bla `rpa `lpa `tok `chr

which can be enumerated as:

  !#!M
 0 1 2 3 4 5 6 7 8 9 10 11 12

Now, instead of generating a two-dimensional dictionary M, we have the 'tr' function produce F, a (#-of-states) x (256-integer) matrix, where the values in the cells are row indices. Columns in F correspond to the ASCII values of the characters in the categories:

  !D
 `NAM `NUM `DOT `VRB `NEG `BLA `RPA `LPA `CHR

With this representation in place we can use the much more efficient "matrix scan":

 r:0 F\_ic s

where s is the input string and r is the resulting state-transition vector. The attached script m.k implements this version of the tokenizer. Note also that K does not implement D\ for a two-dimensional dictionary D.

4. Bottom-Up Shift-Reduce Parsing.

The 1010 parser is a form of operator-precedence parsing. A very good semi-technical explanation can be found here: http://epaperpress.com/oper/index.html. (Thomas Niemann's paper was the inspiration behind the first version of the parser we're using at 1010.)
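Returning for a moment to the matrix-scan note above: the idea behind r:0 F\_ic s -- seed state 0, then repeatedly index a states x 256 matrix by the current state and the next character code -- can be imitated in Python. The tiny F below (letters vs. blanks only) is my stand-in for the full table that 'tr' builds in m.k.

```python
# Imitation of the K matrix scan 0 F\_ic s: fold over character codes,
# indexing F[state][code] at each step and recording the states.
NSTATES, NCHARS = 3, 256          # 0 = start, 1 = name, 2 = blank
F = [[0] * NCHARS for _ in range(NSTATES)]
for st in range(NSTATES):
    for code in range(ord('a'), ord('z') + 1):
        F[st][code] = 1           # any lower-case letter -> name state
    F[st][ord(' ')] = 2           # blank -> blank state

def scan(s):
    """Return the state after each character of s, starting from state 0."""
    st, out = 0, []
    for c in s:
        st = F[st][ord(c)]        # one table lookup per character
        out.append(st)
    return out

print(scan("ab c"))   # [1, 1, 2, 1]
```

The point of the representation is that the inner loop is pure indexing, with no class lookup per character; in K the whole loop collapses into a single scan over the matrix.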
In this section I'll take an independent tour of the ideas involved. Together with Niemann's paper and the code in the l.k script you should be in a position to completely understand the technique and devise similar parsers for your own applications.

Our parser takes a list of tokens and produces a parse-tree, a recursive list of the form

 ("operator-of-valence-n";a1;..;an)

where ai is either a parse-tree or a string representing either a name or a number. Two examples:

 (a)
  p t"a*b+c"
 ("+"
  ("*";,"a";,"b")
  ,"c")

 (b)
  p t"a+b*c"
 ("+"
  ,"a"
  ("*";,"b";,"c"))

As you can see from this example * binds tighter than +. In other words, * has higher precedence than +. The shift-reduce parser implements this idea by using four operations to produce a parse-tree from a list of tokens. The four operations are:

 s = shift
 r = reduce
 n = noun
 a = accept

The parser moves left-to-right across the list x of tokens, and each pair x i+0 1 is used to look up the appropriate operation. The operation table also implicitly defines the precedence relations and the associativity, right or left, of each operator. For example, in the tiny language consisting of nouns (i.e. names and numbers) and the two operators + and *, in which * has higher precedence than +, we can define the relations using four variables:

 Z:_ci 255     / end of input/stack
 O:"+*",Z      / operators +*Z

 / shift-reduce table: s = shift -> right assoc, r = reduce -> left assoc
 S:("rsr"      / +
    "rrr"      / *
    "ssa")     / Z
 /    +*Z

 / operator valences
 V:2 2 0

S is an operator-class x operator-class table. In this tiny language, each operator-class contains just one operator, and for convenience we count Z, the end of input marker, as an operator. Tokens which are not one of + * or Z are nouns, and the operation look-up function (see below) will default to the "n" operation in that case.
In order to see how the parser actually works, let's trace the parsing of both of our examples:

  p t"a*b+c"
 n |       | a*b+c |    | ()
 s | a     | *b+c  |    | ,,"a"
 n | a*    | b+c   | *  | ,,"a"
 r | a*b   | +c    | *  | (,"b";,"a")
 s | a*b   | +c    |    | ,("*";,"a";,"b")
 n | a*b+  | c     | +  | ,("*";,"a";,"b")
 r | a*b+c |       | +  | (,"c";("*";,"a";,"b"))
 a | a*b+c |       |    | ,("+";("*";,"a";,"b");,"c")
 ("+"
  ("*";,"a";,"b")
  ,"c")

  p t"a+b*c"
 n |       | a+b*c |    | ()
 s | a     | +b*c  |    | ,,"a"
 n | a+    | b*c   | +  | ,,"a"
 s | a+b   | *c    | +  | (,"b";,"a")
 n | a+b*  | c     | *+ | (,"b";,"a")
 r | a+b*c |       | *+ | (,"c";,"b";,"a")
 r | a+b*c |       | +  | (("*";,"b";,"c");,"a")
 a | a+b*c |       |    | ,("+";,"a";("*";,"b";,"c"))
 ("+"
  ,"a"
  ("*";,"b";,"c"))

The first column is the operation: n, s, r, or a. The second column is the segment of the expression processed so far. We'll use this when signalling an error. The third column is the remaining unprocessed segment of the input string. The fourth column is the operator stack. The fifth column is the value stack, where we build the output tree in the course of parsing.

In both examples, the first three steps are:

 n: push a onto the value stack
 s: shift the operator (* or +) onto the operator stack
 n: push b onto the value stack

Let's work through the first example. Since * has precedence over +, we reduce the operator stack:

 r: push * from the operator stack onto the value stack
 s: shift + onto the operator stack
 n: shift c onto the value stack
 r: push + from the operator stack onto the value stack
 a: end of string - we're done

Now compare this with the second example:

 s: shift * onto the operator stack (we now have two operators on the operator stack)
 n: shift c onto the value stack
 r: push * from the operator stack onto the value stack
 r: push + from the operator stack onto the value stack
 a: end of string - we're done

Shift and Reduce operations implement (via stacks) the precedence relations of + and *.
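The four-operation loop traced above can be sketched in Python for the tiny +/* language. This is an illustration, not the K implementation ('act' and H in section 5); the names OPS, parse, ostack, vstack are mine, and "$" stands in for the end marker Z.

```python
# Shift-reduce sketch: nouns plus '+' and '*', with '*' binding tighter.
# S[stack-class][input-class] picks the operation, as in the text.
OPS = "+*$"
S = {"+": {"+": "r", "*": "s", "$": "r"},
     "*": {"+": "r", "*": "r", "$": "r"},
     "$": {"+": "s", "*": "s", "$": "a"}}

def parse(tokens):
    inp = list(tokens) + ["$"]      # input with end marker appended
    ostack, vstack = ["$"], []      # operator stack, value stack
    while True:
        op = "n" if inp[0] not in OPS else S[ostack[0]][inp[0]]
        if op == "n":               # noun: push token onto value stack
            vstack.insert(0, inp.pop(0))
        elif op == "s":             # shift operator onto operator stack
            ostack.insert(0, inp.pop(0))
        elif op == "r":             # reduce: pop operator, build a tree node
            o = ostack.pop(0)
            b, a = vstack.pop(0), vstack.pop(0)
            vstack.insert(0, [o, a, b])
        else:                       # accept: value stack holds the parse-tree
            return vstack[0]

print(parse("a*b+c"))   # ['+', ['*', 'a', 'b'], 'c']
print(parse("a+b*c"))   # ['+', 'a', ['*', 'b', 'c']]
```

Stepping through "a*b+c" by hand reproduces the trace above: the r at step 4 fires because S["*"]["+"] is "r", i.e. * on the stack outranks + in the input.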
Note: The 1010 parser contains an elaborate post-tokenizing/pre-parsing function, designed to handle bracket-notation for lists and dictionaries, transform infix dyadic catenation to prefix polyadic catenation, &c. But the basic idea is the same: transform a list of raw tokens into a list of processed tokens suitable for parsing.

*** Challenge: You're given a list q of tokens, including left and right parentheses:

  :q:t"a+((b*c)-d)/e"
 (,"a"
  ,"+"
  ,"("
  ,"("
  ,"b"
  ,"*"
  ,"c"
  ,")"
  ,"-"
  ,"d"
  ,")"
  ,"/"
  ,"e")

Write a function 'n' which nests on parentheses:

  n q
 (,"a"
  ,"+"
  ((,"b"
    ,"*"
    ,"c")
   ,"-"
   ,"d")
  ,"/"
  ,"e")

5. The L-Parser.

Our implementation consists of two main functions:

 p:{**|act/("";x,,Z;,Z;"")}

The monadic 'act' function is applied repeatedly on a list containing:

 the segment of the expression processed so far
 the remaining unprocessed segment of the input string
 the operator stack
 the value stack

(These elements exactly correspond to the columns of the trace shown above.) The result (when the function converges) is the parse-tree, which is the first of the last element of the processed list. The 'act' function is:

 act:{t:(S,'"n"). cls'x[2 1;0];if[T;trace[t]. x];H[`$t]. x}

First: look up the operation in S based on the class of the first element on the operator stack (x[2;0]) and the first element of the unprocessed input string (x[1;0]). If an element has no class (for example, the element is a name or number), the 'n' operator will be selected. The 'cls' function uses O to determine the class of the operator:

 cls:{(|/'x _in/:O)?1}

The result is either a valid index into the row or column dimension of the S matrix, or 1 + the count of that dimension. Having determined the type t, we may trace the current state of x. Finally, apply the appropriate operation to x. Each parser operation takes the current state (broken up into four arguments j i o v) and returns the next state. Based on the examples given above, these should be fairly easy to decipher.
/ parser transition operators
 H.a:{[j;i;o;v](j;i;o;v)}                                      / accept
 H.s:{[j;i;o;v](j,*i;1_ i;i[,0],o;v)}                          / shift operator -> o-stack
 H.r:{[j;i;o;v]:[0>n:rv[j;i;o;v];rp[j;i;o;v;n];ro[j;i;o;v;n]]} / reduce o -> v-stack

 / reduction ops
 rv:{[j;i;o;v]:[(#v)<n:V cls *o;err[j]"valence";n]}            / valence of o or error
 ro:{[j;i;o;v;n](j;i;1_ o;(,o[0],|n#v),n _ v)}                 / o = f v1 .. vn
 rp:{[j;i;o;v;n](j;i;(-n)_ o;v)}                               / o = ( or )

The full precedence table for L is:

 / end-of-input/stack
 Z:_ci 255

 / semantic classes
 O:"",/:/:("|";"&";"=<>";"+",NEG;"*/";"^";"~",NEG_;LPA;RPA;Z)  / |&=+*^~()Z

 / shift-reduce table: s = shift -> right assoc, r = reduce -> left assoc
 S:("rsssssssrr"  / |
    "rrssssssrr"  / &
    "rrrsssssrr"  / = < >
    "rrrrssssrr"  / + -
    "rrrrrsssrr"  / * /
    "rrrrrrssrr"  / ^
    "rrrrrrrsrr"  / ~ _
    "sssssssssR"  / (
    "rrrrrrrrrr"  / )
    "ssssssssLa") / Z
 /    |&=+*^~()Z

 / operator valences
 V:2 2 2 2 2 2 1 0 -2 0 0
 / | & = + * ^ ~ ( ) Z c

Observe how the parser handles parenthesis-reduction. ( has valence 0 and always causes a shift. ) has valence -2, and always causes a reduction. The L and R operations are used to detect missing left and right parentheses.

Note as well the following "quirk" in the shift-reduce method: the parser will also accept expressions in reverse polish notation!

  p t"2+3"
 n |     | 2+3 |   | ()
 s | 2   | +3  |   | ,,"2"
 n | 2+  | 3   | + | ,,"2"
 r | 2+3 |     | + | (,"3";,"2")
 a | 2+3 |     |   | ,("+";,"2";,"3")
 ("+";,"2";,"3")

  p t"2 3+"
 n |     | 23+ |   | ()
 n | 2   | 3+  |   | ,,"2"
 s | 23  | +   |   | (,"3";,"2")
 r | 23+ |     | + | (,"3";,"2")
 a | 23+ |     |   | ,("+";,"2";,"3")
 ("+";,"2";,"3")

(FORTH anyone?)

*** Challenge: Can you suggest a way to block the parsing of expressions in RPN (but not their infix equivalents)?

Note: The 1010 parser uses a much more elaborate operator scheme, and incorporates typing and argument-transformation (semantic information) into the parsing of expressions. This is an historical feature of the 1010 parser. In our model we separate out the semantic typing and validation (see the next section).

6. Typing.
Typing and validation in our model parser is relatively straightforward: given a parse-tree, recursively evaluate down the tree to the leaves. A leaf is a string x representing either a variable or a number, so provide some method for computing its type:

 _abs 4:. x

If the node is not a leaf, then it must have the form:

 (o;v1;..;vn)

so provide some method for computing the type of the result of applying o to arguments of types (t1;..;tn) of (v1;..;vn):

 R:+(("|&+-*" ;(1 1 1;2 2 2;2 2 1;2 1 2))
     ("=<>"   ;(1 1 1;1 2 2;1 2 1;1 1 2))
     ("/^"    ;(2 1 1;2 2 2;2 2 1;2 1 2))
     ("~_"    ;(1 1;2 2)))

where operators are paired with a list of (r;t1;..;tn). Thus, given o and (t1;..;tn), find the appropriate (r;t1;..;tn). r is the type of the result of applying o to arguments of those types. The code is then:

 v:{:[4:x;_abs 4:. x;res[*x;_f'1_ x]R[1;(x[0]_in/:*R)?1]]}
 res:{:[(#z)>i:(y~/:1_'z)?1;z[i;0];err[x]"type"]}

For example:

  x:10;y:20.3
  v p t"x+y"
 2

Note: The recursive function 'v' exemplifies a pattern used in the 1010 parser for several pre- and post-processing routines. A second common pattern uses the "prior" transformation.

7. Evaluation.

Finally, the evaluation method 'e' is a simple recursive function which takes a parse-tree and returns the result of evaluating it:

 e:{:[4:x;. x;(.{:[x=NEG_;NEG,":";x]}@*x). _f'1_ x]}

Note that our stand-in "_" for monadic negation is finally replaced with "-", but only at the point where we need to apply the operator.

  e p t"x+y"
 30.3

Note: In the 1010 parser we construct an executable string and return that to the accum process. For an example of this method, see the function 'r' in the section below on the K parser.

8. Errors.

The parser recognizes six error-types:

 token
 character
 valence
 missing )
 missing (
 type

The error function is:

 err:{'x,"\n",((0|-1+#x)#""),"^ ",$y}

Here are examples:

 token:

  t"1..2"
 .
^ tok

 character:

  t"1#2"
 #
 ^ chr

 valence:

  p t"a+"
 a+
  ^ valence

 missing ) - this is the R operation:

  p t"(123"
 (123
    ^ missing )

 missing ( - this is the L operation:

  p t"123)"
 123)
    ^ missing (

 type:

  a:10;b:"a string"
  v p t"a+b"
 +
 ^ type

A. Automaton Description Language.

The global maps derived from TR can be used to generate a script in "automaton description language":

 / automaton description language
 a:{[M;A](ift .)',/automaton[M]'[!A;A[]]}
 automaton:{[M;a;s]start[M`S;a],mesh[a]@,/s state[M`S]'M s}
 start:{[S;a],(`;(!S)@&a=S[];a)}
 mesh:{y[=y[;1 2];0]{(,x),y}'?y[;1 2]}
 state:{[S;s;d]+(s;(!d)@=e;?e:d[])}
 ift:{[s;c;t]glu[",";$s],":",glu[",";$c],"->",$t}

 `0:a[M]A

For example, to generate the script for the number-tokenizing automaton:

  `0:automaton[M;`num]A`num
 :NUM->num
 num,num_:NAM->nam
 num:NUM->num
 num:DOT->dot
 num,num_:VRB,NEG->vrb
 num,num_:BLA->bla
 num,num_:RPA->rpa
 num:LPA->tok
 num,dot,num_:CHR->chr
 dot:NAM,DOT,VRB,NEG,BLA,RPA,LPA->tok
 dot,num_:NUM->num_
 num_:DOT,LPA->tok

The script contains two expression-forms:

 :CLASS->state
 from-states:CLASSES->to-state

The first form identifies an automaton and its entry-point state. The second form identifies a set of state transitions. Observe that some transitions send the system into states of other automata. For example, the transition num_:DOT,LPA->tok says that if the num automaton is in state num_, then seeing a . or a ( will send it into the tok state of the error automaton.

*** Challenge: write the inverse of a. That is:

 (M;A)~m[D]a[M]A

That is, m regenerates M and A from the script. Hint: use t, p, v and e!

B. Two-by-two Parsing for Iverson Notation (K).

Although it is possible to use the Shift-Reduce method to parse K and K-like languages (J, APL, A), there is an alternative method which I prefer due to its overall simplicity and conceptual transparency.
Here is a link to an article which explains the method in detail: http://nsl.com/papers/kparse.txt

I've also attached a script k.k which combines the tokenizing method of l.k with the two-by-two parsing method of Bunda and Gerth. k.k tokenizes and parses a significant fragment of K: numbers, symbols, names, strings, verbs, adverbs, vector notation, list notation, and bracket indexing. Not supported are conditionals, control keywords, assignment, system functions and variables, commands, n: operators, and scientific notation for floating point numbers. For example:

  p n t"a-(+/'b*c)+(d*e)%f"
 (`n ((`m ((`n;,"a") (`v;,"-"))) (`n ((`m ((`n ((`v ((`v ((`v;,"+") (`a;,"/"))) (`a;,"'"))) (`n ((`m ((`n;,"b") (`v;,"*"))) (`n;,"c"))))) (`v;,"+"))) (`n ((`m ((`n ((`m ((`n;,"d") (`v;,"*"))) (`n;,"e"))) (`v;,"%"))) (`n;,"f")))))))

t = tokenize, n = nest parens, p = parse.

Notice that verbs (`v) like + take left and right noun arguments (`n) and are parsed in two steps:

 a+b      n v n
 (a+)b    m n       / n v -> m
 ((a+)b)  n         / m n -> n

Dyads (transitive verbs) are left-projected ("curried") into monads (intransitive verbs) and then applied:

  p n t"a+b-*/'c"
 (`n ((`m ((`n;,"a") (`v;,"+"))) (`n ((`m ((`n;,"b") (`v;,"-"))) (`n ((`v ((`v ((`v;,"*") (`a;,"/"))) (`a;,"'"))) (`n;,"c")))))))

  r p n t"a+b-*/'c"
 "((a+)((b-)(((*/)')c)))"

Observe also that parenthesized verbs are "nominalized" -- turned into nouns:

  p n t"(+)"
 (`n;,"+")

Not so for adverbs:

  p n t"(/)"
 /
 ^ tok

A simple function 'r' to turn the parse-tree into an executable string:

  r p n t"+/10+!20"
 "((+/)((10+)(!20)))"

  . r p n t"+/10+!20"
 390

C. M\ and Truth-functional connectives.

We have the truth-functional connectives & ("and") and | ("or").
Given p, a vector of "propositions" (truth-values)

 p:0 1 1 0 0 1

we can ask if all of them are true:

  &/p
 0

and if any of them are true:

  |/p
 1

If we did not have / we could use last and scan:

  &\p
 0 0 0 0 0 0
  *|&\p
 0

  |\p
 0 1 1 1 1 1
  *||\p
 1

The connectives can be thought of as having built into them the truth-table matrices:

 &:    a: 0 1
    b     ---
    0   | 0 0
    1   | 0 1

 |:    a: 0 1
    b     ---
    0   | 0 1
    1   | 1 1

From truth-values a and b we can calculate the truth-value of a&b by indexing the truth-matrix for & by a and b. Now we can use M\ to simulate &\ and |\:

  p:0 1 1 0 0 1
  and:(0 0;0 1)
  &\p
 0 0 0 0 0 0
  and\p
 0 0 0 0 0 0

 all:
  and\1 1 1 1
 1 1 1 1

and for a|b:

  or:(0 1;1 1)
  |\p
 0 1 1 1 1 1
  or\p
 0 1 1 1 1 1

 none:
  or\0 0 0 0
 0 0 0 0

We can simulate ~ ("not") with the truth-matrix (1 0;1 0):

  not:(1 0;1 0)
  1_ 0 not\p
 1 0 0 1 1 0

Truth-matrices can be composed to yield other truth-functions:

  p:100_draw 2
  a:{(x|y)&~x&y}\p
  b:(or&~and)\p
  a~b
 1

  or&~and
 (0 1
  1 0)

Lambda\ performance declines with complexity of the function:

  p:1000000_draw 2
  \t &\p
 5
  \t {(x|y)&~x&y}\p
 227

M\ performance is constant:

  \t and\p
 100
  \t (or&~and)\p
 65
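The M\ idea -- replacing a lambda scan with repeated indexing into a truth-matrix -- can be mimicked in Python. This is an illustration only; the names AND, OR, XOR, and mscan are mine, and mscan imitates K's seedless f\ over a vector.

```python
# Scanning a truth-matrix over a vector of truth-values, as in K's
# and\p and or\p: mat[prev][next] is the next accumulated value.
AND = [[0, 0], [0, 1]]
OR  = [[0, 1], [1, 1]]
XOR = [[0, 1], [1, 0]]   # = or & ~and, as composed in the text

def mscan(mat, p):
    """Seedless scan: out[0] = p[0]; out[i] = mat[out[i-1]][p[i]]."""
    out = [p[0]]
    for x in p[1:]:
        out.append(mat[out[-1]][x])
    return out

p = [0, 1, 1, 0, 0, 1]
print(mscan(AND, p))   # [0, 0, 0, 0, 0, 0]
print(mscan(OR,  p))   # [0, 1, 1, 1, 1, 1]
```

As with the K version, the work per element is a single two-dimensional index, independent of how complicated the truth-function would be to express as a lambda.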