It's also possible to implement regular expression support in K. Regular expression implementations tend to involve arrays and finite state machines, so K is almost an ideal language for working with regular expressions.
One approach to regular expressions involves
"deterministic" state transitions (where the state
transition corresponds to a character from some string of
characters), and "non-deterministic" state transitions
(where the state transition happens without any corresponding
character).
In this code, I initially repreent a regular expression as an
array where the first dimension represents a "state
index", the second dimension is indexed by character number,
and the third dimension represents the set of states which are
reached from that "state index" and that
"character number".
The state machine itself represents a set of states. The initial
state is 0 (which is always a "non-deterministic"
state), and the final state is the one with the highest numbered
state index.
Here's an example regular expression matching some floating point
numbers:
num: rep[re[,"0123456789"]] sign: maybe[re[,"+-"]] fp: fix seq[(sign;num;maybe[seq[(re[,"."];num)]];maybe[seq[(re[,"Ee"];sign;num)]])] match[fp; "1"] / 1 match[fp; "3.1415e0"] / 1 match[fp; "dog"] / 0
This would display most of the interesting state transitions:
fp[;256,_ic " 1-e"]
Note that this example (fp), while relatively simple to
implement, is very bulky. It's possible to use more compact
representations. For example, instead of indexing by character
index, you could identify character classes where all characters
within a class are treated identically.
And you might get rid of all states which have a
non-deterministic state transition. It's worthwhile thinking
about why this could work. The most important thing to note: the
only "non-deterministic state" I care about is state 0
-- that's where I find the initial deterministic states.
So Shrink creates an alternate regular expression data
representation which is a three element array:
[0] the list of starting states [1] index by character number to get character class [2] first dimension is indexed by character class second dimension is indexed by (new) state index third dimension is list of (new) states
And, Match uses this alternative representation.
Thus:
FP: Shrink fp Match[FP; "1"] / 1 Match[FP; "3.1415e0"] / 1 Match[FP; "dog"] / 0
It's possible to extract subexpressions, as well. I've not yet
implemented this, but here's a design sketch:
Regular expressions have associated with them a dictionary of
subexpressions (initially empty). Each subexpression also has an
offset corresponding to the terminal state for that
subexpression.
When combining regular expressions, this offset needs to be
updated.
When matching regular expressions, use Next\ instead of Next/ so
that the terminal states associated with subexpressions can be
located.
Run the corresponding subexpression backwards from the character
for that terminal state to find the corresponding begining.