memoization
===========

0. memoization

the query cache consists of a set of queries Q1 .. Qn together with a few
pieces of auxiliary information:

 Q - query
 T - base table
 D - dependencies
 P - materialized-as table

suppose we have an input query q with base table t and dependencies d:

 q: query
 t: base table
 d: dependencies

the memoization algorithm attempts to find a query Q in the cache which
satisfies the following conditions:

 T~t                            do the base tables match?
 &/D _lin d                     are all of Q's dependencies among q's?
 &/Q _lin q                     are all of Q's ops among q's?
 Q is cache-consistent with q

we could dispense with the cache-consistency condition if none of the ops in
q (and therefore Q) involved cache-sensitive functions such as g_sum. the
details of the cache-consistency function are described below.

if q matches Q, then we delete the ops common to q and Q, and re-define the
base table of the resulting query as P. schematically, where t = T and Q has
been materialized as P, r, the result of matching q and Q, has base table P:

 Q = ABC

 q = abCAcBdef
     ---------
 r = abcdef

here's a simple example. we start by defining a simple table x.t:

 1
 2
 3
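the matching conditions and the subtraction step above can be modeled with a
short sketch. this is python, purely illustrative, and not part of mdb; the
names (match_sketch, q_ops, and so on) are invented for the illustration:

```python
def match_sketch(q_ops, q_deps, Q_ops, Q_deps, P):
    """model of the cache lookup: if every dependency and every op of the
    cached query Q occurs in the input query q, rewrite q to run against
    the materialization P, with the common ops removed."""
    if not set(Q_deps) <= set(q_deps):   # &/D _lin d
        return None
    if not set(Q_ops) <= set(q_ops):     # &/Q _lin q
        return None
    # delete the ops common to q and Q, preserving q's op order
    r_ops = [op for op in q_ops if op not in set(Q_ops)]
    return (P, r_ops)                    # the new base table is P

# the schematic above: Q = ABC, q = abCAcBdef, r = abcdef
print(match_sketch(list("abCAcBdef"), [], list("ABC"), [], "x.s"))
```

note that only op membership is checked; order and contiguity are not
required, which is exactly the point of the abCAcBdef example.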
now we add the following query Q to the global query cache:

 Q =

notice that Q materializes as the table x.s. now let the input query q be:

 q =
  =
  =

i've marked the common ops of q and Q with =. notice that the ops in q which
match those in Q are neither contiguous nor in the same order. the result of
matching q and Q is r:

 r =

1. the match algorithm

the match algorithm follows the pattern shown above. let's follow the code:

 match:{[t;o;b;s;d;O;D;P]              / match input against stored query
  if[&/D _lin d                        / do dependencies match?
   if[&/O _lin o                       / if all O in o
    if[cachecons[o;O;b]s               / cache-effect constraints
     :(.dbu.H P;rmatch o _dvl O)]]]}   / (path;memoized q)

the arguments:

 t = base table of q and Q (not used in match)
 o = input query
 b = b[i] = 1 iff o[i] breaks (e.g. a willbe with a g_sum)
 s = s[i] = 1 iff o[i] selects (sel, (link/amend)-with-selects)
 d = input base table dependency information
 O = stored query
 D = stored query base table dependency information
 P = materialized table generated by O

we can walk through the lines:

 [1] if the stored query's dependencies D are not stale.
 [2] if all the ops O in the stored query are in the ops o of the input
     query.
 [3] if the cache-effects of o and O are consistent.
 [4] return (new base table P;recursive match of o - O).

we should look more closely at the cache-effect logic. the basic idea is
that a prior selection statement can affect the result of a subsequent op if
that op uses a function which is sensitive to the rows selected at that
stage of the query. for example, the following two queries compute different
results:

 1,1,1
 2,2,1
 2,1,1
 1,2,1
 1,1,1
 1,3,1
 i v w a
 - - - -
 1 1 1 2
 1 2 1 1
 1 1 1 2
 1 3 1 1

 1,1,1
 2,2,1
 2,1,1
 1,2,1
 1,1,1
 1,3,1
 i v w a
 - - - -
 1 1 1 3
 1 2 1 2
 1 1 1 3
 1 3 1 1

the cache-effect constraint logic is designed to prevent a query q from
matching one in which the order of the cache-sensitive ops would produce an
incorrect result. the cache-effect constraint function is:

 / cache-effect constraints
 cachecons:{[o;O;b;s]
  :[~|/b i:&w:o _lin O    ;1   / match if no breaks
    ~O[&O _lin p]~p:o@&b|s;0   / no match if breaks/sels out of order
    ~|/s@&~(1+*|i)#w]}         / no sels before breaks

the arguments:

 o = input query
 O = stored query
 b = b[i] = 1 iff o[i] breaks (e.g. a willbe with a g_sum)
 s = s[i] = 1 iff o[i] selects (sel, (link/amend)-with-selects)

now we can walk through the logic, where 'cs' and 'ca' abbreviate
'cache-sensitive' and 'cache-affecting' respectively, and 'co' abbreviates
'common ops':

 [1] if there are no cs ops among the co, then we affirm the match.
 [2] else if the ca and cs ops in co are out of order, then we deny the
     match.
 [3] else if there are no ca ops before any of the cs ops, then we affirm
     the match.

2. recursive matching

consider the following series of queries.

 1
 2
 3
the first query in the series materializes as x.z. the second, third, and
fourth queries each use x.z as their base table, and materialize as x.a,
x.b, and x.c. each of these queries defines a willbe: a, b, and c. in what
follows i'll name the queries after the tables they materialize: z, a, b,
and c.

now consider the following input query q:

 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]
 [10]
 [11]
 [12]

lines 1 and 2 match query a, so the result of the match is:

 [5]
 [6]
 [7]
 [8]
 [9]
 [10]
 [11]

the result of the match is:

 [8]
 [9]
 [10]

the result of the match is:

 :

unwinding the recursion, we have a query which is memoized at all levels:

 [*1]
 [3]
 [*4]
 [6]
 [*7]
 [9]
 [10]
 [11]
 [12]

i've removed the lines which have been precalculated in the materialized
tables, and prefixed * to lines which refer to those tables.

3. exhaustive matching

consider the following series of queries:

 k:q[-26;`set;`x.z]" "

 k:q[-27;`set;`x.d]" "

the second query e uses the materialization x.d of the first query d as its
base table. the materialization x.e of e therefore contains three willbes:
d, e, and f. now consider the following input query q:

initially, q matches d, resulting in the query:

which matches e, resulting in the query:

the memoization algorithm is applied repeatedly until the input query is
reduced to its most efficient form. that is, to the form where as much of
the work as possible is precomputed.

4. implementation

the memoization code lives in one script, mdb/memoization.k.

the algorithm uses two global variables, SET and GET. SET and GET are not
persistent. SET is a list of the set operations performed since the accum
was initialized. GET is a pair of lists; each corresponding pair of elements
represents an input query q and the memoization of q (or q itself, if there
was no matching query):

 ((..;input-query;..);(..;memoized-query;..))

whenever a query q is executed, mdb calls

 r:.memo.run[t;o]

where t is the base table of q and o is the list of ops of q.
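before looking at run, the repeated reduction described in sections 2 and 3
(match, rebase on the materialization, match again) can be modeled as a
loop. again this is an illustrative python sketch, not the mdb
implementation; the cache layout (a dict from base table to (ops, deps,
path) entries) is invented for the example:

```python
def memoize_sketch(base, ops, deps, cache):
    """repeatedly rewrite (base, ops) against cached queries until no
    further match is possible, so that as much of the work as possible
    is precomputed."""
    reduced = True
    while reduced:
        reduced = False
        for Q_ops, Q_deps, P in cache.get(base, []):
            if Q_ops and set(Q_ops) <= set(ops) and set(Q_deps) <= set(deps):
                ops = [op for op in ops if op not in set(Q_ops)]
                base = P            # rebase on the materialization
                reduced = True
                break               # rescan: P may itself be cached
    return base, ops

# d materializes ops A,B of x.t as x.d; e materializes ops c,d of x.d as x.e
cache = {"x.t": [(["A", "B"], [], "x.d")],
         "x.d": [(["c", "d"], [], "x.e")]}
print(memoize_sketch("x.t", ["A", "B", "c", "d", "E"], [], cache))
```

the break after a successful match restarts the scan from the new base
table, which is what makes the matching exhaustive: the materialization P
may itself be the base table of another cached query, as in section 3.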
the run function checks whether the first op is one of the following:

 memoizeset
 memoizeget
 memoizedel

both 'set' and 'del' are subject to several constraints:

 1) the last op in the query must be a 'materialize'.
 2) the query may not contain a 'transpose' op, or a 'tabu' containing a
    'cbreak', or a 'willbe', 'link', 'amend', or 'merge' containing a
    'materialize'.

5. storage model

the query cache is stored in the bag server as a keytable. the domain (key)
is a table with a single column:

 p: the path of the table which materializes the cached query

the range (value) is a table with the following columns:

 t: the base table of the query
 o: the ops of the query
 d: the dependencies of the query
 x: 1 if deleted, 0 if active

memoization i/o logic consists of the following code:

 NBS:-1                                                   / no bag server
 nbs:{if[NBS<0;NBS::*.[.bsl[`connect];($.l`default_bag_server;`bs);:]];NBS} / bag server available?
 BAG:(`querycache;`memo`SET)                              / bag server key
 GET:(();())                                              / get-cache
 SET:()                                                   / set-cache
 DEF:0                                                    / possibly not defined
 Get:{[t]Def`                                             / define bag value
  r:.bs.select[;;(((=;`t;,t);(~:;`x));();`o`d`p)]. BAG    / o,d,p where `t=t&~`x
  if[#r:+r[];r:qs r]                                      / downsort by #ops
  r}
 Upd:{[x;t;o;p;d]Def`;.bs.kset[;;(.,(`p;,p);.+(`t`o`d`x;+,(t;o;d;x)))]. BAG} / set/del
 Set:Upd[0]                                               / x = 0
 Del:Upd[1]                                               / x = 1
 Def:{if[~DEF;.bs.define[;;(.,(`p;());.+(`t`o`d`x;(();();();())))]. BAG;DEF::1]} / define query cache
 Zap:{.bs.del . BAG}                                      / zap query cache

NBS is the "no bag server" flag. DEF initializes to 0, which indicates that
the bag server key BAG is not known to exist. Def initializes the query
cache if it doesn't exist. (we make exactly one call per accum session.)

Upd takes the values of a query cache record and calls the bag server,
either replacing or appending the record, depending on whether key p exists.
Set and Del are projections of Upd.

Get takes a base table identifier t and returns a list of the fields o, d,
and p of the undeleted records in the query cache where t=`t.
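the record layout and Get's filtering can be sketched as follows. this is a
python model only; CacheRecord and get_sketch are invented names, and the
real records live in the bag server keytable described above:

```python
from dataclasses import dataclass

@dataclass
class CacheRecord:
    p: str            # path of the materialized table (the key)
    t: str            # base table of the cached query
    o: list           # ops of the cached query
    d: list           # dependencies of the cached query
    x: int = 0        # 1 if deleted, 0 if active

def get_sketch(t, records):
    """model of Get: the (o, d, p) fields of the undeleted records whose
    base table is t, downsorted by #ops (descending)."""
    live = [r for r in records if r.t == t and not r.x]
    live.sort(key=lambda r: len(r.o), reverse=True)
    return [(r.o, r.d, r.p) for r in live]
```

the downsort means the cached query with the most ops is tried first, i.e.
the candidate that would remove the most work from the input query.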
both Upd and Get initialize the bag server key at most once in each accum
session.
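finally, the cache-effect constraint logic of section 1 can also be modeled
in python. this sketch follows the prose walkthrough ([1] no cs ops among
the co, [2] order check, [3] no ca ops before the cs ops) rather than the k
code line by line; all names are illustrative:

```python
def cachecons_sketch(o, O, b, s):
    """model of the cache-effect constraints: o and O are the input and
    stored op lists, b[i] = 1 iff o[i] breaks (cache-sensitive, e.g. a
    g_sum), s[i] = 1 iff o[i] selects (cache-affecting)."""
    w = [op in O for op in o]                  # common-op mask over o
    co = [i for i, c in enumerate(w) if c]
    # [1] no cache-sensitive ops among the common ops: affirm the match
    if not any(b[i] for i in co):
        return True
    # [2] deny the match if the shared breaking/selecting ops occur in a
    # different relative order in o and O
    p = [op for i, op in enumerate(o) if b[i] or s[i]]
    if [op for op in O if op in p] != [op for op in p if op in O]:
        return False
    # [3] affirm the match only if no select on a non-common op precedes
    # the last common op
    return not any(s[i] for i in range(co[-1] + 1) if not w[i])
```

for instance, a cached g_sum preceded in the input by an extra sel denies
the match, mirroring the two differing results shown in section 1.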