Query Memoization in 1010
=========================
0. Introduction.
The present work on memoizing queries stems from earlier work on query normalization.
The leading idea of normalization is this: if Q and R are queries such that &/Q _lin R and &/R _lin Q (i.e. Q and R are lists of identical ops, but possibly not in the same order) and the result of Q = the result of R, then there is a query N, the normal form of Q and R, which reorders the ops, inserting colord ops where necessary, such that the result of N = the result of Q (and hence R).
Section 1 describes the memoization problem in detail.
Section 2 describes the original version of the algorithm.
Section 3 explains the problematic nature of attempting to memoize normalized queries.
Section 4 works through a few examples.
Section 5 describes the revised version of the algorithm.
Section 6 describes the algorithms used in the 'set' and 'get' functions.
1. Memoization
The memoization problem is simply this: Given a query q, is there a stored query which matches the head of q? The head of q is not necessarily the first n ops in the query as given. Rather, it is the longest contiguous subset of ops in the query which can be evaluated before the rest of the query without altering the result. Thus, if Q = abcDEFghi and R = DEFabcghi, and R has the same result as Q, and DEF is the longest contiguous subset of ops in Q of which that can be said, then DEF is the head of Q. (This description has to be amended to take account of two complications: queries containing links and/or merges with ops ("nested queries"), and queries containing base ops. The algorithm described in the next section deals with both possibilities.)
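The head notion can be sketched as follows. The commutes(a, b) predicate is a stand-in for the real op-commutativity rules; given it, a contiguous run of ops is a candidate head exactly when it commutes with everything preceding it:

```python
def hoistable(ops, i, j, commutes):
    """True iff the contiguous run ops[i:j] can be evaluated before
    the ops preceding it without altering the result, i.e. every op
    in the run commutes with every op in ops[:i].  A stored query
    matching such a run can serve as the head of the query."""
    return all(commutes(a, b) for a in ops[i:j] for b in ops[:i])
```

With Q = abcDEFghi and a predicate under which the uppercase ops commute with the lowercase ones, DEF is hoistable but ghi is not.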
The intuition, or leading idea, is that we can replace an independent subset of the ops in a query with a reference to the table which that subset computes. In the limit case, the entire query can be replaced with a reference to the table.
The "use case" which motivates the work is this one. Q is:
base table = T
select store = a
X
We teach our users to do an initial selection to cut down the number of rows. Then X has less work to do. But instead, suppose we compute X for all stores, and stick the resulting table in U. This could be very expensive, but we suppose that it's done overnight for data which doesn't change all that often. Then, when the user submits the query Q, we recognize that X matches a stored query (viz. the one which computes U) and replace Q with R:
base table = U
select store = a
That should be very quick, since we've precomputed all the expensive stuff.
2. The Algorithm (original version).
There are two entry-points: 'set' and 'get'. These are called from the Cover function, thus:
.memoize.set[table]ops
q:.memoize.get[table]ops
table:*q;ops:q 1
The 'set' function does the following: if the last op in the query (*|ops) is a materialize with memoize = 1, then the following information is added to the bag server:
key = table
value = (table;-1_ ops;deps;path)
where deps contains timestamps for every table-reference in the query and path is the location of the table written by materialize.
The 'get' function takes the table and the ops of the input query and searches in the bag for that table for the longest stored query which matches a contiguous subset of the input query. If such a stored query exists, it is treated as the head of the transformed query, and the 'get' function returns ops minus the head and the path from the matched query in the bag server.
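A minimal in-memory stand-in may make the set/get contract concrete. The names, the omission of deps checking, and the prefix-only matching are simplifying assumptions; the real 'get' matches the head of the query, not merely a prefix:

```python
# Hypothetical in-memory stand-in for the bag server.
BAG = {}   # table -> list of (ops, path) entries

def bag_set(table, ops, path):
    """Record that evaluating `ops` on `table` produced the table at `path`."""
    BAG.setdefault(table, []).append((tuple(ops), path))

def bag_get(table, ops):
    """Return (path, remaining_ops) for the longest stored query whose
    ops are a prefix of `ops` (toy simplification of head matching),
    else (None, ops)."""
    best = None
    for stored, path in BAG.get(table, []):
        if tuple(ops[:len(stored)]) == stored:
            if best is None or len(stored) > len(best[0]):
                best = (stored, path)
    if best is None:
        return None, list(ops)
    stored, path = best
    return path, list(ops[len(stored):])
```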
There is, of course, one catch.
Suppose the stored query Q contains a sticky op, e.g.
1994\"/>
Then Q cannot be the head of any input query which has a cache-affecting operation which occurs before Q, e.g.
50\"/>
1994\"/>
3. Memoizing without normalization.
One advantage of memoizing without normalization is this: normalization requires knowledge of the column output of every op in the query, which forces us to exclude queries containing transpose ops and tabus with column-breaks. Without normalization, the only queries which cannot be memoized are those containing materialize ops, or willbe, link, or merge ops with materialize = 1. And that limitation makes perfect sense, since a memoized query will not re-execute the code of its head.
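The memoization limits just stated can be expressed as a small predicate (toy op representation, assumed here, of (name, attrs) pairs):

```python
MATERIALIZING = {"willbe", "link", "merge"}

def memoizable(ops):
    """A query qualifies for memoization unless it contains a
    materialize op, or a willbe/link/merge op with materialize = 1."""
    for name, attrs in ops:
        if name == "materialize":
            return False
        if name in MATERIALIZING and attrs.get("materialize") == 1:
            return False
    return True
```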
There is in fact a serious disadvantage to trying to memoize with normalized queries, apart from the complexity and fragility of the process (fragile because the normalization code must be revisited whenever new ops and/or attributes are added).[1] Here is an example where matching with normalization would fail although it should succeed. Suppose the stored query Q starts with a willbe on which nothing subsequently depends (i.e. it isn't referenced in a subsequent op). Let X be the remaining ops in the stored query:
4. Examples.
"
query[h;ENV.TAB`autosales;qry;_n]
qry:""
query[h;ENV.TAB`autosales;qry;_n]
qry:"
"
query[h;ENV.TAB`auto3;qry;_n]
qry:"
"
query[h;ENV.TAB`auto2;qry;_n]
qry:"
"
query[h;ENV.TAB`autosales;qry;_n]
We then have the following dependency tree:
xyz
    auto3
        autosales
ott
    auto2
        autosales
abc
    autosales
(We'll use this structure to test the dependency logic in the match function.)
Next, we define a query whose sub-queries reference the stored queries:
qry:"
abc
abc
abc
1993\"/>
xyz
xyz
xyz
ott
ott
ott
"
query[h;ENV.TAB`autosales;qry;_n]
This query is recursively memoized (internal form, abbreviated) to:
(`.db.mdb_regression_tester.mdb_regression_test_folder.abc
((`sel
:
(`link
.((`table2;`.db.mdb_regression_tester.mdb_regression_test_folder.ott;)
:
(`ops
((`sel
:
(`link
.((`table2;`.db.mdb_regression_tester.mdb_regression_test_folder.xyz;)
:
(`ops
,(`sel
:
(`willbe
.((`name;`abc;)
:
That is, all the willbes have been precalculated in unselected state and stored in abc, ott, and xyz. Only the three selections and the terminal willbe are retained from the original query.
The 'get' function will also recursively check for base operations and replace the prevailing base table (or, in the case of links and merges, the value of the table2 attribute). All ops up to and including the last occurrence of base are then discarded.
Cover00 has been modified to contain the following lines, following the call to ProcessMacroMacros:
.mm.set[table]ops
q:.mm.get[table]ops
if[~ops~q 1;
 .opstack,:.((`type;`memoize);(`pre;(table;ops));(`post;q));
 table:*q;ops:q 1]
5. The Algorithm (revised version).
The version of memoization described above tries to match a stored query to the longest contiguous subset of ops in the input query which can be evaluated before the rest of the query without altering the result. Moreover, the first version will not memoize queries containing sticky functions (g_xyz).
The current version of the algorithm relaxes both restrictions.
Suppose the stored query Q has been evaluated on source table t and the result table stored as u:
Q: A B C D E
and suppose the input query, also to be evaluated on t, is
q: x y A z B C u v w D E m n o
then Q matches q if the result of evaluating:
A B C D E x y z u v w m n o
is identical to the result of evaluating q. In that case, the query
r: x y z u v w m n o
is evaluated on u.
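The relaxed matching can be sketched as subsequence extraction. This Python toy handles only the structural part; verifying that hoisting the matched ops to the front preserves the result is a separate check, not shown here:

```python
def subseq_match(stored, ops):
    """If the stored query's ops appear, in order, as a subsequence of
    `ops`, return the residual query (ops minus the matched ones, with
    their relative order preserved); otherwise return None."""
    residual, it = [], iter(stored)
    want = next(it, None)
    for op in ops:
        if op == want:
            want = next(it, None)      # matched; advance to next stored op
        else:
            residual.append(op)        # unmatched; keep in the residual
    return None if want is not None else residual
```

On the example above, matching Q = A B C D E against q = x y A z B C u v w D E m n o leaves the residual r = x y z u v w m n o, to be evaluated on u.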
Regarding the second restriction, suppose the stored query Q (stored in u) is:
and suppose the input query q on t is:
We say that Q matches the second op in q because the evaluation of q matches the evaluation of:
so the query
can be evaluated on u.
This is so because g_sum on break-column city is indifferent to whether the selection occurs before or after the g_sum.
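The indifference can be illustrated with a pure-Python stand-in (not 1010data code; the row representation and helper names are assumptions). Selecting on the break column keeps or drops whole groups, so the surviving rows carry the same group totals either way:

```python
def g_sum(rows, brk, val):
    """Append to each row the sum of `val` over its `brk` group
    (pure-Python stand-in for the sticky g_sum op)."""
    totals = {}
    for r in rows:
        totals[r[brk]] = totals.get(r[brk], 0) + r[val]
    return [dict(r, total=totals[r[brk]]) for r in rows]

def sel(rows, col, v):
    """Stand-in for a selection op."""
    return [r for r in rows if r[col] == v]
```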
Now consider the following example:
Table t has columns state and city, and no city belongs to more than one state. The stored query Q is as above, and the input query q is:
The canselect attribute lists those columns whose extensions are supersets of the extensions of the break columns. We can therefore know that the evaluation of q will match the evaluation of:
and therefore that we may replace q on t with r on u:
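The canselect case can be illustrated the same way (hypothetical data; each city lies in exactly one state). Selecting on state keeps or drops whole city groups, so it commutes with a g_sum on break column city, whereas selecting on a column like sales, which can split a group, does not:

```python
# Hypothetical data: state determines city, so state is "canselect"
# for break column city.
rows = [
    {"state": "NY", "city": "nyc", "sales": 1},
    {"state": "NY", "city": "nyc", "sales": 3},
    {"state": "NY", "city": "alb", "sales": 2},
    {"state": "MA", "city": "bos", "sales": 5},
]

def g_sum(rows, brk, val):
    """Append each row's group total over break column `brk`
    (stand-in for the sticky g_sum op)."""
    totals = {}
    for r in rows:
        totals[r[brk]] = totals.get(r[brk], 0) + r[val]
    return [dict(r, total=totals[r[brk]]) for r in rows]

def sel(rows, col, v):
    """Stand-in for a selection op."""
    return [r for r in rows if r[col] == v]
```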
NB:
The original version of the algorithm guarantees that the column order of the result query matches the column order of the input query. The current version does not. For example, suppose the input query q is:
a B
where a and B are willbes with names x and y, and Q is:
B
then the resulting query r will consist of:
a
The column order of q will be x y, and that of r will be y x.
We have four options in dealing with this:
1. trace the resulting columns for each op in q, then append a colord to the resulting query r.
2. disqualify input queries which do not terminate with a colord.
3. disqualify input queries which do not contain the meta memoize = 1.
4. do nothing, but document the behavior.
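Option 1 can be sketched as follows, in a toy representation (assumed, not the production one) in which only willbe ops affect the column list:

```python
def out_columns(base_cols, ops):
    """Trace the output column order through a toy op list in which a
    ('willbe', name) op appends one column."""
    cols = list(base_cols)
    for op in ops:
        if op[0] == "willbe" and op[1] not in cols:
            cols.append(op[1])
    return cols

def fix_column_order(q_cols, r_ops, r_base_cols):
    """Option 1: append a colord op to the result query r whenever its
    column order differs from that of the input query q."""
    r_cols = out_columns(r_base_cols, r_ops)
    if r_cols != q_cols:
        r_ops = r_ops + [("colord", tuple(q_cols))]
    return r_ops
```

In the example above, q yields columns x y while r on u yields y x, so a colord restoring x y is appended to r.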
6. Analytical Commentary.
The 'set' function takes two arguments: a table t and a list of ops o.
set:{[t;o]
If t is not a temporary table ..
if[~tt t
and if there are ops ..
if[#o
and if there are no block-language elements ..
if[~bm o
then let q = ops following the last 'base' op and let t = the base table in that op:
q:rebase[t]o;t:*q;o:q 1
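The rebase step can be sketched in Python (toy op representation of (name, attrs) pairs; the real function is the K rebase above):

```python
def rebase(t, ops):
    """Drop everything up to and including the last 'base' op; the
    prevailing base table becomes that op's table.  With no 'base'
    op present, t and ops are returned unchanged."""
    last = None
    for i, (name, attrs) in enumerate(ops):
        if name == "base":
            last = i
    if last is None:
        return t, list(ops)
    return ops[last][1]["table"], list(ops[last + 1:])
```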
The 'type' function returns -1, 0, or i>0: -1 means "doesn't qualify for memoization", 0 means "do not memoize", and i>0 means "the (i-1)th op is a materialize with memoize = 1".
if[0
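The 'type' semantics just described can be sketched as follows (toy op representation; the qualification test is a placeholder for the real check):

```python
def op_type(ops, qualifies=lambda ops: True):
    """Toy version of 'type': -1 if the query does not qualify for
    memoization (`qualifies` stands in for the real test), 0 if no op
    is a materialize with memoize = 1, and i > 0 where op i-1 is such
    a materialize."""
    if not qualifies(ops):
        return -1
    for i, (name, attrs) in enumerate(ops):
        if name == "materialize" and attrs.get("memoize") == 1:
            return i + 1
    return 0
```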