It’s my weekend and I’ll try if I want to. You could try too if it happened to you. Spoiler: This takes a very weird turn. Final line: Your move, Bill!

I experimented back in XST 39 with a couple of ways to display the XPR tree, and I can’t get over how nice this one looks:

(saveAs "C" (restrict (setNamed "A") (setNamed "B")))

And when I think about substituting a tree into a tree, as I was working on yesterday, it gets hard to reason about. But in the notation above, any valid tree will always have balanced parentheses, so that it is little more than a text operation to substitute

( x ( y z w) (a b c) ((d (e f) (g h))))

Into this bit:

(saveAs "C" ^(restrict (setNamed "A") (setNamed "B"))^)

It also seems to me that the list form is pretty easy to type, which will be useful for testing and perhaps for display.

Finally, it should be pretty straightforward to interpret as well. You just recur whenever you see a left paren, expect the next thing you see to be a set operation, evaluate its arguments (recurring as needed) and then do the setop.
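
Here's a minimal sketch of that recursion, in Lua. Everything here is hypothetical: it assumes the text has already been split into tokens, and an ops table that maps names like “restrict” to functions.

local function evalAt(tokens, pos, ops)
    -- A sketch, not real code. Evaluates the list form on the
    -- fly: atoms return themselves; "(" starts a setop whose
    -- arguments we evaluate recursively until ")".
    if tokens[pos] ~= "(" then
        return tokens[pos], pos + 1        -- an atom
    end
    local opName = tokens[pos + 1]         -- the setop follows "("
    local args = {}
    pos = pos + 2
    while tokens[pos] ~= ")" do
        local value
        value, pos = evalAt(tokens, pos, ops)
        args[#args + 1] = value
    end
    return ops[opName](table.unpack(args)), pos + 1
end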

So this morning, I plan to do a little spike on this format, just to see whether it seems like a fruitful path. And, since it looks like LISP, I’m certainly going to research LISP interpreters.

Let’s go, girls!

LISP Eval

I’m doing some reading, and will put links to the things I scan here.

Unless I say otherwise below, I’m spending ten minutes or less reading the articles linked, just scanning to get ideas. This might take longer for some readers: I do have a lot of background that not everyone has.

https://courses.cs.northwestern.edu/325/readings/interpreter.php

This describes the three phases in the LISP REPL, the Reader, the Evaluator, and the Printer.

Given a quick read, it sets an expectation that one might turn our text list into an expression list of tokens such as atomic strings, atomic numbers, symbols (like “restrict”), and lists of same. Then the evaluator munches through that expression list.
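
A hypothetical sketch of such a Reader in Lua, assuming the text is already split into tokens (tokenizing comes up again below):

local function readFrom(tokens, pos)
    -- A sketch, not real code: build nested Lua tables from a
    -- flat token list, recurring at each "(".
    pos = pos or 1
    if tokens[pos] == "(" then
        local list = {}
        pos = pos + 1
        while tokens[pos] ~= ")" do
            local expr
            expr, pos = readFrom(tokens, pos)
            list[#list + 1] = expr
        end
        return list, pos + 1               -- step past the ")"
    else
        return tokens[pos], pos + 1        -- an atom
    end
end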

The evaluator then goes through the list, determines whether it is looking at a function name or an atom (or a number of other things we don’t want to understand yet), and executes the expression accordingly.

The print bit prints in a pretty obvious way.
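
Obvious enough that a hypothetical sketch fits in a few lines:

local function printForm(expr)
    -- A sketch, not real code: turn a nested list back into
    -- the parenthesized text form.
    if type(expr) ~= "table" then
        return tostring(expr)
    end
    local parts = {}
    for _, e in ipairs(expr) do
        parts[#parts + 1] = printForm(e)
    end
    return "(" .. table.concat(parts, " ") .. ")"
end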

This is interesting but I want to look at other angles.

http://norvig.com/lispy.html

This is an interesting article showing how to implement a small “Lispy” calculator and then a larger “Lispy” language. The whole LISP-like “larger” version is 117 non-blank lines of Python.

Thinking

Both these articles “tokenize” the input list before processing, doing a bit of lexical analysis to identify strings, numbers, functions, and so on.
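
A hypothetical sketch of that lexical pass, in the Norvig style of padding the parens and splitting on whitespace (quoted strings containing spaces would need more care than this):

local function tokenize(text)
    -- A sketch, not real code.
    local spaced = text:gsub("%(", " ( "):gsub("%)", " ) ")
    local tokens = {}
    for t in spaced:gmatch("%S+") do
        tokens[#tokens + 1] = t
    end
    return tokens
end

local function atom(token)
    -- Classify a token: number, quoted string, or symbol.
    if tonumber(token) then
        return tonumber(token)
    elseif token:match('^".*"$') then
        return token:sub(2, -2)            -- strip the quotes
    else
        return token                       -- a symbol like restrict
    end
end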

The “real” version needs to keep track of “environments”, which are essentially nested symbol tables, analogous to the levels of local variables we might see in a more procedural language. I don’t think I’ll need that, certainly not this morning. And since all my operations are predefined, I don’t need much more than the “Lispy Calculator”, certainly not this morning.
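
For the record, nested symbol tables fall out almost for free in Lua, should we ever want them. A hypothetical sketch:

local function newEnv(outer)
    -- A sketch, not real code: lookups that miss in this
    -- environment fall back to the enclosing one.
    return setmetatable({}, {__index = outer})
end

local globalEnv = newEnv(nil)
globalEnv.A = "the set A"
local localEnv = newEnv(globalEnv)
print(localEnv.A)  --> the set A, found in the outer environment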

I could drill down into the LISP ideas to any desired depth. It’s a semester-long course, or a lifetime’s work. But we don’t need much. In fact, I’m even wondering if I need tokenizing, or whether I can do the whole thing on the fly.

It’s time to choose something to do.

A Small Story

If this is to be a useful idea, not just a fun idea, we need to be able to modify and execute an expression like this:

(saveAs "C" (restrict (setNamed "A") (setNamed "B")))

We want to modify the expression because our larger mission is to replace a slow expression with a faster one. It seems to me–and I could be wrong here–that recognizing a pattern in the string, and plugging in another pattern, is likely to be “easy”.
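
In Lua, at least, there is some support for that hope: the %b() pattern item matches a balanced pair of parentheses. Here’s a hypothetical sketch of a purely textual rewrite; the semantics of this particular rewrite are nonsense, but the mechanics are the point:

local expr = '(saveAs "C" (restrict (setNamed "A") (setNamed "B")))'
local rewritten = expr:gsub(
    '%(restrict%s+(%b())%s+(%b())%)',
    '(scopeRestrict %1 %2)')
-- rewritten is now:
-- (saveAs "C" (scopeRestrict (setNamed "A") (setNamed "B")))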

We want to execute the expression because that’s what our product does: it executes a series of set operations to compute the answer the user is requesting.

So if this morning’s exercise is worthwhile, it should lead to something easier to input, easier to modify, and easier to process than what we have now, which is a hand-created tree:

        _:test("restrict", function()
            Symbols:save("A",XSet("A"))
            Symbols:save("B",XSet("B"))
            local receiver = XPR{op="setNamed", arg1="A"}
            local restrictor = XPR{op="setNamed", arg1="B"}
            local operation = XPR{op="restrict", arg1=receiver, arg2=restrictor}
            local save = XPR{op="saveAs",arg1="C", arg2=operation}
            save:execute()
            local C = Symbols:at("C")
            _:expect(C.name).is("restrict A with B")
        end)

In the LISP-like notation, a similar expression (this one a selectValue rather than a restrict) would be

( saveAs "C" ( selectValue (setNamed "A") "state" "MI" ) )

If nothing else that’s easier to type, although one does kind of wind up needing to work from the inside out.

We already have an interpreter, which is seriously simple. We just tell the top of our tree to execute:

function XPR:execute()
    if self[self.op] then
        return self[self.op](self,self.arg1,self.arg2)
    else
        error("no operation named "..self.op)
    end
end

That selfish bit in the first return fetches the op and calls it, passing the node itself (self) and the arguments. The ops recursively execute their sub-parts:

function XPR:restrict()
    local receiver = self.arg1:execute()
    local restrictor = self.arg2:execute()
    return receiver:restrict(restrictor)
end

function XPR:saveAs(name,xpr)
    local result = xpr:execute()
    Symbols:save(name, result)
end

function XPR:setNamed(name)
    return Symbols:at(name)
end

So in our example above, we encounter the saveAs which expects a string as its first arg, but executes its second arg, which is the restrict.

The restrict executes both its arguments, saving the results, and then calls the real restrict operator on the resulting XSet, and returns that result.

The two arguments are both setNamed, which returns the actual XSet of that name, so the restrict gets sets to process.

Finally the restrict returns a result (an XSet) and the saveAs tucks it away in the Symbols.

That’s pretty clean, and while we could do the exact same thing in the LISP format, it may not buy us much. But we have substitution to worry about.

Our incomplete first experiment with substitution is this:

        _:test("substitute", function()
            local receiver = XPR{op="setNamed", arg1="A"}
            local fieldName = XPR{op="string", arg1="state"}
            local fieldValue = XPR{op="string", arg1="MI"}
            local operation = XPR{op="selectValue", arg1 = receiver, arg2=fieldName, arg3=fieldValue}
            local save = XPR{op="saveAs",arg1="C",arg2=operation}
            
            local match = XPR{op="selectValue", arg1=1, arg2=2}
            local replacement = XPR{op="scopeRestrict", arg1=1, arg2="XXX"}
            
            replace(save,match,replacement) -- then a miracle occurs
            
            local replaced = save.arg2
            _:expect(replaced.op).is("scopeRestrict")
            _:expect(replaced.arg1).is(receiver)
            _:expect(replaced.arg2).is("XXX")
        end)

To make this work, I have started out by cheating:

function replace(tree, match, replacement)
    -- we're on `save` and arg2 needs to be replaced. 
    -- we just "know this".
    local to_replace = tree.arg2
    local replacement = XPR{op=replacement.op, arg1=replacement.arg1, arg2=replacement.arg2}
    local args = to_replace:get_args()
    local reps = replacement:get_args()
    for i, rep in ipairs(reps) do
        if type(rep)=="number" then
            reps[i] = args[rep]
        end
    end
    replacement:put_args(reps)
    -- replacement is ready
    tree.arg2=replacement
end

Rather than search the input tree to find the match pattern, so far the code “just knows” where the pattern occurs. Some only-somewhat-ad-hoc code copies the arguments across into the replacement and plugs in the new operation.
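
The helpers get_args and put_args aren’t shown here. Presumably they are something along these lines (a guess, assuming XPR keeps up to three positional arguments):

function XPR:get_args()
    -- A guess, not the real code.
    return {self.arg1, self.arg2, self.arg3}
end

function XPR:put_args(args)
    self.arg1, self.arg2, self.arg3 = args[1], args[2], args[3]
end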

In the current test, the replacement is a single set operation. To really do the job, it needs to be a subtree of multiple operations. At the top it will be a single starting setop, but under it can be any number of subtree bits. Some of those, down there somewhere, will need to be connected back to the original arguments of the matched operation.
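
When the code stops cheating, finding the match will presumably be a recursive search. A hypothetical sketch:

local function findMatch(tree, match)
    -- A sketch, not real code: return the first node whose op
    -- matches the pattern's op, searching depth-first.
    if type(tree) ~= "table" then return nil end
    if tree.op == match.op then return tree end
    for _, name in ipairs{"arg1", "arg2", "arg3"} do
        local found = findMatch(tree[name], match)
        if found then return found end
    end
    return nil
end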

Is a textual solution really easier?

So one key question here, with respect to the LISP idea, is whether the substitution is really easier in the LISP format. At first glance, it seems like “Just paste in whatever you have in between the matching parens of the starting operation”. But that’s not the case. We have to plug the parameters of the original into the right places in the substituted expression.

A light begins to dawn.

I begin to see something. Every substitution we make will be something we designed. We’re not going to build an artificially intelligent system that can do set theory and optimize expressions. We’re building a system where we, the humans, can say things like:

So, with a selectValue searching for a single scope and value, if there’s an index on that scope, it’s faster to return the index element for the value and use that in a scopeRestrict to pull out the records.

Then we devise the expression that does that. That new expression can only access information accessible from the starting expression’s parameters. In the selectValue case, that’s the set name, the field name, and the field value.

Suppose we were to create a Lua function, taking those three parameters, that performed the new expression’s set operations and returned its result. Something like:

function optimal(inputSet, fieldName, fieldValue)
    local infoSet = inputSet:getInfo()
    local indexes = infoSet:elementProject("Indexes")
    local ourIndexes = indexes:elementProject(fieldName)
    local index = ourIndexes:elementProject(fieldValue)
    return inputSet:scopeRestrict(index)
end

We have to hand-craft the optimizations anyway. If we save them as functions like this, we no longer have to do the crazy wiring I’m currently doing in substitute: the wiring is done via the parameter list.

But wait … the function optimal can be called this way:

optimal(aSet, "state", "MI")

But it can be called this way as well:

aSet:optimal("state","MI")

The function optimal is a perfectly good set operation. It calls a lot of other set operations, but that’s fine.
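
The colon form works because aSet:optimal(...) is just Lua sugar for passing aSet as the hidden first argument. All we’d need, hypothetically, is to install the function on XSet:

-- Hypothetical: make optimal available as a method, so that
-- aSet:optimal("state", "MI") means XSet.optimal(aSet, "state", "MI").
XSet.optimal = optimal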

But this looks eerily familiar. There’s this code:

function XSet:selectFieldValue(scope,value, index)
    local ix = self:findSelectIndex(scope,value,index)
    if ix then
        return self:scopeRestrict(ix)
    else
        return self:select(function(record, recordScope)
            return record:hasAt(value,scope)
        end)
    end
end

function XSet:findSelectIndex(scope,value,index)
    if not index then return nil end
    local path = XSet:tuple{"INDEXES",scope,value}
    return index:elementProjectPath(path)
end

function XSet:elementProjectPath(aTuple)
    local result = self
    for pathElement,_scope in aTuple:elements() do
        result = result:elementProject(pathElement)
        if not result then return nil end
    end
    return result
end

Except for notation, an unwound loop, and probably some typos, the optimal function that I sketched does exactly what the existing optimization for selectFieldValue does. The one difference: the existing code also checks whether the optimization applies, which I’ve not yet done in the XPR spike nor addressed in this LISPy contemplation.

Where The Heck Are We?

I started down the rewrite path at the instigation of Bill Wake, whose questions about trees and rewriting got me thinking that it might be worth trying. For that to make sense, we need to posit some kind of high-level input language that gets translated into a series of set operations. It makes sense that that series would be a tree, no matter how it’s stored, because it’s likely to be an expression of operations producing sets that are operated on again and again until finally an answer comes out.

Then I imagined a “marketing requirement” that the system should be able to optimize the tree. That drove me down the path of the XPR spike and this morning’s exploration of LISP as a possible path to what we need.

However. The only example I’ve thought of so far is already optimized at the lower level, the selectFieldValue operation, which devolves into a scopeRestrict in the presence of suitable indexes. Frankly, doing that at the lower level is better because it’s more direct.

So a tree-rewriting optimizer needs a problem to solve that is larger than a single operation. Any time we see a single operation that can be improved by use of an alternate strategy, even if that strategy itself is many operations, we’ll be well off to implement that optimization right in the set op, not as a higher level tree manipulation.

So the XPR spike and the LISP consideration this morning have provided an answer, but not the answer I was anticipating.

The Spike Says

  • A tree of set operations is quite possible;
  • The tree is awkward to produce by hand;
  • LISP notation might be worth a bit more exploration;
  • Translation from some HLL1 to trees is probably not awful;
  • Single-operation optimization is better done directly in the set operations;
  • Single-operation substitution in the tree isn’t terribly difficult;
  • Multiple-operation optimization probably belongs at the tree level;
  • Doing substitutions with single “lambda” functions rather than extensive tree editing may be fruitful;
  • Examples of optimizations over multiple operations: NONE;
  • Pattern recognition of multiple-operation opportunities: UNKNOWN;
  • Multi-operation optimization difficulty: UNKNOWN.

We’ve learned a lot in a few hours of coding and thinking.

We’ve identified a design “rule”: optimization of single operations is probably always best done in the set operations themselves.

Correspondingly we’ve observed that, since set operations execute immediately, optimizations spanning a sequence of operations probably must be done at a higher level than the individual set operations we now have.

However, we have come up with no examples of optimizations that are more suitable at the HLL/tree level. Accordingly, we don’t have a good sense of what the patterns would be that need to be recognized, nor what substitutions would need to be made if we found a pattern.

(We can guess that if the pattern is identified in disjoint subtrees, the substitutions will occur in one or more of those subtrees. This suggests that both the matching and substituting can be made up of collections of matches … perhaps linked by AND and OR.)

Blue Sky

What might an optimization of this kind be? Let’s try to come up with an example.

Suppose that at some point some user has collected all the records where state==MI. And suppose that later another user wants a county-wide summary of sales for Michigan.

One user might have said:

MI_records = orders:selectFieldValue("state","MI")

Later on, the other user might have said:

County_info = orders:selectFieldValue("state","MI"):stats{sum="order_total",group="county"}

A sufficiently clever optimizer might translate that into

County_info = MI_records:stats{sum="order_total",group="county"}

I think that’s a reach, but it’s certainly possible to record past expressions whose answers we have, and to look in that record to see whether there’s something to be used. But this example still comes down to a single substitution, this time of a set for (orders:selectFieldValue("state","MI")).
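
A hypothetical sketch of such a record, keyed by the textual form of the expression (all the names here are invented):

local PastResults = {}

local function rememberResult(exprText, resultSet)
    -- A sketch, not real code: save each computed result
    -- under the text of the expression that produced it.
    PastResults[exprText] = resultSet
end

local function lookupPast(exprText)
    return PastResults[exprText]   -- nil if never computed
end

-- If lookupPast('orders:selectFieldValue("state","MI")') finds
-- MI_records, the optimizer can substitute that set directly.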

Spike-Based Decisions

We’ve learned things from the Spike. How do we turn that learning into decisions?

I want to revisit the “marketing decision” that says we need to demonstrate tree editing. I claim we can already demonstrate the system making optimization decisions dynamically based on the existence of things like indexes.

And I want to say that we’ll be happy to demonstrate that the decision is dynamic, by creating a set, processing it at some high cost, then indexing it, and processing it again at much lower cost.

So I claim that we’re already on the right road for the marketing promise at the upcoming Festival of Set Theory. I want to say that right now, we don’t even know of any interesting optimizations that require tree rewriting, and that while that’s a fascinating possibility, we have no need of it, at least not yet.

And I want to toss out a challenge to Bill Wake and other would-be challengers, to come up with an optimization that does require a tree kind of optimization. I’d honestly love to have one in hand, because without a test, I don’t really know how to program it.

Your move, Bill!


P.S. Does today seem like a big waste of time? I want to suggest that it wasn’t, that in fact it’s a major savings of time, because it has let us see that the vague ideas we have about trees and LISPs are not ready for prime time, and we shouldn’t invest in them until we have better questions and answers.

P.P.S. That’s why thinking is such a good idea.


  1. High-Level Language.