XST 22: Moving Toward Indexing

Step by step, inch by inch, slowly we turn long searches into more direct accesses.

I’ve been reading some of Dave Childs’ more recent articles on xegesis.org. He has some material that I’ve not seen on functions and processes. So far I’ve not figured it out, which means you are spared–for now. It does look interesting, if I can figure out what it means.

Meanwhile, back on Planet Earth, we’re working toward having a kind of index set, and creating set operations that can discover the existence of indexes and use them, in a more or less credible kind of sort of set-theoretic fashion.

As is often the case for me, and I think it’s actually good to work this way, I’m creating the simplest bare-bones solution I can manage, and then refining it into something more robust and general. To me, this is a good way to go, because the fewer balls we have to juggle at a time, the easier it is. If we can get it down to one or two, we can even hold on to them, avoiding most of those embarrassing moments when they all fall down.

Many More Much Smaller Steps, as GeePaw Hill puts it.

At this point, if my memory and reading of the tests is accurate, we have some useful components for use in indexing.

We have X:scopeSet(), which returns a set containing all the scopes of the elements in a given set. Given

{ a⁵, b^“hi”, c⁵ }

scopeSet returns

{ 5, “hi” }

That’s useful because it is one way of getting the indexes of some records.

We have A:restrictToScope(B), which returns the scopes of all the records that would be found by A:restrict(B). This result set is also an “index” into A.

And we have A:scopeRestrict(B), which returns all the records from A whose scopes are in B. This amounts to an indexed retrieval, where scopes are “record numbers”.

Finally, we have the set type Tuple which is an array of elements, indexable by the integers from 1 to n. This set returns an indexed element without a search.

Now the fact is, the search for an indexed element even in our ordinary sets amounts to no more than a couple of hash probes and a search of a tiny array, so I consider the Tuple more of a proof of concept than anything really useful for an in-memory system like we’re building here. But it should be easy to see how Tuple could be generalized to get a page number from a desired scope, and a record position inside that page, so as to fetch just the record one wanted.

Let me mention here that I hate the two names restrictToScope and scopeRestrict. They are too similar, making it hard to remember which one is which. But possibly, at some future time, we wouldn’t need to remember. We have a test that looks like this:

...
            local miFolksScopesHardWay = S:restrict(restrictor):scopeSet()
            local miFolksScopes = S:restrictToScope(restrictor)
            _:expect(miFolksScopes:card(),"scopes").is(14)
            _:expect(miFolksScopesHardWay:card(),"hard").is(14)
            _:expect(miFolksScopesHardWay:equals(miFolksScopes),"sets equal").is(true)

It is the case that A:restrict(B):scopeSet() is equal to A:restrictToScope(B). So a sufficiently clever system would observe that we were going to do the former and execute the latter. If I become more clever myself, we might manage to implement that.

Digression

I just got a small idea as to how to do that.

Notice the syntax used in CodeaUnit expect statements. You can say things like:

_:expect(X:card()).is(14)
_:expect(myTable).has("hello")
_:expect(f).throws("bad thing happened")

We might be able to look into how CodeaUnit manages to cascade things, and do something similar. Or, we could “cover” all the operators, so that

A:restrict(B):scopeSet(),

rather than executing the operation, created a structure of some kind, to be interpreted later.

Those may be the same idea: I don’t know. I haven’t worked on it yet.

End Digression

Next Indexing Steps

Literally days ago, we implemented the XSet subclass XExpression. XExpression contains an XSet, and that XSet looks like this:

{ {actual-data}^DATA }

That is, the actual data for the set (which will itself be some kind of XSet, will be found under the scope DATA, inside the XSet that the XExpression contains.

I’ll wait while you unwind that recursion.

OK. Now the plan, such as it is, is that there can be one or more other sets inside there, perhaps a set under the scope of INDEXES, or under INFO, where the system keeps sets that are, well, indexes or other useful info about the set represented.

The XExpression class includes this method:

function XExpression:findDataSet(aSet)
    local data = nil
    for e,s in aSet:elements() do
        if s=="DATA" then
            if data == nil then
                data = e
            else
                error("more than one DATA")
            end
        end
    end
    if data and data:is_a(XSet) then
        return data
    else
        error("no DATA")
    end
end

It seems to me that we now have a better way of doing that, because we have the set operation elementProject, which, given a scope, returns the unique element under that scope, or nil. Let’s use it. We probably have to accept simpler messages, but I think I’d prefer the more compact code that we’ll likely get.

function XExpression:findDataSet(aSet)
    local data = self:elementProject("DATA")
    if data and data:is_a(XSet) then
        return data
    else
        error("no DATA")
    end
end

I expect to get one test failure here. Let me change the message:

function XExpression:findDataSet(aSet)
    local data = self:elementProject("DATA")
    if data and data:is_a(XSet) then
        return data
    else
        error("XEpression must contain unique DATA scope")
    end
end

Now at least two tests should break.

I do get the two that I expected, but I also get this:

3: Expression Set -- XData:92: stack overflow

I am not as fond of that as I might be. Oh. Should have said:

function XExpression:findDataSet(aSet)
    local data = self.xset:elementProject("DATA")
    if data and data:is_a(XSet) then
        return data
    else
        error("XEpression must contain unique DATA scope")
    end
end

Now I have what I expected:

4: Expression Set with no DATA throws  -- Actual: function: 0x28dd4f0c0, Expected: no DATA
5: Expression Set with too many DATA throws  -- Actual: function: 0x28dd4e490, Expected: more than one DATA

Both should now check for the more generic message.

        _:test("Expression Set with no DATA throws", function()
            local data = XSet():addAt("foo","bar")
            local notSmart = XExpression(data)
            local P = XSet:on(notSmart)
            _:expect(function() P:card() end).throws("XEpression must contain unique DATA scope")
        end)
        
        _:test("Expression Set with too many DATA throws", function()
            local data = XSet():addAt("x","DATA"):addAt("y","DATA")
            local tooSmart = XExpression(data)
            local P = XSet:on(tooSmart)
            _:expect(function() P:card() end).throws("XEpression must contain unique DATA scope")
        end)

Green. Commit: XExpression uses elementProject to get DATA.

Why is this better? Well, it’s certainly more compact. It expresses the idea more clearly, given that we know what elementProject does: get the unique item under DATA. It’s less code and therefore clearer. It’s more obviously set-theoretic, which is our design metaphor here.

Now let’s devise a test. I think the idea will be that we’ll build up an index structure representing an index on “state”, and then we’ll do our usual search for Michigan. When we get the right records back, we’ll see what we can do to make sure we went through the correct code.

But how will we detect that we need the state index? Will we do it by analyzing the restrictor set to see if it contains only elements like foo^state?

Suppose, just for fun, that we had an operator like restrict that required and index and if it didn’t find one, created it and then used it. Or, simpler still, we could begin with a new operator indexedRestrict whose implementation is to use the indexing information, if it is applied to an XExpression set (the only kind of set that can have an index). Or, we could override restrict in XExpression … except that I don’t know how to call the superclass if I need to.

We’ll go the indexedRestrict route and then do as we did with the sets that implement their own card:

function XSet:card()
    return self.data:card() or self:reduce(0, function(r,e,s) return r+1 end)
end

If the set’s data knows care, we use it. The abstract class returns nil for card() so it all unwinds correctly.

OK we have a plan.

Indexed Restrict

Let’s TDD this. I plan to proceed kind of from the inside outward, first just providing the index set, then finding it, and so on.

I start with this test:

        _:test("indexedRestrict", function()
            local r1 = XSet():addAt("Ron","name"):addAt("MI","state")
            local r2 = XSet():addAt("Bill","name"):addAt("VA","state")
            local r3 = XSet():addAt("Pat","name"):addAt("MI","state")
            local r4 = XSet():addAt("Chet","name"):addAt("GA","state")
            local people = Tuple():add(r1):add(r2):add(r3):add(r4)
            local mi = XSet():addAt(1,NULL):addAt(3,NULL)
            local va = XSet():addAt(2,NULL)
            local ga = XSet():addAt(4,NULL)
            local indexes = XSet():addAt(mi,"MI"):addAt(va,"VA"):addAt(ga,"GA")
            local descoped = people:scopeRestrict(mi)
            _:expect(descoped:card()).is(2)
        end)

Here I’ve built up a people set and three state indexes, and a set containing all the indexes. The test just checks to be sure that I got 2 elements from using the scopeSet operation.

Green.

Now to test the fetching. I’m sort of writing the operation here in the test.

        _:test("indexedRestrict", function()
            local r1 = XSet():addAt("Ron","name"):addAt("MI","state")
            local r2 = XSet():addAt("Bill","name"):addAt("VA","state")
            local r3 = XSet():addAt("Pat","name"):addAt("MI","state")
            local r4 = XSet():addAt("Chet","name"):addAt("GA","state")
            local people = Tuple():add(r1):add(r2):add(r3):add(r4)
            local mi = XSet():addAt(1,NULL):addAt(3,NULL)
            local va = XSet():addAt(2,NULL)
            local ga = XSet():addAt(4,NULL)
            local indexes = XSet():addAt(mi,"MI"):addAt(va,"VA"):addAt(ga,"GA")
            local descoped = people:scopeRestrict(mi)
            _:expect(descoped:card()).is(2)
            local desiredIndex = indexes:elementProject("MI")
            local usedIndex = people:scopeRestrict(desiredIndex)
            _:expect(usedIndex:equals(descoped)).is(true)
        end)

Here I used elementProject to get the right index and used it, checking to be sure I got the same records as before. Still green.

A wild idea has appeared!

At this point I get a possibly clever idea. Suppose we were to create a new set operation selectFieldValue(name,value). We could defer the analysis of the data set until later on in our lives, and the operation itself would be rather handy.

Let’s do. And let’s give it an optional third parameter. Hold on, you’ll see it happen.

...
            local selected = people:selectFieldValue("state","MI", indexes)
            _:expect(selected:equals(descoped)).is(true)

This fails for not having the method:

6: indexedRestrict -- XData:69: attempt to call a nil value (method 'selectFieldValue')

Implement this useful method at the very top:

function XSet:selectFieldValue(scope,value)
    return self:select(function(record, recordScope)
        return record:hasAt(value,scope)
    end)
end

So that’s nice. And we’re still green. Now let’s use the provided index:

function XSet:selectFieldValue(scope,value, index)
    local ix = index and index:elementProject(value)
    if ix then
        print("used index")
        return self:scopeRestrict(ix)
    else
        return self:select(function(record, recordScope)
            return record:hasAt(value,scope)
        end)
    end
end

This prints “used index” and runs green. I’m almost surprised.

One more thing, let’s make it expect a set of index sets, whose scopes are the scopes indexed upon. We’ll have to enhance the test for this:

        _:test("indexedRestrict", function()
            local r1 = XSet():addAt("Ron","name"):addAt("MI","state")
            local r2 = XSet():addAt("Bill","name"):addAt("VA","state")
            local r3 = XSet():addAt("Pat","name"):addAt("MI","state")
            local r4 = XSet():addAt("Chet","name"):addAt("GA","state")
            local people = Tuple():add(r1):add(r2):add(r3):add(r4)
            local mi = XSet():addAt(1,NULL):addAt(3,NULL)
            local va = XSet():addAt(2,NULL)
            local ga = XSet():addAt(4,NULL)
            local indexes = XSet():addAt(mi,"MI"):addAt(va,"VA"):addAt(ga,"GA")
            local stateSet = XSet():addAt(indexes,"state")
            local allIndexes = XSet():addAt(stateSet,"INDEXES")
            local descoped = people:scopeRestrict(mi)
            _:expect(descoped:card()).is(2)
            local desiredIndex = indexes:elementProject("MI")
            local usedIndex = people:scopeRestrict(desiredIndex)
            _:expect(usedIndex:equals(descoped)).is(true)
            local selected = people:selectFieldValue("state","MI", allIndexes)
            _:expect(selected:equals(descoped)).is(true)
        end)

This is pretty intricate but I think it’s the goods. Should still work but not print “used index”

Does that, after I fix a typo. The problem now is that our index set is too deep.

function XSet:selectFieldValue(scope,value, index)
    local ix = self:findSelectIndex(scope,value,index)
    if ix then
        print("used index")
        return self:scopeRestrict(ix)
    else
        return self:select(function(record, recordScope)
            return record:hasAt(value,scope)
        end)
    end
end

function XSet:findSelectIndex(scope,value,index)
    if not index then return nil end
    local indexes = index:elementProject("INDEXES")
    if not indexes then return nil end
    local fieldIndex = indexes:elementProject(scope)
    if not fieldIndex then return nil end
    return fieldIndex:elementProject(value)
end

This is a bit irritating. It also works.

I’m going to commit: selectFieldValue can use a provided index set.

We’re not quite there yet. We are providing the index set to selectFieldValue and we would really like for it to find the indexes if they exist. But we’re quite close. The only kind of set that can have indexes of this kind is the XExpression set. So we can do our card trick and implement something in XExpression that will provide what we need, which will just be its own contents, because that has INDEXES and DATA in it. Unless we change the rules for that format.

I’d like to change that long series of if statements, but because elementProject returns nil, I can’t cascade the calls. I had wanted to but it was not to be. It’ll work when they’re all there but will send elementProject to a nil if they’re not. I may have an idea for how to work around that, but not for today.

I think we’ll pause right here for the morning’s article. I’m two hours in and that’s enough time heads down without a break. Let’s take a look at where we are.

Summary

We have a new set operation selectFieldValue, which is equivalent to restrict in the case that the restricting set is a singleton. We might want to make use of that fact in optimizing restrict but that’s for another day. The set operation currently takes a third “hidden” parameter in the soon-to-be-official form for indexes:

{ { {scopes}^v1,{scopes}^v2}^field} … } … }

Or something like that. Basically it’s

contents =   
    DATA -> data    
    INDEXES ->    
        FIELDNAME ->    
            V1 -> { scopes }    
            V2 -> { scopes }
        ANOTHERFIELDNAME -> ...

Easily parsed but not easily typed in.

Note the similarity of our new operation to

SELECT * FROM DATA WHERE STATE="MI"

Eerily familiar I hope.

It should be “easy” (q.v.)¹ to look at a restrict and if it’s suitable, convert it to a selectFieldValue to take advantage of an index. And there are other tricks, because we know things like “the restrict of A U B = r(A) U r(B).

What is most striking to me, and you probably can’t feel it from there, is how smoothly all this has gone. All those operations have worked the first time, subject only to typographical errors like misspelling a word or leaving out a colon before an addAt. It’s a tribute to the central idea that Childs had, that set theory was a good way to express these kinds of operations. And, of course, it calls back to the relational algebra of Codd as well, which has similar properties.

Note, however, that relational operations are not sufficient to do what we’re doing here. There is no corresponding idea in relational to a set containing a set, and we’re using sets of sets all the time now. Again, Childs had the right idea.

I just finished reading “Enchantress of Numbers”, by Jennifer Chiaverini, a very interesting novelization of the life of Ada Lovelace. It includes, of course, the story of Charles Babbage, whose Difference Engine and Analytical Engine would have been the first programmable computers, had he been able to complete them. Lack of funds was the main reason they didn’t get built, since they’d have been the size of a large room and would have required a steam engine to run all the gears.

In a way, I feel similarly toward Dave and XST. It’s a valid idea, of substantial value, that has never managed to get the support that it needed to go mainstream–and to provide for Dave the benefits that successful deployment would have provided. And as we can see here, it doesn’t require a room and a steam engine. It works just fine on my battery operated iPad.

So let us give a tip of the hat to Dave, and to ourselves, because what we have here is a set operation using an index which is itself expressed in sets. And we only got one stack overflow, and that was due to a missing colon.

I hope to see a few of you next time!

XSet2.zip CodeaUnit2.zip

“Easy”, of course, is a trigger word, as is “just”. It’s almost guaranteed to hide more work than we imagined. Still, it’s probably not terrible difficult. We’ll surely find out. ↩