XST 16: Yes, but why? And how?

Further reading leads me to think about design, and design motivation. Castles in the air. Or underground. Good stuff happens.

A while back, I found xegesis.org, which contains a number of Dave Childs’ papers, which I’ve been trolling for ideas. My own experience with set operations in this series, and I suppose left over in my subconscious, led me to appreciate some of these newer works of Dave’s, to want to try them, and to wonder why one might.

I’ve been using the iPad’s neat Markup ability to scribble notes on the pages of the articles, many of which notes aren’t even “WTF?!?!?”. And yesterday’s experience has inspired further questions, possibly ideas, as well.

Yesterday I introduced a new set structure, the CSVLineSet, which is a specialized set that contains a single line from a CSVSet, and which can rapidly answer hasAt. Since hasAt is at the center of XST and of our current implementation, that gave me a 15 or 20 percent improvement in my restrict timing test.

The CSVLineSet has a property that I think is important, and for which I do not know an official set-theoretic name. I am tentatively calling it “record-like”. The set has the property that

for every scope s in the scope-set of X
if there exists an x such that X:hasAt(x,s)
then for all y such that X:hasAt(y,x), x==y

That is, if there’s anything under a given scope in this set, there’s only one thing. I think of that as being like a “record” where there’s only one value for a given field in any given record.

I found the set operation rho, written ρ, which can be used to express this idea.

rho(X,s) = x
if X:hasAt(x,s) and for all y: X:hasAt(y,s) y==x
otherwise rho(X,s) is nil

This function is called “scope projection” sometimes, and basically it says that if we ask a set for the value at some scope (field name) s, we’ll get the value if it’s unique in the set at that scope and otherwise nil.

So that’s useful and I plan to implement it as soon as I stop with this thinking drivel. You’ll see immediately why it’s useful, if it has not already sprung into your mind.

The vertical blinds on the walldoor to the deck are closed but for one tiny crack, and the sun manages to line up with that tiny crack and pierce me in the eye whenever I look in that direction. What the hell?

Anyway, our new CSVLineSet has a good “rho” property, or is “record-like” and that makes it fast to iterate and fast to answer to hasAt. I must digress.

If I were writing in set theory, and composing for LaTeX, I’d write this:

x ∈_s X

where I write X:hasAt(x,s). I have historically written elt-s for that but since the method is hasAt, I’ve been using it. But I digress. There was a question in here somewhere …

Question(s)

Since in real life, the bulk of the data a program like XST would have to process is probably relational, i.e. record-like, it might make sense to have the default data storage layout assume record-like sets, process them rapidly, and somehow use other structures when the situation dictates.

If we were writing XST in C or C++ or some other language with really good access to blocks of memory, we’d probably go with fixed-length records and a name table relating name to offset and length in the records. I don’t think we can do a very good job of that approach with Lua, but there are corresponding structures that are pretty good. CSVLineSet is one. So a big question for today is whether we should change the basic set from its current table of scopes each containing a table of values in to something more amenable, such as just a table of scope->value.

I think I want to move in that direction. There are a bunch of related questions.

I’m starting to think that I should stop thinking and start doing, but let me get a few more notions down.

In our base XSet, we have a table of scopes like this:

scope->{v1->true,v2->true,v3->true}

That inner table is usually of size 1. If that is the case, then scope->v1 would be faster, although if the default case it’s not terrible now:

function XSet:hasAt(element, scope)
    local tab = self.contents[scope]
    if not tab then return false end
    if type(element) ~= "table" then
        return tab[element] == true
    else
        for e,_true in pairs(tab) do
            if type(e) == "table" and e:equals(element) then
                return true
            end
        end
        return false
    end
end

This code is not clear, though it is correct as far as our tests can discern. The default case is that the element being checked is a string, i.e. not a table, so we just do two has accesses to determine whether that element is under the given scope. (Scopes, in this implementation, are not allowed to be XSets. They have to be strings or NULL. That is possibly a flaw. But since we have more than one kind of set now, maybe that gets handled later on.)

All of this would be better, I think, with a different basic design:

Basic Design

There are two operations that are needed for any set, hasAt, and elements. (I am sure that hasAt is required, and I believe that elements is needed as well. I know that the current implementation requires it.)

The many other set operations are all defined in terms of hasAt and elements. That suggests a different design to me.

Today, all the specialized sets, CSVSet and CSVLineSet, inherit from XSet. I think a better design would be for XSet to contain instances of the other set types. They would become storage sets and the single XSet class would manipulate them. Primarily, XSet would defer operations to them, certainly hasAt and elements but could even defer all the others, in case they were really good at something, and they’d all double dispatch right back if, as usual, they really weren’t.

I think we need to do this. Why? Because it will be better, and we’re trying to learn how to do this. This isn’t a product where we have the tools more or less under our thumb. This is an exploration where we’re learning how to write the thing we need.

I’d like to refactor to this new data structure. I just got a vague plan for how to do it. What if we insert a new class XData, in between XSet and the subclasses? Then we sever the subclass relationship between XData and XSet and make everything continue to work.

That seems almost reasonable, doesn’t it? It might even work. Let’s try it.

XData

I think we need to make this an abstract class, because we want to require all the subclasses to implement certain operations. So I’m going to do two new classes. XData will be the abstract class, and the first concrete subclass will be the one that implements our current basic sets. XGeneric will be its name.

I think I’ll write a few tests, though I plan mostly to rely on the existing ones.

        _:test("XGeneric must implement", function()
            _:expect(function()
                XGeneric()
            end).throws("Must implement init!")
        end)

The idea here is that the abstract class will include methods that error if the subclass hasn’t implemented the method.

1: XGeneric must implement  -- Actual: function: 0x28365d890, Expected: Must implement init!

XData = class(XSet)

function XData:init()
    error("Must implement init!")
end

XGeneric = class(XData)

Close enough. Now what? This seems like a massive refactoring. Let’s test drive to the point where we think that XGeneric can at least somewhat work and then plug it in.

XGeneric needs init, hasAt, addAt, and elements. And they work just like XSet now. (I’d like to get rid of addAt but I don’t see how just now.)

I’ll copy over an early test from XSet and rename:

        _:test("Is element", function()
            local A = XGeneric()
            A:addAt("Jeffries", "last")
            A:addAt("Chet", "first")
            _:expect(A:hasAt("Jeffries", "last"), "has").is(true)
            _:expect(A:hasAt("Chet", "last"),"hasnt").is(false)
            _:expect(A:isNull()).is(false)
            _:expect(A:card()).is(2)
        end)

We see already some methods that may or may not be desired on our data sets, isNull and card. XSet can generically determine the answers, but the individual classes can be faster.

Anyway … this test should fail nicely, still asking for init.

2: Is element -- XData:35: Must implement init!

function XGeneric:init()
    self.contents = {}
end

That’s lifted from XSet, as the following methods will be.

2: Is element -- XData:21: attempt to call a nil value (method 'addAt')

function XGeneric:addAt(element, scope)
    assert(scope~=nil,"nil scope "..tostring(element))
    if self:hasAt(element,scope) then return end
    if not self.contents[scope] then
        self.contents[scope] = {}
    end
    self.contents[scope][element] = true
    return self
end

2: Is element -- XData:46: attempt to call a nil value (method 'hasAt')

Need to put that in XData as well, to throw. I’ll do elements as well.

function XGeneric:hasAt(element, scope)
    local tab = self.contents[scope]
    if not tab then return false end
    if type(element) ~= "table" then
        return tab[element] == true
    else
        for e,_true in pairs(tab) do
            if type(e) == "table" and e:equals(element) then
                return true
            end
        end
        return false
    end
end

2: Is element -- XData:25: attempt to call a nil value (method 'isNull')

What about this and ‘card`? We have

function XSet:card()
    local card = 0
    for e,s in self:elements() do
        card = card+1
    end
    return card
end

function XSet:isNull()
    return not self:exists(function(e,s) return true end)
end

The former should be rewritten to use reduce. Anyway, they are implemented in the generic form, and that’ll do for now.

I think we “should” be able to plug XGeneric in now. I’m sure it’ll take a bit of effort:

function XSet:init()
    self.locked = false
    self.contents = {}
end

Well, let’s get rid of locked altogether and try this:

function XSet:init()
    self.locked = false
    self.data = XGeneric()
end

I’m just going to keep testing and fixing, possibly adding tests to XGeneric as seems needed. This fails a lot:

2: Is element -- XSet:83: attempt to index a nil value (field 'contents')

Of course all our references to contents are bad. I’ll use Codea’s Find to help me. The one in question is here:

function XSet:hasAt(element, scope)
    local tab = self.contents[scope]
    if not tab then return false end
    if type(element) ~= "table" then
        return tab[element] == true
    else
        for e,_true in pairs(tab) do
            if type(e) == "table" and e:equals(element) then
                return true
            end
        end
        return false
    end
end

Sweet, we should defer that.

function XSet:hasAt(element, scope)
    return self.data:hasAt(element,scope)
end

Test.

2: Is element -- XSet:19: attempt to index a nil value (field 'contents')

function XSet:addAt(element, scope)
    return self.data:addAt(element,scope)
end

That brings me to elements. I’ll move it and not do a test.

function XGeneric:elements()
    -- return an iterator for contents
    return coroutine.wrap(function() 
        for scope,elements in pairs(self.contents) do
            for element,_true in pairs(elements) do
                coroutine.yield(element,scope)
            end
        end
    end)
end

function XSet:elements()
    return self.data:elements()
end

The next errors are about locking. I’m going to remove that concept, and put it back if we ever actually need it. My real desire is to get rid of addAt but I seem to need it for tests.

There’s an issue with XSet:NULL which really does need to be locked. I think we can make a NULL data set that handles that. Remind me.

Hm. All the references to self.contents are gone from XSet. What will fail now?

I’m getting errors on card and equals now, and I’m not sure why. Here’s one:

3: Intersect -- Tests:39: attempt to call a nil value (method 'card')

        _:test("Intersect", function()
            local A = XSet()
            A:addAt("Jeffries", "last")
            A:addAt("Ron", "first")
            _:expect(A:card(), "#A").is(2)
            local B = XSet()
            B: addAt("Ron", "first")
            _:expect(B:card(), "#B").is(1)
            local C = A:intersect(B)
            _:expect(C:hasAt("Ron", "first")).is(true)
            _:expect(C:card()).is(1)
            _:expect(C:isNull()).is(false)
        end)

Line 39 is the last one. I wonder what we got back from the intersect? XSet understands card. Better check.

Intersect has returned an XGeneric, not an XSet. What’s up with that?

function XSet:intersect(anXSet)
    return self:select(function(e,s) return anXSet:hasAt(e,s) end)
end

function XSet:select(predicate)
    return self:reduce(XSet(), function(r,e,s)
        return predicate(e,s) and r:addAt(e,s) or r
    end)
end

function XSet:reduce(first,f)
    local result = first
    for e,s in self:elements() do
        result = f(result, e,s)
    end
    return result
end

Hm, here of course is where we don’t actually love reducing everything to reduce because it’s hard to think about. I think that the bug is that we’re returning the generic from addAt. I’m not sure why I think that.

function XSet:addAt(element, scope)
    return self.data:addAt(element,scope)
end

Yes. Good guess. Fix:

function XSet:addAt(element, scope)
    self.data:addAt(element,scope)
    return self
end

Test.

2: Is element -- XData:25: attempt to call a nil value (method 'isNull')

That’s my own test and I’m not going to require that method.

        _:test("Is element", function()
            local A = XGeneric()
            A:addAt("Jeffries", "last")
            A:addAt("Chet", "first")
            _:expect(A:hasAt("Jeffries", "last"), "has").is(true)
            _:expect(A:hasAt("Chet", "last"),"hasnt").is(false)
            _:expect(A:isNull()).is(false)
            _:expect(A:card()).is(2)
        end)

Remove the check for null and card. Those belong to XSet.

Tests are green. Commit: XSet now uses XGeneric for data storage.

Sorry, but now we need to think some more.

Other Data Structures

The idea here is that XSets will have a data storage object, not be one. So our new CSVSet and CSVLineSet need to create XSets containing instances of themselves.

I guess the first thing I’ll do is reroot them at XData rather than XSet. That should break something.

It does, several messages all reading like this:

15: CSVSet restrict -- Tests:216: attempt to call a nil value (method 'restrict')

OK this is interesting:

        _:test("CSVSet restrict", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            local brighton = XSet()
            brighton:addAt("Brighton", "city")
            local restrictor = XSet()
            restrictor:addAt(brighton,NULL)
            local restricted = csv:restrict(restrictor)
            _:expect(restricted:card()).is(1)
            local record = NULL
            for e,s in restricted:elements() do
                record = e
            end
            _:expect(record:hasAt("Josephine","first_name")).is(true)
        end)

The issue, I’d say, is that we want to create an XSet here that has a CSVSet as its data.

Let me try that longhand to begin.

This seems to work:

        _:test("CSVSet restrict", function()
            local csvData = CSVSet(CSVnames, CSVminidata)
            local csv = XSet()
            csv.data = csvData
            local brighton = XSet()
            brighton:addAt("Brighton", "city")
            local restrictor = XSet()
            restrictor:addAt(brighton,NULL)
            local restricted = csv:restrict(restrictor)
            _:expect(restricted:card()).is(1)
            local record = NULL
            for e,s in restricted:elements() do
                record = e
            end
            _:expect(record:hasAt("Josephine","first_name")).is(true)
        end)

Now I get this error:

16: CSV big restrict -- Tests:233: attempt to call a nil value (method 'restrict')

That’s here:

        _:test("CSV big restrict", function()
            local csv = CSVSet(CSVnames, CSVdata)
            local mi = XSet()
            mi: addAt("MI","state")
            local restrictor = XSet()
            restrictor:addAt(mi,NULL)
            local restricted = csv:restrict(restrictor)
            _:expect(restricted:card()).is(14)
        end)

I feel like I should be able to do the same thing here. I do and then:

16: CSV big restrict -- XData:86: attempt to call a nil value (method 'equals')

I expect that’ll be interesting. But what may be going on is confusion in the creation of the CSVLineSet. We need a way to create an XSet on a different kind of data. Let me change the tests:

        _:test("CSVSet restrict", function()
            local csv = XSet:on(CSVSet(CSVnames, CSVminidata))
            local brighton = XSet()
            brighton:addAt("Brighton", "city")
            local restrictor = XSet()
        ...

And

function XSet:on(aDataSet)
    local result = XSet()
    result.data = aDataSet
    return result
end

I expect the same errors. Now to find where I create that CSVLineSet.

function CSVSet:createLineSet(line)
    return CSVLineSet(self.nameToIndex, line)
end

That becomes:

function CSVSet:createLineSet(line)
    return XSet:on(CSVLineSet(self.nameToIndex, line))
end

Test, hopefully. Hopes are somewhat dashed:

22: CSV restrict operator -- Tests:301: attempt to call a nil value (method 'restrict')
23: CSV large restrict operator -- Tests:316: attempt to call a nil value (method 'restrict')

Ah, those guys are not using my new on method. Nice. Plug that in and the tests are green.

Commit: CSVSet and CSVLineSet now rooted in XData.

What Have We Wrought?

Well, the good news is that we have separated the set operations, restrict, union, and so on, from the set data, where hasAt and elements happens.

We can do better. We do have a specialized card for CSVSet:

function CSVSet:card()
    local pat = ".-\n"
    local rep = "X"
    local nls = self.data:gsub(pat,rep)
    return nls:len()
end

This converts the many lines of a CSVSet into a series of X’s and counts them. Let’s allow card to be overridden in our sets.

How can we do that? We could test in XSet:card to see whether our data knows card. I know, let’s implement card on the XData class to return nil, which is never a cardinality.

Then:

function XSet:card()
    return self.data:card() or self:longCard()
end

If the set returns a non-nil, we’ll use it, otherwise:

function XSet:longCard()
    local card = 0
    for e,s in self:elements() do
        card = card+1
    end
    return card
end

Tests are green. I can’t help wondering if we got to the card in CSVSet. I just tossed in a print. It printed. I’m not sure how I could have tested that more reasonably.

Let’s fix longcard to use reduce.

function XSet:longCard()
    return self:reduce(0, function(r,e,s) return r+1 end)
end

Now let’s combine these two methods back together:

function XSet:card()
    return self.data:card() or self:longCard()
end

function XSet:longCard()
    return self:reduce(0, function(r,e,s) return r+1 end)
end

This is the “inline method” refactoring, for those keeping score at home:

function XSet:card()
    return self.data:card() or self:reduce(0, function(r,e,s) return r+1 end)
end

Green. Commit: allow XData subclasses to implement card. Use reduce in general XSet card.

I find the old baseRestrict method, used in testing. Fold that back into restrict. Green. Commit: remove baseRestrict method.

Oh. I wanted to do something about Null, so that you can’t add records to it. Let’s write a test.

        _:test("Can't add to NULL", function()
            NULL:addAt("foo","bar")
            _:expect(NULL:card()).is(0)
        end)

27: Can't add to NULL  -- Actual: 1, Expected: 0

I was going to create a new XData subclass for that object, but I wonder if we could give it a special card method and such instead. That’s simple in Lua:

XSet.NULL = XSet()
NULL = XSet.NULL
NULL.addAt = function() return NULL end
NULL.card = function() return 0 end
NULL.elements = function() return pairs({}) end
NULL.hasAt = function() return false end

I’ve given NULL its own private methods that “just don’t care”. Those are overkill, but in for a penny. We might as well provide a solid definition for NULL right in the code.

Green. Commit: NULL protects itself and documents its “behavior”.

Let’s sum up. We’ve done a good thing here.

Summary

In a pretty simple refactoring, we’ve separated the set operations entirely from the data structures that hold, well, the data that we process. In XSet we implement:

__tostring,
addAt,
card,
copyInto,
display,
elements,
equals,
every,
exists,
hasAt,
hasntAt,
init,
intersect,
isNull,
isSubset,
is_a,
on,
reduce,
restrict,
select,
union

And in the data sets we are only required to implement hasAt and elements, though we can implement other methods such as card and addAt if we desire.

Given those two key methods, we can make our set operations operate on any data structure we desire, subject to the constraint that the output sets we may create will always be in our generic data format. We have provided no ability to choose the output format. Perhaps we should.

This is a very good thing. We now have a collection of set-oriented operations that can operate on any form of data whatsoever, given only that we can implement those two simple methods, “do you have this in you” and “tell me all your elements and scopes”.

We have an example of a set method, card, implemented in a data set, and could readily allow for other custom methods. That said, our best effort to date on that wasn’t very effective, and implementing a better sub-structure, the CSVLineSet, with just hasAt, was better. So much for cleverness and speculation.

Converting to this new structure was pretty straightforward. We did just a bit of TDD, enough to get the base class and the generic data class in place, and then our existing tests took over and led us right to the things that needed changing. This has to be a triumph for the TDD approach, as the existing tests, plus a few local ones, supported a pretty serious refactoring.

It’s also good that we started this as soon as we did, and of course if we were smarter (that is, if you were here with me), we might have done this right off the bat. Me, I’m happy to do something that isn’t quite right and then improve it. I don’t have to guess right before I even know anything.

Looking forward, I have a couple of things in mind.

I’d like to provide a better initial structure for the data than the current generic. I freely grant that I am not sure it’ll be that much better, so I’ll try to remember to experiment to see if it’s worth the added complexity.

I’d like to have a way for an XSet to change its data structure on the fly. We might use that for creation of indexes or something like that. That will require either a hook back to the XSet that contains a given structure, or some similar scheme down at the data level. I don’t see a good way to do that just now.

And I’d like to have some more set operations, in particular ones that will be useful for getting data out of the system, and for transforming sets from shapes we have to shapes we like.

Finally, I am sorely tempted to create an expression tree of some kind that the code can dynamically refactor, to make good things happen. If I get there, I’ll have accomplished something that I myself never have before, although folks who worked under/with me have done it. They were much smarter than I am, and probably worked harder as well.

I think of them fondly.

Anyway, that’s what I’m looking at. For today, I need to finish reading Time Burrito, which is about as good as it sounds.

See you soon, I hope!