XST 15: An experiment. An assessment. A plan, perhaps.

In conversation with Bill Wake and with the Internet, I have an idea for something to try. And I’m just about ready to assess where we are and where we should go.

On a Slack we inhabit, Bill and I tossed around some ideas. Not as fruitful as it might have been in a real chat but some understanding and ideas came out. I’ll come back to some of those below.

One more direct outcome came from the Rubber Duck Theory of Software Development. In explaining some details to Bill, I got an idea for something interesting to try. I subsequently even tried to express the idea in set theory.

Let’s look at the base algorithm for restrict, the one that runs at the XSet level:

function XSet:baseRestrict(B)
    -- return all elements (a) of self such that
    --  there exists a record (b) in B such that
    --      b:subset(a)
    return self:select(function(a,s)
        return B:exists(function(b,s) 
            return b:isSubset(a) 
        end) 
    end)
end

That’s just about a perfect transcription of the formal definition into Lua. It leads us to ask about isSubset:

function XSet:isSubset(other)
    -- return whether self is a subset of other
    -- for every element e,s of self
    --   is e element-s of other
    return self:every(function(e,s) 
        return other:hasAt(e,s) 
    end)
end

So restrict comes down to a whole raft of calls to hasAt, the method that asks whether a set has a given element at a given scope. If hasAt is slow, isSubset is slow and restrict is slow.

We have a bunch going on in a restrict against a CSVSet, including this:

function CSVSet:elements()
    -- return an iterator for contents
    local pat = ".-\n"
    return coroutine.wrap(function() 
        for line in self.data:gmatch(pat) do
            local converted = self:convertToXSet(line)
            coroutine.yield(converted,NULL)
        end
    end)
end

This code finds a line in the CSV “file”, and converts it to an XSet, and returns it (assuming scope == NULL). The conversion is a bit nasty:

function CSVSet:convertToXSet(line)
    local result = XSet()
    local nameCount = 1
    pat = '"(.-)"[,\n]'
    for unquoted in line:gmatch(pat) do
        local label = self.labels[nameCount]
        result:addAt(unquoted, label)
        nameCount = nameCount + 1
    end
    return result:lock()
end

I suspect, but have not measured, that much of the cost of this operation is in the addAt, putting the individual items into the result XSet,

function XSet:addAt(element, scope)
    if self.locked then error("set is locked") end
    assert(scope~=nil,"nil scope "..tostring(element))
    if self:hasAt(element,scope) then return end
    if not self.contents[scope] then
        self.contents[scope] = {}
    end
    self.contents[scope][element] = true
    return self
end

This operation checks each incoming element and scope to see if they are already present in the XSet, to avoid duplicates, which are not a mathematical problem but undesirable from a programming viewpoint. That hasAt call in there resolves to this:

function XSet:hasAt(element, scope)
    local tab = self.contents[scope]
    if not tab then return false end
    if type(element) ~= "table" then
        return tab[element] == true
    else
        for e,_true in pairs(tab) do
            if type(e) == "table" and e:equals(element) then
                return true
            end
        end
        return false
    end
end

Our elements, incoming, are not tables, so it’s not as bad as it might be. Nonetheless, it seems like if we could speed up the hasAt on our CSV records, we might be glad that we had.

Thus my idea:

Assume a Lightbulb Here

The CSVSet has an array of names, the labels of the fields of the CSV file. and it has a rack of lines, of variable length, separated by newlines. When we go to create each record as an XSet, we have a quick gmatch on newline, and then go into our XSet-folderol.

What if, instead, there was a new kind of Set, a CSVLineSet, that could answer the hasAt question rapidly? Might we pick up a bit of time? Perhaps. Might we learn something? Certainly.

The idea is this: convert the array of names, at the time of creation of the CSVSet into a table of name->integer, where the integer is the number of the field. 1, 2, 3, that sort of thing.

Then,, when we pull out the line in CSVSet elements, create, not an XSet, but a CSVLineSet, giving it the label table. And inside the CSVLineSet use gmatch to produce an array of field values. Then hasAt is just a quick couple of fetches from the tables.

Let’s try it.

Let’s TDD the bits.

        _:test("CSVLine label table", function()
            local labels = { "last", "first", "city", "age" }
            local labelTable = CSVSet:makeLabels(labels)
            _:expect(lableTable["last"]).is(1)
            _:expect(lableTable["first"]).is(2)
            _:expect(lableTable["city"]).is(3)
            _:expect(lableTable["age"]).is(4)
        end)

This is almost trivially easy. So tempting to just code it up, but in fact I’m trying to get into the TDD rhythm, and there’s nothing like a trivial test to get started. Besides, I can probably screw this up somehow.

Test will fail looking for the makeLabels.

27: CSVLine label table -- Tests:398: attempt to call a nil value (method 'makeLabels')

And …

function CSVSet:makeLabels(labels)
    local result = {}
    return result
end

Now the test fails the expectations. Wait, what?

27: CSVLine label table -- Tests:399: attempt to index a nil value (global 'lableTable')

Nice spelling, Ron. (I told you I can screw up.)

        _:test("CSVLine label table", function()
            local labels = { "last", "first", "city", "age" }
            local labelTable = CSVSet:makeLabels(labels)
            _:expect(labelTable["last"]).is(1)
            _:expect(labelTable["first"]).is(2)
            _:expect(labelTable["city"]).is(3)
            _:expect(labelTable["age"]).is(4)
        end)

Now:

27: CSVLine label table  -- Actual: nil, Expected: 1

And three more.

function CSVSet:makeLabels(labels)
    local result = {}
    for i,label in ipairs(labels) do
        result[lable] = i
    end
    return result
end

I expect the test to run. My expectations are dashed:

27: CSVLine label table -- CSVset:53: table index is nil

Must be a spelling error. Dammit, I did it again, wrote lable instead of label. Too much typing of tabel. Also intercourse English spelling.

function CSVSet:makeLabels(labels)
    local result = {}
    for i,label in ipairs(labels) do
        result[label] = i
    end
    return result
end

Now it better run. And it does. And I’m rather glad I started with that test.

Now let’s build a CSVLineSet instance and check it.

        _:test("CSVLineSet", function()
            local names = {"last","first","city","state"}
            local data  = '"Jeffries","Ron","Pinckney","MI"\n'
            local csv = CSVSet(names,{})
            local line = csv:createLineSet(data)
            _:expect(line:hasAt("Jeffries","last"))is(true)
            _:expect(line:hasAt("Ron","first"))is(true)
            _:expect(line:hasAt("Pinckney","city"))is(true)
            _:expect(line:hasAt("MI","state"))is(true)
        end)

This is a bit bigger than it should be, I think. But we’ll get there, and if I get in trouble, I’ll write a simpler test.

You may be wondering what my issue is with this one? Well, it seems to be a test just on createLineSet but inside there’s work to be done with the labels and such. I’d prefer only one little step to green, and here we have a few.

I feel up to it. The test will guide me.

28: CSVLineSet -- Tests:409: attempt to call a nil value (method 'createLineSet')

No surprise here.

function CSVSet:createLineSet(line)
    return CSVLineSet(self.nameToIndex, line)
end

Here I’ve posited that we have the name to index table built already. I could go do that now, but the test isn’t going to ask for that, it’s going to ask for CSVLineSet:

28: CSVLineSet -- CSVset:32: attempt to call a nil value (global 'CSVLineSet')

We’ll follow the test, not our nose.

CSVLineSet = class(XSet)

Fail on hasAt, I imagine. Might fail in the init.

This surprised me:

28: CSVLineSet -- Tests:410: attempt to call a nil value (global 'is')

A bug in the test. I left out all the dots in front of is. I’m a bit surprised that compiled. Fixed the dots and:

28: CSVLineSet  -- Actual: false, Expected: true

Interesting. The set ran through the standard inherited XSet code without major error.

This raises an issue for me, namely whether this CSVLine set is robust enough to respond to all the various default methods. We’ll try to remember to come back to that. For now, we better do init and hasAt:

function CSVLineSet:init(labels, line)
    local pat = '"(.-)"[,\n]'
    self.labels = labels
    local fields = {}
    for field in line:gmatch(pat) do
        table.insert(fields,field)
    end
    self.fields = fields
end

function CSVLineSet:hasAt(e,s)
    local index = self.labels[s]
    if index == nil then return false end
    return self.fields[index] == e
end

That’s a lot of code to just be blurting out. Probably something bad will happen. Test.

28: CSVLineSet -- CSVset:99: attempt to index a nil value (field 'labels')

Labels is nil. That’s because nameToIndex here is nil:

function CSVSet:createLineSet(line)
    return CSVLineSet(self.nameToIndex, line)
end

The test told us. But it could have been more clear. Anyway …

function CSVSet:init(labels,data)
    self.labels = labels
    self.nameToIndex = self:makeLabels(labels)
    self.data = data
end

I think I don’t like the naming there.

The tests all run. Commit: CSVLineSet passes hasAt test.

I think the next thing is to try it here:

function CSVSet:elements()
    -- return an iterator for contents
    local pat = ".-\n"
    return coroutine.wrap(function() 
        for line in self.data:gmatch(pat) do
            local converted = self:convertToXSet(line)
            coroutine.yield(converted,NULL)
        end
    end)
end

That becomes:

function CSVSet:elements()
    -- return an iterator for contents
    local pat = ".-\n"
    return coroutine.wrap(function() 
        for line in self.data:gmatch(pat) do
            local converted = self:createLineSet(line)
            coroutine.yield(converted,NULL)
        end
    end)
end

That should result in restrict using our new set. This will be interesting.

24: Time Restricts -- XSet:74: XSet:52: bad argument #1 to 'for iterator' (table expected, got nil)

Interesting message. I’m not sure just what’s up but I suspect it’s that we don’t have elements implemented on our CSVLineSet.

function CSVLineSet:elements()
    return coroutine.wrap(function()
        for scope,index in self.labels do
            yield(self.fields[index], scope)
        end
    end)
end

I really should have written a test for that. Let’s run and see what happens now.

24: Time Restricts -- XSet:74: CSVset:101: attempt to call a table value

Forgot pairs.

function CSVLineSet:elements()
    return coroutine.wrap(function()
        for scope,index in pairs(self.labels) do
            yield(self.fields[index], scope)
        end
    end)
end

This will return the fields of the CSVLineSet in a random order. We might not like that. But tests shouldn’t care. I kind of expect a clean run now. Yeah, right, forgot coroutine:

function CSVLineSet:elements()
    return coroutine.wrap(function()
        for scope,index in pairs(self.labels) do
            coroutine.yield(self.fields[index], scope)
        end
    end)
end

I told me I should have tested that directly. Works. Let’s see some numbers:

The old numbers are:

10 csv 	0.26193499565125
10 long 	0.15184116363525

The new numbers are:

10 csv 	0.26467084884644
10 long 	0.13110280036926

A small but noticeable improvement.

Let’s review what the tests are.

The word “long” should be “base”. The base is faster than it was, and faster than the specialized restrict we wrote. I think we can write off the old csvRestrict entirely.

If I remove the specialized version, some tests fail. I’ll remove those as well.

Commit: new CSVLineSet speeds up restrict a bit. Specialized csvRestrict and tests removed. Timing tests removed.

Let’s sit back and sum up.

RetroSummary

So. That went well and was a small improvement. What are some issues?

Timing Suite: I think we have a case for a fairly comprehensive suite of timing tests for this thing, since part of the Big Story is that by using different representations of the information we can get high performance without changing the basic theory.
Refactoring Needed: The current implementation of the new CSVLineSet is a bit rough, and both it and CSVSet should have a bit of cleanup done to them. This isn’t technical debt: it’s code that’s not as good as it could be. I think we’ll defer this until next time, however, as I’m tired, and we have some other thinking to do.
Approaches: Bill suggested a number of approaches that one could take in building a system like this. A, just let everything call hasAt and elements and leave it up to the compiler. B, hand-code certain overrides for efficiency. (That’s what I did with csvRestrict. C, rather than hand-coding, treat it as a refactoring problem, transforming the code based on what the code says and whatever cleverness one might have. D, represent the actions to be performed in some kind of expression tree and write smart transformations on that tree based on the situation and then execute the transformed tree.; C is a good wakeup call for me, because generally speaking I think refactoring is better than rewriting. I suspect that in the current situation you’d do something like bring the old algorithm down to the subclass and refactor it there, leaving the canonical one up top for whatever cases don’t have refactored ones. I’d have to think about that, and Bill may have seen something that I don’t.; D is of course the big win if one can do it. Someone expresses a query that comes down to some set expression and then you plug in substitutions and refactor it with some magic automatic routines and Voila! you have a better way of getting the right answer. We’ll have to think about that. We might be able to at least demonstrate the principle.; Bill correctly observed that the “point” of XST is the ability to describe all the layers in the same (mathematical) language, so it makes some sense that you could come down to different final set expressions depending on the data you have.; More thought needed here, but I’d like to try it.
More Set Operations: I actually did some rough set theoretic calculations in thinking about the CSVLineSet. I’ve been reading the articles I can find on line and discovered some relatively newly defined operations that made some sense to me. I’ll spare you the math, but here are a couple of pages of my scribbles for your amusement.

scribbles 1

: I don’t claim that those scribbles quite span the gap between my CSVLineSet and pure mathematics, but they are close. In particular, the rho operator (looks like a P in my scribbles) is an interesting one that, given a scope, returns the unique element at that scope if there is one. Seemed like what I was thinking of doing with the CSVLineSet.
Better Base Set Structure: The structure of the CSVLineSet is quite nice if you happen to have named elements, one for each name, either a tuple or whatever you’d call the condition that

S:hasAt(x,y) and S:hasAt(z,y) iff x==z

: This makes me want to make that the assumed structure for every set, and then somehow to morph to something more complex if someone gives you another element under the same scope. The base structure would be even simpler than the CSVLineSet, in that we haven’t hooked the CSVLineSet’s elements to the scopes. Arguably we should do that but the most common usage is only to check one or a few scopes from the many, so it seemed best not to create the full structure.

“Bottom” Line …

A bit of progress, quite some interesting learning, and even more interesting speculation. At least interesting to me. I hope that my reader agrees.

See you next time!