XST 15: An experiment. An assessment. A plan, perhaps.
In conversation with Bill Wake and with the Internet, I have an idea for something to try. And I’m just about ready to assess where we are and where we should go.
On a Slack we inhabit, Bill and I tossed around some ideas. Not as fruitful as it might have been in a real chat but some understanding and ideas came out. I’ll come back to some of those below.
One more direct outcome came from the Rubber Duck Theory of Software Development. In explaining some details to Bill, I got an idea for something interesting to try. I subsequently even tried to express the idea in set theory.
Let’s look at the base algorithm for restrict, the one that runs at the XSet level:
function XSet:baseRestrict(B)
-- return all elements (a) of self such that
-- there exists a record (b) in B such that
-- b:subset(a)
return self:select(function(a,s)
return B:exists(function(b,s)
return b:isSubset(a)
end)
end)
end
That’s just about a perfect transcription of the formal definition into Lua. It leads us to ask about isSubset
:
function XSet:isSubset(other)
-- return whether self is a subset of other
-- for every element e,s of self
-- is e element-s of other
return self:every(function(e,s)
return other:hasAt(e,s)
end)
end
So restrict comes down to a whole raft of calls to hasAt
, the method that asks whether a set has a given element at a given scope. If hasAt
is slow, isSubset
is slow and restrict
is slow.
We have a bunch going on in a restrict against a CSVSet, including this:
function CSVSet:elements()
-- return an iterator for contents
local pat = ".-\n"
return coroutine.wrap(function()
for line in self.data:gmatch(pat) do
local converted = self:convertToXSet(line)
coroutine.yield(converted,NULL)
end
end)
end
This code finds a line in the CSV “file”, and converts it to an XSet, and returns it (assuming scope == NULL). The conversion is a bit nasty:
function CSVSet:convertToXSet(line)
local result = XSet()
local nameCount = 1
pat = '"(.-)"[,\n]'
for unquoted in line:gmatch(pat) do
local label = self.labels[nameCount]
result:addAt(unquoted, label)
nameCount = nameCount + 1
end
return result:lock()
end
I suspect, but have not measured, that much of the cost of this operation is in the addAt
, putting the individual items into the result XSet,
function XSet:addAt(element, scope)
if self.locked then error("set is locked") end
assert(scope~=nil,"nil scope "..tostring(element))
if self:hasAt(element,scope) then return end
if not self.contents[scope] then
self.contents[scope] = {}
end
self.contents[scope][element] = true
return self
end
This operation checks each incoming element and scope to see if they are already present in the XSet, to avoid duplicates, which are not a mathematical problem but undesirable from a programming viewpoint. That hasAt
call in there resolves to this:
function XSet:hasAt(element, scope)
local tab = self.contents[scope]
if not tab then return false end
if type(element) ~= "table" then
return tab[element] == true
else
for e,_true in pairs(tab) do
if type(e) == "table" and e:equals(element) then
return true
end
end
return false
end
end
Our elements, incoming, are not tables, so it’s not as bad as it might be. Nonetheless, it seems like if we could speed up the hasAt
on our CSV records, we might be glad that we had.
Thus my idea:
Assume a Lightbulb Here
The CSVSet has an array of names, the labels of the fields of the CSV file. and it has a rack of lines, of variable length, separated by newlines. When we go to create each record as an XSet, we have a quick gmatch on newline, and then go into our XSet-folderol.
What if, instead, there was a new kind of Set, a CSVLineSet, that could answer the hasAt
question rapidly? Might we pick up a bit of time? Perhaps. Might we learn something? Certainly.
The idea is this: convert the array of names, at the time of creation of the CSVSet into a table of name->integer, where the integer is the number of the field. 1, 2, 3, that sort of thing.
Then,, when we pull out the line in CSVSet elements
, create, not an XSet, but a CSVLineSet, giving it the label table. And inside the CSVLineSet use gmatch
to produce an array of field values. Then hasAt
is just a quick couple of fetches from the tables.
Let’s try it.
Let’s TDD the bits.
_:test("CSVLine label table", function()
local labels = { "last", "first", "city", "age" }
local labelTable = CSVSet:makeLabels(labels)
_:expect(lableTable["last"]).is(1)
_:expect(lableTable["first"]).is(2)
_:expect(lableTable["city"]).is(3)
_:expect(lableTable["age"]).is(4)
end)
This is almost trivially easy. So tempting to just code it up, but in fact I’m trying to get into the TDD rhythm, and there’s nothing like a trivial test to get started. Besides, I can probably screw this up somehow.
Test will fail looking for the makeLabels
.
27: CSVLine label table -- Tests:398: attempt to call a nil value (method 'makeLabels')
And …
function CSVSet:makeLabels(labels)
local result = {}
return result
end
Now the test fails the expectations. Wait, what?
27: CSVLine label table -- Tests:399: attempt to index a nil value (global 'lableTable')
Nice spelling, Ron. (I told you I can screw up.)
_:test("CSVLine label table", function()
local labels = { "last", "first", "city", "age" }
local labelTable = CSVSet:makeLabels(labels)
_:expect(labelTable["last"]).is(1)
_:expect(labelTable["first"]).is(2)
_:expect(labelTable["city"]).is(3)
_:expect(labelTable["age"]).is(4)
end)
Now:
27: CSVLine label table -- Actual: nil, Expected: 1
And three more.
function CSVSet:makeLabels(labels)
local result = {}
for i,label in ipairs(labels) do
result[lable] = i
end
return result
end
I expect the test to run. My expectations are dashed:
27: CSVLine label table -- CSVset:53: table index is nil
Must be a spelling error. Dammit, I did it again, wrote lable instead of label. Too much typing of tabel. Also intercourse English spelling.
function CSVSet:makeLabels(labels)
local result = {}
for i,label in ipairs(labels) do
result[label] = i
end
return result
end
Now it better run. And it does. And I’m rather glad I started with that test.
Now let’s build a CSVLineSet instance and check it.
_:test("CSVLineSet", function()
local names = {"last","first","city","state"}
local data = '"Jeffries","Ron","Pinckney","MI"\n'
local csv = CSVSet(names,{})
local line = csv:createLineSet(data)
_:expect(line:hasAt("Jeffries","last"))is(true)
_:expect(line:hasAt("Ron","first"))is(true)
_:expect(line:hasAt("Pinckney","city"))is(true)
_:expect(line:hasAt("MI","state"))is(true)
end)
This is a bit bigger than it should be, I think. But we’ll get there, and if I get in trouble, I’ll write a simpler test.
You may be wondering what my issue is with this one? Well, it seems to be a test just on createLineSet
but inside there’s work to be done with the labels and such. I’d prefer only one little step to green, and here we have a few.
I feel up to it. The test will guide me.
28: CSVLineSet -- Tests:409: attempt to call a nil value (method 'createLineSet')
No surprise here.
function CSVSet:createLineSet(line)
return CSVLineSet(self.nameToIndex, line)
end
Here I’ve posited that we have the name to index table built already. I could go do that now, but the test isn’t going to ask for that, it’s going to ask for CSVLineSet:
28: CSVLineSet -- CSVset:32: attempt to call a nil value (global 'CSVLineSet')
We’ll follow the test, not our nose.
CSVLineSet = class(XSet)
Fail on hasAt
, I imagine. Might fail in the init.
This surprised me:
28: CSVLineSet -- Tests:410: attempt to call a nil value (global 'is')
A bug in the test. I left out all the dots in front of is. I’m a bit surprised that compiled. Fixed the dots and:
28: CSVLineSet -- Actual: false, Expected: true
Interesting. The set ran through the standard inherited XSet code without major error.
This raises an issue for me, namely whether this CSVLine set is robust enough to respond to all the various default methods. We’ll try to remember to come back to that. For now, we better do init
and hasAt
:
function CSVLineSet:init(labels, line)
local pat = '"(.-)"[,\n]'
self.labels = labels
local fields = {}
for field in line:gmatch(pat) do
table.insert(fields,field)
end
self.fields = fields
end
function CSVLineSet:hasAt(e,s)
local index = self.labels[s]
if index == nil then return false end
return self.fields[index] == e
end
That’s a lot of code to just be blurting out. Probably something bad will happen. Test.
28: CSVLineSet -- CSVset:99: attempt to index a nil value (field 'labels')
Labels is nil. That’s because nameToIndex
here is nil:
function CSVSet:createLineSet(line)
return CSVLineSet(self.nameToIndex, line)
end
The test told us. But it could have been more clear. Anyway …
function CSVSet:init(labels,data)
self.labels = labels
self.nameToIndex = self:makeLabels(labels)
self.data = data
end
I think I don’t like the naming there.
The tests all run. Commit: CSVLineSet passes hasAt test.
I think the next thing is to try it here:
function CSVSet:elements()
-- return an iterator for contents
local pat = ".-\n"
return coroutine.wrap(function()
for line in self.data:gmatch(pat) do
local converted = self:convertToXSet(line)
coroutine.yield(converted,NULL)
end
end)
end
That becomes:
function CSVSet:elements()
-- return an iterator for contents
local pat = ".-\n"
return coroutine.wrap(function()
for line in self.data:gmatch(pat) do
local converted = self:createLineSet(line)
coroutine.yield(converted,NULL)
end
end)
end
That should result in restrict using our new set. This will be interesting.
24: Time Restricts -- XSet:74: XSet:52: bad argument #1 to 'for iterator' (table expected, got nil)
Interesting message. I’m not sure just what’s up but I suspect it’s that we don’t have elements
implemented on our CSVLineSet.
function CSVLineSet:elements()
return coroutine.wrap(function()
for scope,index in self.labels do
yield(self.fields[index], scope)
end
end)
end
I really should have written a test for that. Let’s run and see what happens now.
24: Time Restricts -- XSet:74: CSVset:101: attempt to call a table value
Forgot pairs.
function CSVLineSet:elements()
return coroutine.wrap(function()
for scope,index in pairs(self.labels) do
yield(self.fields[index], scope)
end
end)
end
This will return the fields of the CSVLineSet in a random order. We might not like that. But tests shouldn’t care. I kind of expect a clean run now. Yeah, right, forgot coroutine
:
function CSVLineSet:elements()
return coroutine.wrap(function()
for scope,index in pairs(self.labels) do
coroutine.yield(self.fields[index], scope)
end
end)
end
I told me I should have tested that directly. Works. Let’s see some numbers:
The old numbers are:
10 csv 0.26193499565125
10 long 0.15184116363525
The new numbers are:
10 csv 0.26467084884644
10 long 0.13110280036926
A small but noticeable improvement.
Let’s review what the tests are.
The word “long” should be “base”. The base is faster than it was, and faster than the specialized restrict we wrote. I think we can write off the old csvRestrict
entirely.
If I remove the specialized version, some tests fail. I’ll remove those as well.
Commit: new CSVLineSet speeds up restrict a bit. Specialized csvRestrict and tests removed. Timing tests removed.
Let’s sit back and sum up.
RetroSummary
So. That went well and was a small improvement. What are some issues?
- Timing Suite
- I think we have a case for a fairly comprehensive suite of timing tests for this thing, since part of the Big Story is that by using different representations of the information we can get high performance without changing the basic theory.
- Refactoring Needed
- The current implementation of the new CSVLineSet is a bit rough, and both it and CSVSet should have a bit of cleanup done to them. This isn’t technical debt: it’s code that’s not as good as it could be. I think we’ll defer this until next time, however, as I’m tired, and we have some other thinking to do.
- Approaches
- Bill suggested a number of approaches that one could take in building a system like this. A, just let everything call
hasAt
andelements
and leave it up to the compiler. B, hand-code certain overrides for efficiency. (That’s what I did withcsvRestrict
. C, rather than hand-coding, treat it as a refactoring problem, transforming the code based on what the code says and whatever cleverness one might have. D, represent the actions to be performed in some kind of expression tree and write smart transformations on that tree based on the situation and then execute the transformed tree. -
C is a good wakeup call for me, because generally speaking I think refactoring is better than rewriting. I suspect that in the current situation you’d do something like bring the old algorithm down to the subclass and refactor it there, leaving the canonical one up top for whatever cases don’t have refactored ones. I’d have to think about that, and Bill may have seen something that I don’t.
-
D is of course the big win if one can do it. Someone expresses a query that comes down to some set expression and then you plug in substitutions and refactor it with some magic automatic routines and Voila! you have a better way of getting the right answer. We’ll have to think about that. We might be able to at least demonstrate the principle.
-
Bill correctly observed that the “point” of XST is the ability to describe all the layers in the same (mathematical) language, so it makes some sense that you could come down to different final set expressions depending on the data you have.
-
More thought needed here, but I’d like to try it.
- More Set Operations
- I actually did some rough set theoretic calculations in thinking about the CSVLineSet. I’ve been reading the articles I can find on line and discovered some relatively newly defined operations that made some sense to me. I’ll spare you the math, but here are a couple of pages of my scribbles for your amusement.
-
I don’t claim that those scribbles quite span the gap between my CSVLineSet and pure mathematics, but they are close. In particular, the rho operator (looks like a P in my scribbles) is an interesting one that, given a scope, returns the unique element at that scope if there is one. Seemed like what I was thinking of doing with the CSVLineSet.
- Better Base Set Structure
- The structure of the CSVLineSet is quite nice if you happen to have named elements, one for each name, either a tuple or whatever you’d call the condition that
S:hasAt(x,y) and S:hasAt(z,y) iff x==z
: This makes me want to make that the assumed structure for every set, and then somehow to morph to something more complex if someone gives you another element under the same scope. The base structure would be even simpler than the CSVLineSet, in that we haven’t hooked the CSVLineSet’s elements to the scopes. Arguably we should do that but the most common usage is only to check one or a few scopes from the many, so it seemed best not to create the full structure.
“Bottom” Line …
A bit of progress, quite some interesting learning, and even more interesting speculation. At least interesting to me. I hope that my reader agrees.
See you next time!