XST 31: Relearning
I’m on a path to make ordinary tables behave like XSets. But first, I have to figure out how this thing actually works! Much musing, then some code.
I’ve realized that if I want to be able to have a Lua table act like an XSet, it will probably have to appear as the data
element of an XSet, and then behave like any other object that can appear there. At this writing, I’m not sure whether that’s possible. The main reason I’m not sure is that I don’t quite have in my head how this thing works.
So before I begin my morning ablutions, I want to refresh my understanding. So let’s take a look.
I am surprised. I thought that I had implemented a powerful set operation, somewhere around the “restrict” family, that found and used indexes if they exist. That is not the case. I have a lot of progress toward that goal, but diverted over to do summing. The Product Owner part of my head deemed that more important than clever use of indexes.
We do have a couple of set operations that can be implemented in more than one way. There’s card
, that returns the cardinality of a set, i.e. the number of elements in the set.
function XSet:card()
return self.data:card() or self:reduce(0, function(r,e,s) return r+1 end)
end
That calls card() on the data element of the XSet, which does this:
function XData:card()
return nil
end
But XData has a subclass, CSVSet, that does it this way:
function CSVSet:card()
local pat = ".-\n"
local rep = "X"
local nls = self.data:gsub(pat,rep)
return nls:len()
end
The CSVSet’s data is a giant string with newlines. This code converts any string ending in a newline, to an X and then gets the length of the resulting string.
This version is called because CSVSet inherits from XData and therefore can override the XData version above.
CSVSet = class(XData)
However, the more I dig in here, the less I like it, because of this: XData inherits from XSet.
XData = class(XSet)
That seems recursive, and I am reminded that I did get a stack overflow the other day until I overrode some method. What breaks if we make XData not do that?
Eight tests fail, out of forty-eight. Not bad, ship it. (Noooo!)
34: People Tuple is XSet -- Actual: false, Expected: true
34: People Tuple -- Tests:470: attempt to call a nil value (method 'restrict')
5: Grouping -- TestSum:62: attempt to call a nil value (method 'stats')
6: Two Levels of Grouping -- TestSum:72: attempt to call a nil value (method 'stats')
7: No Grouping -- TestSum:84: attempt to call a nil value (method 'stats')
6: indexedRestrict -- XData:66: attempt to call a nil value (method 'scopeRestrict')
2: XSet hierarchy d xset -- Actual: false, Expected: true
2: XSet hierarchy t xset -- Actual: false, Expected: true
The grouping ones are failing because Tuple
no longer understands stats
, because Tuple inherits from XData and no longer understands stats. In XSet, stats is this, just deferring to XStatistics class:
function XSet:stats(controlTable)
local stat = XStatistics(self,controlTable)
return stat:getStatistics()
end
I’m sure we’re going to revert here, so let’s see about implementing that on Tuple, to see if that makes those errors go away.
No. It then fails looking for reduce
. It is trying to run setops on Tuple as if it were an XSet. So this tells me that XData must inherit from XSet, unless we want to duplicate all the set operations on XData. That makes no discernible sense to me.
I must pop up a level and digress.
This Is Not Good
What we see here is not good, not because the code doesn’t work. As originally written, it does work. What is not good is that I don’t understand how it works. I have built a model–a program seen as an assembly of things–that I do not understand.
If I don’t understand it, I’ll have trouble changing it. Now, the program is modular, so by and large, I can in fact change it. But I’ve been working toward a situation where the program can “discover” that it has a helping structure, like an index, and use that index, automatically.
Ideally, that would happen either at a high level that doesn’t yet exist, a level that has an overview of the operations about to be done, and that refactors them, or it would happen at a very low level, ideally just by overriding a method somewhere.
We have the latter case in card
but it only works because XSet knows to do this:
function XSet:card()
return self.data:card() or self:reduce(0, function(r,e,s) return r+1 end)
end
There’s an agreement between XSet and whatever is in its data
member, that the member will understand card
and either return a number or nil.
We could do that everywhere. If there’s a special way to do, say, restrict
, we’d write something like:
function XSet:restrict(B)
return self.data:restrict(B) or self:select(function(a,s)
return B:exists(function(b,_ignored)
return b:isSubset(a)
end)
end)
end
That’s not horrible, and it would become a standard pattern of implementation.
It does mean that anything subject to low-level optimization must be implemented in XSet, in XData (returning nil) and in any class that has a non-standard way of doing the thing.
There is another way and we have the case:
function XSet:selectFieldValue(scope,value, index)
local ix = self:findSelectIndex(scope,value,index)
if ix then
print("used index")
return self:scopeRestrict(ix)
else
return self:select(function(record, recordScope)
return record:hasAt(value,scope)
end)
end
end
This set operation tries to find an index and if it finds it, does the transformation of the operation into the easier form. But this operation is not really solving the problem yet: it has to e passed the index if it is to use it.
OK. Maybe this isn’t as “Not Good” as I initially thought, but it’s still weird. I think …
There Might Be a Design Error Here
The impetus behind the XSet/XData split, if I recall, was to allow the general long-form set operations, like isSubset
and restrict
, to reside in XSet, and to be based (mostly) on the two notions of hasAt
, and elements
.
Then whatever forms our set data might take, they would only have to implement hasAt
and elements
and they’d be good to go. If they wanted to implement other operations, that was an issue yet to be resolved, but we have two partial solutions in place, giving us some confidence that we can do more.
However.
An XSet contains as its data
member, an XSet. I can think of no universe in which a recursive data structure is an easy to understand good idea. The only case I know of offhand is the Smalltalk class hierarchy, which has a patch in it that makes everything work and that leaves almost everyone who looks at it confused.
Today we have five classes that inherit from XData: XExpression, XGeneric, Tuple, CSVSet, and CSVLineSet.
- XData itself is abstract: there are no instances of it.
- XGeneric is our bottom-level general set storage, built of tables.
- Tuple is a special-purpose set indexed by integers.
- XExpression is a start at a smart data set, with a DATA component, and someday but not yet, other components such as indexes.
- CSVSet and CSVLineSet are smart sets that understand CSV data and make it usable as sets.
Again, the motivation for this breakout was to get to a situation where the set operations were in XSet and the data access was in XData subclasses.
I think one mistake is that Tuples are instantiated directly. If they are XData children, XSets should contain Tuples, not be Tuples. As things stand, Tuple has special implementations of card
and elementProject
as well as hasAt
and elements
. We could remove its requirement to inherit from XSet to get the rest of the operations, using one of the approaches I sketched above. Then we’d be left with our other current error list to resolve.
Fixing Tuple would fix five of our eight errors. What are the other three?
The indexedRestrict one is also due to Tuple.
The last two, the xset hierarchy
tests, would no longer be needed: I wrote those when I was debugging this messy hierarchy, to be sure that things were inheriting as I intended.
I’m Glad We Had This Little Chat
OK. I can make this program easier to understand by removing XData’s inheriting from XSet and pushing Tuple back inside where it belongs. That will provide some new understanding of how to do smart sets, because Tuple is a little bit smart.
We’ll have let the doors open for optimization at the data level or above set operations, as well as within set operations, so no harm done.
And … there’s a chance that I’ll be able to understand this thing better.
Oddly, this all came about thinking about how to make Lua tables act like sets. We’ll have a better handle on doing that as well.
We have an odd situation here. The program works: all the tests run. It has been easy to enhance, so far. New set operations have gone in easily, and we have two examples of special set formats, and at least a foothold for a smarter data structure.
However, as I’ve demonstrated to my dismay this morning, I don’t understand the program very well. The “root cause” seems to be the fact that it combines “is_a” set with “has_a” set, instead of separating “is_a” and “has_a” as one was taught to do back in preschool.
So although the program has been easy to change so far, I am discovering that the supposedly simple change I had in mind, to make Lua tables work like sets, is not easy–because I don’t have the design clear in my head, and because with that recursion in there, I probably can’t get it clear in my head.
I’m pretty sure I can make it work. I think it will be tricky but possible. But I don’t know I can make it work and as I think about what the new metatables will have to do, I can’t see it. I’ll have to write it and then make it work.
This is not the way.
Make the Change Easy, Then Make the Easy Change
That heading is something that the great Kent Beck has said. When we undertake a change that seems difficult, we first change the program to make the change easy, and then, now that it’s easy, we do it.
I think we have to do that here. We need to refactor to a better design. One question is “which better design?”, and a prior question probably is “what even would be a better design?”. I’m going to resort to drawing some pictures.
I see two basic ways to go. One of them is a move backward and one is a different kind of move backward. Let me explain. No, there is too much. Let me sum up.
One possibility is to have an abstract superclass, XSet, which implements concrete base implementations of the set operations, relying on hasAt
and elements
. Those two methods will be abstract (and throw an error) in XSet. Then we implement specific storage structures as subclasses of XSet, required to implement hasAt
and elements
. These subclasses, if they happen to know a better way to do any other set operation, just override it.
Some people would tell us that that’s not a good design. I don’t understand their argument well enough to make it here, and therefore can’t refute it. It doesn’t seem bad to me, and if it turned out to be, well, we could change it.
The other possibility is to continue along the current path. I’ll provide details in a moment I think the original rationale for the current design was a desire to separate “set theory” from “storage”, in part because Dave Childs said something decades ago that made me think that was a good idea. (And I think it is, in principle. We could have fast storage-oriented operations and use them in our higher-level operations. Or so it seems.)
I’ll call these two ideas Combined, and Separate, respectively.
In favor of Separate is the fact that we have three special storage formats already, Tuple, CSVSet and CSVLineSet. Tuple relies on inheriting operations from XSet, but that is a glitch due to the fact that I allowed Tuple to show up as an XSet, not just as a storage type inside XSet. With some effort, I can fix that, and move it inside.
If we do Separate, we’re essentially sticking with the current design and fixing a smallish design error where we materialized Tuple erroneously. We’ll be faced with finding ways to implement operations that are specialized to storage types, but we have two or more examples already of how to do it. We can work with those until we decide which we like, decide to allow both, or get a better idea.
In favor of Combined is that it seems inherently simpler: there are just a bunch of different kinds of XSets, and specialized operations are just a matter of overriding an implementation higher up.
In favor of Separate is that has_a
is inherently simpler than is_a
. It requires less thought to unwind a hierarchy when each kind of data set is its own deal. We’d probably keep XData abstract so that the subclasses have to implement hasAt
and elements
.
Also in favor of Separate is that we really just have to fix up Tuple and then we can remove the inheritance of XData from XSet and we’re back to a structure that I can understand.
OK. I’ve almost decided. I think I should be sticking with Separate. It has not been shown to be bad, except for the patch between XData and XSet. We’ll undo that and make everything work again. That’s almost all a matter of fixing Tuple to be a storage type, not a set type.
However, there is a sort of compromise solution. Instead of treating Tuple as a kind of storage, we could make it a subclass of XSet (which we’d probably promote to an abstract class, and have a second subclass for the various data type things. I think that’s just weird.
There is a good argument in favor of making Tuple a subclass of XSet. A Tuple is in fact a kind of Set, with the particular property that its scopes are integers in a compact set starting with 1. {a1,b2,c3} you get the idea.
It might happen that a Tuple has a different storage implementation, but it is also potentially important that it could have different high-level implementations of some set operations, just because it’s a tuple and therefore some things are easier and others we know offhand will return nothing.
I have a bit of a bad reaction to this idea, because it feels kind of hybrid to me. Partly things are broken out by storage, but there’s this other breakout by set theoretic properties.
In addition, it’s more work. I’ll have to make XSet purely abstract and then have two concrete subclasses (so far), Tuple and XSetWithData or something.
Decision For Now
For now, I’ll stick with Separate. The goal is to make XData not inherit from XSet, and to make tuples work as a storage type, keeping them under XData. That will require a new creation method but no big deal.
Let’s just dive in and do it. I’m n a clean commit except that I’ve just changed XData not to inherit:
XData = class()
We have those broken tests, which I’ll put here so that I can easily refer to them.
34: People Tuple is XSet -- Actual: false, Expected: true
34: People Tuple -- Tests:470: attempt to call a nil value (method 'restrict')
5: Grouping -- TestSum:62: attempt to call a nil value (method 'stats')
6: Two Levels of Grouping -- TestSum:72: attempt to call a nil value (method 'stats')
7: No Grouping -- TestSum:84: attempt to call a nil value (method 'stats')
6: indexedRestrict -- XData:66: attempt to call a nil value (method 'scopeRestrict')
2: XSet hierarchy d xset -- Actual: false, Expected: true
2: XSet hierarchy t xset -- Actual: false, Expected: true
The hierarchy test is this and probably should be deleted.
_:test("XSet hierarchy", function()
local save = XData.init
XData.init = function() end
local x = XSet()
local d = XData()
local t = Tuple()
XData.init=save
_:expect(x:is_a(XSet)).is(true)
_:expect(d:is_a(XData)).is(true)
_:expect(d:is_a(XSet),"d xset").is(true)
_:expect(t:is_a(Tuple),"t tuple").is(true)
_:expect(t:is_a(XData),"t data").is(true)
_:expect(t:is_a(XSet),"t xset").is(true)
end)
Either that or I can rewrite it but it’s basically just not something you’d normally test. Still, it expresses something so let’s change it:
_:expect(x:is_a(XSet)).is(true)
_:expect(d:is_a(XData)).is(true)
_:expect(d:is_a(XSet),"d xset").is(false)
_:expect(t:is_a(Tuple),"t tuple").is(true)
_:expect(t:is_a(XData),"t data").is(true)
_:expect(t:is_a(XSet),"t xset").is(false)
Runs now. Two down, six to go. Not very interesting but it’s a start.
Now it gets harder. We want for Tuple to behave like a regular data element in XSet. First let’s find the creators and fix them.
Here’s something I wanted to be able to do:
_:test("Tuple", function()
local set = XSet():addAt("Jeffries","last")
local equ = XSet():addAt("Jeffries","last")
local T = Tuple():add("a"):add("b"):add("c"):add("d"):add(set)
_:expect(T:card()).is(5)
I wanted to be able to just add the elements, instead of using addAt
.
I also wanted this shorthand:
local T = Tuple{"a",x=5, "b",z={},"c","d"}
_:expect(T:card()).is(4)
What that does is just throw away the x and z elements. Not very interesting. Arguably it should have errored, saying “this can’t be a tuple”. I think I was just trying to make writing the tests (and actual user code) easier.
Then there’s this:
local csv = XSet:on(CSVSet(CSVnames, CSVdata))
local PT = Tuple()
for e,s in csv:elements() do
PT:add(e)
end
_:expect(PT:card()).is(500)
Here, I created a 500 element tuple of CSVLineSets, the XSet that is a CSV line (plus its name map).
In some ideal world, that would be done with some kind of set operation like asTuple
, that returns a tuple given a set, by iterating it and replacing the scopes with integers. I am pretty sure that’s not a unique mapping, since the scope to element mapping is not defined, but the result would be one of n factorial possible result tuples. How nice.
I think I’m going to need o devise a new creation interface for the Tuple, but let’s just start out and see what happens.
I’m going to call for XSet:tuple(aTable)
, which will work like Tuple’s creation method does now. And I think what I’ll probably do is create a new class, XTupleData to serve as the data holder. I’ll leave that open until I get a bit further. Let’s change this test:
_:test("Tuple Shorthand", function()
local T = Tuple{"a",x=5, "b",z={},"c","d"}
_:expect(T:card()).is(4)
local c = T:elementProject(3)
_:expect(c).is("c")
_:expect(T:elementProject(4)).is("d")
_:expect(T:hasAt("b",2)).is(true)
_:expect(T:hasAt("c",2)).is(false)
end)
To this:
_:test("Tuple Shorthand", function()
local T = XSet:tuple{"a",x=5, "b",z={},"c","d"}
_:expect(T:card()).is(4)
local c = T:elementProject(3)
_:expect(c).is("c")
_:expect(T:elementProject(4)).is("d")
_:expect(T:hasAt("b",2)).is(true)
_:expect(T:hasAt("c",2)).is(false)
end)
This test is really testing the special elementProject in Tuple. We’ll have to figure that out along the way. For now, let’s just create one. Test should fail looking for tuple method.
33: Tuple Shorthand -- Tests:449: attempt to call a nil value (method 'tuple')
Implement:
function XSet:tuple(tab)
local result = XSet()
result.data = Tuple(tab)
return result
end
I’ll start by creating a Tuple and putting it in an XSet’s data element. Let’s see what the test says now.
That one passed. Now I have this:
34: People Tuple is XSet -- Actual: false, Expected: true
That’s just wrong. What’s the test?
No, Wait, let’s rename Tuple to XTuple. That’ll move us along a bit. Run tests again.
A lot of folks calling Tuple. I’ll fix them all as best I can.
_:test("Tuple", function()
local set = XSet():addAt("Jeffries","last")
local equ = XSet():addAt("Jeffries","last")
local T = XSet:tuple():add("a"):add("b"):add("c"):add("d"):add(set)
_:expect(T:card()).is(5)
...
_:test("People Tuple", function()
local csv = XSet:on(CSVSet(CSVnames, CSVdata))
local PT = XSet:tuple()
for e,s in csv:elements() do
PT:add(e)
end
_:expect(PT:card()).is(500)
...
and so on ...
Now I have two errors looking for ‘add, and a stack trace that I've not looked at yet. I'll replace as many
adds` with direct array input where I can.
_:test("Tuple", function()
local set = XSet():addAt("Jeffries","last")
local equ = XSet():addAt("Jeffries","last")
local T = XSet:tuple{"a","b","c","d",set}
_:expect(T:card()).is(5)
That fixes that test. It actually seems to run. Now the other is harder to do unless I just cheat entirely:
_:test("People Tuple", function()
local csv = XSet:on(CSVSet(CSVnames, CSVdata))
local tab = {}
for e,s in csv:elements() do
table.insert(tab,e)
end
local PT = XSet:tuple(tab)
_:expect(PT:card()).is(500)
OK, for now, I just create a simple table and then pull it in as a tuple. A bit nasty, duplicates effort but it’ll do for now. Test.
I am surprised. The tests all run. We are green. Everything works again.
Of course, none of Tuple’s special methods are being used any more, other than perhaps card.
I’ll put traps in all of them just to see.
function XTuple:add(element)
print("add")
table.insert(self.contents, element)
return self
end
function XTuple:card()
print("card")
return #self.contents
end
function XTuple:elementProject(scope)
print("eP")
return self.contents[scope]
end
We do print “card” a few times. That’s because XSet inquires whether you know how to do it:
function XSet:card()
return self.data:card() or self:reduce(0, function(r,e,s) return r+1 end)
end
We can do the same for elementProject
. Let’s do. It is a multiple-step process:
- Implement the special version on the set with special support, XTuple in this case;
- Implement a version returning nil on XData;
- Call the method on
self.data
and use it if non-nil.
function XTuple:elementProject(scope)
print("eP")
return self.contents[scope]
end
function XData:elementProject()
return nil
end
And finally:
function XSet:elementProject(scope)
local fast = self.data:elementProject(scope)
if fast then return fast end
local found = 0
local result = nil
for e,s in self:elements() do
if s==scope then
result = e
found = found + 1
end
end
return found==1 and result or nil
end
This is a bit nasty because of the if. Refactor:
function XSet:elementProject(scope)
return self.data:elementProject(scope) or self:elementProjectLong(scope)
end
function XSet:elementProjectLong(scope)
local found = 0
local result = nil
for e,s in self:elements() do
if s==scope then
result = e
found = found + 1
end
end
return found==1 and result or nil
end
Nicer. Now to remove those prints.
Tests are green. Commit: XData no longer inherits from XSet. Tuple replaced with XTuple as a data
possibility in XSet.
I’m not fond of the name elementProjectLong. It’s really something like generalElementProject. And as a rule, we’ll probably always have one of those that the method that can be optimized uses as its default. Maybe it should be named default.
Maybe they should all be named defaultXYZ
and used only if data doesn’t understand the method. I’m not sure quite how to do that with readable code and reasonable efficiency. There may be some horrid metatable way to do it.
The general pattern, though is something like this:
function XSet:something(...)
return self.data:something(...) or self:defaultSomething(...)
end
It’s a bit irritating that we can’t just plunk in a faster method on one of our XData subclasses and have it picked up automatically. We have to protect it in XData with a nil, and then call it in the above special way. Three steps that would ideally just be one.
We’ll leave that concern in hopes of a better idea later. For now, we can do what we need to do, and the system is a bit more understandable. That should set us up for “tables as sets” next time.
Let’s sum up.
Summary
We got here because I realized that I didn’t understand how the system worked well enough to see how to plug in tables as a kind of set storage. It wasn’t even clear to me if it was “tables as set storage” or “tables as sets”, or what the difference might be.
Refreshing the code in my mind made it clear to me that having XSet have an element, data
that was also an XSet, adn that inherited from XSet, was Not A Good Thing.
One possibility was to make everything subclasses of XSet, the other was to continue the “set vs data” distinction. I decided to keep that distinction, despite a bit less convenience in implementing specialized set operations. This was not a carefully calculated balance sheet kind of decision. I thought about it, laid out some pros and cons, and then chose a direction.
So far that direction is good. The code is green, it’s certainly less weirdly connected, and if there’s any more complexity it is minor. It has the advantage that I can now say pretty clearly how it works:
XSet implements general set operations on extended sets, which can be stored in a variety of ways. XSet has no subclasses. It has one data element,
data
, which must implement at leasthasAt
andelements
, to answer whether a given element-scope combination is in the set, and to produce all the elements via an iterator.
The
data
element can be any of a number of subclasses of the abstract object XData. Subclasses include XGeneric, XExpression, XTuple, CSVSet and CSVLineSet. (These names do not seem as consistent as they might.)
The data elements implement
hasAt
andelements
, providing basic access to their contents in extended set terms, and can optionally implement specialized versions of set operations, typically for efficiency. Doing so requires cooperation and coordination between the code in XSet, XData, and the specific XData subclass.
I think we’ll find that this simpler design makes it easier to talk about how using tables more directly as XSets might be implemented, and easier to choose among alternative approaches. That remains to be seen, but at least my mind doesn’t get a stack overflow now when I think about it.
To me, it’s clear that it’s not enough that the program seems to work. It needs to be organized so that we can understand it. If not, we will often find that we can’t readily change it.
We’l find out more, next time. I hope you will come along.
Until then, be good to each other.