I was puzzling over an issue with ‘union’ and gained an insight that either I’ve never had, or that I had lost. Whee!

It was like this. I was thinking about the union set operation. As you probably know, the union of A and B is the set containing all the elements of A and all the elements of B, and no other elements. But there’s an issue. Given

A = {a,b,c}
B = {c,d,e}

The union is, of course {a,b,c,d,e}. So, it appears, we can’t just append the two sets together, we have to do something about that potentially duplicate ‘c’. Or do we? In set theory:

{a,b,c,d,e} = {a,b,c,c,d,e}

It isn’t that there are no duplicates in sets. Formally speaking, it is that we can’t tell if there are duplicates or not, because the only question we can ask of a set is “is c an element of you”, and it can only answer Yes or No, not Hell Yes!

You may remember in an earlier article, I mentioned the need to remove duplicates, and that we might have to “weed” the set to remove them. I think we all “know what we mean” by that. But formally, in set theory, there is no universally applicable way to say that.

Dave Childs often wrote of two or more models in a set theoretic information system: an information model, and a storage model. (He also wrote of a disk model, for that matter.) Dave, of course, insisted that both the information model and the storage model should be set-theoretic in nature. If they are, we get all the advantages of set theory –whatever those are–at all the various modeling levels.

Now what we have today, in our little XST program, is an information model that is, I think, set theoretic in nature. (There are some hidden assumptions about what kind of sets we can build that will work, and what kind may not work. I’m not sure whether the current operations will work well on highly nested sets.)

Be that as it may, our storage model is clearly not set theoretic in nature. Our storage model is a Lua table of scopes, with each scope containing a table where the key is an element and the value is true. We use this to try to ensure that we can’t ever build a set that looks like {a,b,c,c,d,e}, because that structure won’t allow two c to exist. There either is one, or there isn’t.

However, if we try to use that structure to contain sets, the duplicate suppression won’t work, because two copies of the set {a,b,c] are two separate XSets and therefore they have different keys, and therefore there could be two of them in there.

I believe that I’ve correctly written the operations to deal with that, because they ask questions like “does there exists a set equal to this one”, and we deal with that explicitly:

Well. On second thought: I went to look at some code to show at this point and found this:

function XSet:union(anXSet)
    local result = XSet()
    local f = function(element, scope) result:addAt(element,scope) end
    self:iterate(f)
    anXSet:iterate(f)
    return result:lock()
end

That rather clearly just adds all the elements of each, unless addAt is helping us out.

function XSet:addAt(element, scope)
    if self.locked then error("set is locked") end
    if not self.contents[scope] then
        self.contents[scope] = {}
    end
    self.contents[scope][element] = true
    return self
end

If ‘element` is a set, then two sets of equal contents, which are not the same set, should both go in there. Let’s test that.

        _:test("Curly Union", function()
            local ron = XSet()
            ron:addAt("Jeffries","last")
            ron:addAt("Ron","first")
            ron:addAt("Pinckney","city")
            local ron2 = XSet()
            ron2:addAt("Jeffries","last")
            ron2:addAt("Ron","first")
            ron2:addAt("Pinckney","city")
            local recs1 = XSet():addAt(ron,NULL)
            local recs2 = XSet():addAt(ron2,NULL)
            local un = recs1:union(recs2)
            _:expect(un:card()).is(1)
        end)

Here we create two records that are equal. Better compare them to show that:

        _:test("Curly Union", function()
            local ron = XSet()
            ron:addAt("Jeffries","last")
            ron:addAt("Ron","first")
            ron:addAt("Pinckney","city")
            local ron2 = XSet()
            ron2:addAt("Jeffries","last")
            ron2:addAt("Ron","first")
            ron2:addAt("Pinckney","city")
            _:expect(ron:equals(ron1),"records equal").is(true)
            local recs1 = XSet():addAt(ron,NULL)
            local recs2 = XSet():addAt(ron2,NULL)
            _:expect(recs1:equals(recs2),"sets equal").is(true)
            local un = recs1:union(recs2)
            _:expect(un:card()).is(1)
        end)

Here I’m checking the two records for equality and then the two record sets. (I could use better names here.) I get a surprise failure. I expected the calls to equals to work and the call to card to fail. But no:

7: Curly Union -- Tests:290: attempt to index a nil value (upvalue 'other')

Oh. Whew. The bad naming bit me. Fix the names.

        _:test("Curly Union", function()
            local ron1 = XSet()
            ron1:addAt("Jeffries","last")
            ron1:addAt("Ron","first")
            ron1:addAt("Pinckney","city")
            local ron2 = XSet()
            ron2:addAt("Jeffries","last")
            ron2:addAt("Ron","first")
            ron2:addAt("Pinckney","city")
            _:expect(ron1:equals(ron2),"records equal").is(true)
            local people1 = XSet():addAt(ron1,NULL)
            local people2 = XSet():addAt(ron2,NULL)
            _:expect(people1:equals(people2),"sets equal").is(true)
            local un = people1:union(people2)
            _:expect(un:card()).is(1)
        end)

Now I get what I expected to get:

7: Curly Union  -- Actual: 2, Expected: 1

The sets compare equal but their union shows cardinality of two. This could be seen as a bug in cardinality, if we wanted to allow duplicates to build up. Or it could be seen as a bug in union. I want to print the set, just to see what is in it, so I add this to the end of the test:

            for rec,_null in un:elements() do
                for e,s in rec:elements() do
                    print(e.."@"..s)
                end
            end

Sure enough, it prints my record twice. I’m sorry, I’m old, I like to be sure sometimes.

I think what I’d like to do now is to fix union. We might try this:

function XSet:union(anXSet)
    local result = XSet()
    local add = function(element, scope) result:addAt(element,scope) end
    self:iterate(add)
    local addIfNew = function(element, scope)
        if result:hasntAt(element,scope) then
            result:addAt(element,scope)
        end
    end
    anXSet:iterate(addIfNew)
    return result:lock()
end

I’ve created a new function addIfNew that only adds an element if the element isn’t already in there. If hasntAt works correctly, this should fix our test. And it does.

Now we really ought to think about what we have wrought here.

What Hath Ron Wrought?

The function addAt adds unconditionally, because when we started out, we knew that our chosen storage structure was going to deal with duplicates. (The element->true trick.) But then we got elements that are class instances, which look like tables to Codea, and which all hash to different values. That means that to Codea ron1 and ron2 have different hashes so the storage is perfectly happy to have them both.

As things stand, I’ve compensated for duplicate values in union. That means that every time duplicates are possible, we’ll need to compensate. That way lies error, and probably dragons. Surely I’ll forget.

One possibility is “just” to change addAt to check what it’s adding, and not add it if it’s already there. We know that hasAt does the right thing. Our test just now shows that it does but to tell the truth I also read it:

function XSet:hasAt(element, scope)
    local tab = self.contents[scope]
    if not tab then return false end
    if type(element) ~= "table" then
        return tab[element] == true
    else
        for e,_true in pairs(tab) do
            if type(e) == "table" and e:equals(element) then
                return true
            end
        end
        return false
    end
end

So it might be better to require addAt never to add a duplicate. We’d do this:

function XSet:addAt(element, scope)
    if self.locked then error("set is locked") end
    if self:hasAt(element,scope) then return end
    if not self.contents[scope] then
        self.contents[scope] = {}
    end
    self.contents[scope][element] = true
    return self
end

That runs all the tests, no surprise. Now I should be able to change union back:

function XSet:union(anXSet)
    local result = XSet()
    local add = function(element, scope) result:addAt(element,scope) end
    self:iterate(add)
    anXSet:iterate(add)
    return result:lock()
end

Tests all run. Commit: addAt checks for presence before adding, fixing problem where union created duplicate entries.

Looking Back

Reflecting, I’m not where I expected to be. I think we can draw a big lesson first:

No matter how carefully we estimate and predict what we’re about to do, the things we encounter along the way, in the code, will always interfere with our plans.

You can take this as an argument against fine-grain estimation if you wish, and I won’t object. You can even recursively take it as an argument against all estimation and I still won’t demur. I favor budgeting and delivering the best possible product every day until the budget runs out. How you get the budget without estimating something, I’ll leave to you.

Anyway: expect surprises is good advice, except that the surprises are still surprises. Recursion again. But I digress.

Let’s look at what has happened in terms of our information model and our storage model. Our information model is trying to be legitimately set-theoretic, at least subject to some unstated assumptions about what we can put into it. To that end, we’ve found it necessary to put some specialized code into our implementation (our storage model) to ensure that sets don’t get duplicate entries in storage, because that confuses the information model, resulting in things like sets having size 2 instead of the size 1 we expected.

That’s somewhat irritating, though I believe we’ve got the change in a safe place now. It should be impossible to build a set with “duplicate” records now. However, that comes at a cost, which is that when we build a large set of records, we’ll be searching the output set again and again to see whether it already contains an element equal to the element we’re trying to add.

This means that the cost of adding an element has gone from O(1) to O(card(set)) when we’re adding to a set of records. The cost gets worse and worse as the data grows, proportionately to the size of the data. That is not a good thing.

Consider union. The function we wrote adds all the records from the first input to the output, and then adds all the records from the second. It’s likely that the first set is large and the second is small–we’re probably adding a new employee to our existing 10,000 employee database. But the addAt function is going to check each one of the 10,000 to make sure it’s not already in the output.

We can easily go back to my first scheme, probably by providing both addAt and addIfAbsentAt, and using the first one judiciously. We might name it addAtUnsafely or something, in hopes of using it right. If performance got to be an issue, we might do that … but that is not the way.

What is the way? Honestly, I don’t exactly know, but I suspect that one good way is to have a storage model that better supports the information model. Let’s imagine one.

Suppose we chose a sort of CSV model for our records, where we just had the string

"Jeffries@last,Ron@first,Pinckney@City"

And where we (somehow) ensured that the fields were always in the same order. Then at the storage level, we could check whether our record r is in the set with a single call to string:find. It would still be O(N), but it would be a very fast O(N), because it would come down to a memcmp somewhere inside the system. It would be one or two orders of magnitude faster than checking the actual records in a loop.

We’d perhaps go so far as to include nesting:

"{
{Jeffries@last,Ron@first,Pinckney@City}@NULL
{Hendrickson@last,Chet@first,Atlanta@City}@NULL
}"

The point is that where now we loop over all the records, we could translate

aSet:hasAt(e,s)

With its long loop into a single

str = "{"..e:toCSV().."}@NULL"
return data:find(str)

And that would fly.

We can do that in any case, set theory or not. If our storage model is set theoretic at base, Dave would argue, and I would agree, things would probably be better. For example, we could implement more than just addAt and hasAt. We could implement a specialized byte-scanning version of intersect or union or restrict, set-theoretically, and then those operations would also execute at speed.

I used to refer to operations like that as “byte movers”. I think it’s clear that if we can translate our higher-level notions of set operations into lower level byte movers, we can get very high performance, just because those operations are likely 100 or more times faster than reconstituting all the records and checking them for the right elements and scopes.

The set theory would just help us get it right and to be certain that it all did the right thing.

But It’s Hard, Ain’t It Hard

Yes it is quite difficult. In particular, we have all the same ideas we had before, indexes and memory compares and hash tables, but now we also need to think in terms of formal mathematical operations and Dammit Janet, why can’t we just go ahead and code this thing up?

We always can. Note that that’s what we have right now. We have a set-theoretic topping on an ad-hoc storage model. But we can also see the need for a set theoretic storage model in the code we have.

Check out what we just wrote:

function XSet:addAt(element, scope)
    if self.locked then error("set is locked") end
    if self:hasAt(element,scope) then return end
    if not self.contents[scope] then
        self.contents[scope] = {}
    end
    self.contents[scope][element] = true
    return self
end

function XSet:hasAt(element, scope)
    local tab = self.contents[scope]
    if not tab then return false end
    if type(element) ~= "table" then
        return tab[element] == true
    else
        for e,_true in pairs(tab) do
            if type(e) == "table" and e:equals(element) then
                return true
            end
        end
        return false
    end
end

See that loop inside hasAt? That’s really nothing more or less than “there exists an element e that equals(element)”. We can’t code it using our current “exists” because the table tab isn’t a set. And we don’t have an exists method on tables. We could have, in principle. Maybe we should have.

But if instead of a table we had an s_set (a storage set), that s_set could have all kinds of nice operations attached.

So what we’re feeling is a bit of a call for a smarter storage model, one that can save us from this expensive O(N) record insertion, even if all it does is grant us a faster O(N), and ideally get us down to O(logN) or even O(1).

Will we get to a set-theoretic information model and a set-theoretic storage model? I don’t know yet. I feel that I have a clearer understanding of the desirability than I’ve had for years. That may or may not mean that I have enough understanding and enough gumption to actually do it.

We’ll find out …