Time to work on the actual restrict operator for CSV, since the pattern-maker experiment was a success.

I was doing some reading in XST last night, as one does, and learned that in some of Dave Childs’ more recent writings, he’s been thinking more about tuples and scope transformations. I rather wish that had been more explicit back when I was working with the theory.

The basic idea is simple. It has always been implicit that, often, the “scope” part of XST is used for naming, while the “element” part is used as the data. Thus we say things like:

people:addAt("Jeffries", "last")

That adds an element “Jeffries” to the set people, with scope “last”, and I hope it’s clear which we mean to be the data and which the field name. The same is true for tuples. We could imagine the string “Jeffries” as a tuple:

<J,e,f,f,r,i,e,s>

That’s exactly the same, by definition, as the set

{J^1, e^2, f^3, f^4, r^5, i^6, e^7, s^8}

(I hope I don’t have to type many more lines like that one.)
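As a quick sketch, with plain Lua tables standing in for XSets (the function name here is my own invention, not part of the XSet code), a tuple is just a set whose scopes are the character positions:

```lua
-- A minimal sketch: a scoped set as a plain Lua table mapping
-- scope -> element. tupleToSet is my name, not part of the XSet class.
local function tupleToSet(s)
    local set = {}
    for i = 1, #s do
        set[i] = s:sub(i, i)  -- element s:sub(i,i) at scope i
    end
    return set
end

local jeffries = tupleToSet("Jeffries")
print(jeffries[1], jeffries[8])  -- J	s
```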

Childs defines some operations on the scopes of sets, such as

X/s/

which, given s, say,

{ last^last_name, first^first_name }

and applied to my people sets, would rename any “last” elements to “last_name”, and “first” to “first_name”. And the operator

X\s\

would do the reverse renaming. The one operation treats the element as the old name and the scope as the new name, and the other the reverse.
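Here’s a rough sketch of the renaming operation, again with plain Lua tables standing in for scoped sets. The function name and table layout are my assumptions, not Childs’ notation:

```lua
-- A sketch of X/s/: in s, each pair maps an old scope name (the
-- element) to a new one (the scope). Plain tables stand in for XSets.
local function renameScopes(X, s)
    local result = {}
    for scope, element in pairs(X) do
        result[s[scope] or scope] = element  -- rename if s has an entry
    end
    return result
end

local person = { last = "Jeffries", first = "Ron" }
local s = { last = "last_name", first = "first_name" }
local renamed = renameScopes(person, s)
print(renamed.last_name, renamed.first_name)  -- Jeffries	Ron
```

The reverse operator X\s\ would simply apply the inverted mapping.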

I don’t know that we’ll ever implement those operators, although we might. But let’s think about what that operator, using integer values, could do to a string (tuple) representing a record.

Suppose in our person records, bytes 61-65 are the Zip Code, and suppose we want to find Zip 48169, using a restrict kind of operator. The input Zip can be seen as

<4,8,1,6,9>

or

{4^1, 8^2, 1^3, 6^4, 9^5}

If we restrict that against our person set, it won’t match anything, except any people whose last name starts with 48169. We need to “move it over”. So we can apply our scope transformation:

st = { 1^61, 2^62, 3^63, 4^64, 5^65 }

And now the restrict is lined up with the person records and will select all the Zip 48169 people.
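As a sketch in Lua (the record layout and all the names here are illustrative assumptions of mine), the shifted scopes tell us which record byte to compare with which input byte:

```lua
-- st: element (input position) paired with scope (record position).
local st = { [1] = 61, [2] = 62, [3] = 63, [4] = 64, [5] = 65 }

-- Compare the byte at each element position of the input with the
-- byte at the corresponding scope position of the record.
local function matchesAt(record, input, st)
    for i, pos in pairs(st) do
        if record:sub(pos, pos) ~= input:sub(i, i) then
            return false
        end
    end
    return true
end

-- A fake record: 60 filler bytes, then the zip in bytes 61-65.
local record = string.rep("x", 60) .. "48169"
print(matchesAt(record, "48169", st))  -- true
-- which, of course, collapses to a single range comparison:
print(record:sub(61, 65) == "48169")   -- true
```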

Now, very likely, this is just math, and we would think about how to actually do the work in code, but now we have a very solid spec that says something like:

Given a person record with field “zip” at characters 61-65, and given a search value of “48169” in memory at x, we want to compare the byte at the element part of st in the search value with the scope part of st in the person value.

In other words, compare bytes 1-5 of zip with bytes 61-65 of person.

In other words,

memcmp(input, rec+61, 5)

Now that’s a long way to go to do that. Many readers, and this author, were surely thinking “just compare the input field with bytes 61-65 of the record and we can go for coffee”. But if we were all chock-full of set theory, we’d have a few advantages:

  • We have a solid theoretic reason why that’s the right thing to do;
  • We have a formal mechanism for doing it that, to an accomplished reader, is clearly true;
  • We have defined an operation that computes the inputs to a very fast operation in a general way.

Now that’s nothing we couldn’t have done before. What XST offers is, at best, a more formal, provable, consistent way of defining that operation and all the operations we might ever want to do.

Whether that is something worth buying is up to the buyer. My own view is that if it were more accessible, it would be of great value to the few people in the world who are writing programs like Oracle or NoSQL. And, since it isn’t very accessible, it’s not even worth that much to them.

I don’t know whether XST can be made accessible to the experts who would benefit from it. I do know that to date, it has generally not been. It’s at least as complex and complicated as programming, it’s nearly impossible to type correctly, and I know a lot of very good programmers who are not particularly good at abstract math.

But it is nearly accessible to me, and it’s fun to program and explore, and in particular, discovering these tuple and scope operations makes me feel good, because a) I’ve always felt that “byte-movers” were the right lower-level target for set operations and b) it is a decent match for what we’re doing here.

So let’s get to it.

Moving Toward Restrict

I chose the more vague title for this section because my experience is that these things take longer than I expect, and perhaps setting my expectations lower will help. We can hardly fail to move toward restrict. I think we’ll get there today or next time.

We have our new function for creating a match pattern that lines up a string match with our CSV records:

function matchPattern(set,names)
    local fld = '".-"'
    local sep = '[,\n]'
    local matches = {}
    for e,s in set:elements() do
        matches[s] = '"'..e..'"'
    end
    local result = ""
    for _i,name in ipairs(names) do
        result = result..(matches[name] or fld)..sep
    end
    return result
end

That pattern, derived at great personal cost and anguish, matches either any quoted string plus its separator, using

'".-"[,\n]'

or, where we have data to match, something like

'"MI"[,\n]'

I’ve tested my hand-crafted pattern for matching as I intend, but let’s extend the test that creates the string from the tables to make sure it can match a record:

        _:test("restrict match pattern", function()
            local result
            local fld = '".-"'
            local sep = '[,\n]'
            local correctPat = fld..sep..fld..sep..'"Pinckney"'..sep..fld..sep
            local names = {"last","first","city","state"}
            local set = XSet()
            set:addAt("Pinckney", "city")
            local testPat = matchPattern(set, names)
            _:expect(testPat).is(correctPat)
            local data    = '"Jeffries","Ron","Pinckney","MI"\n'
            local databad = '"Jeffries","Ron","Punknord","MI"\n'
            result = data:match(testPat)
            _:expect(result).is(data)
            result = databad:match(testPat)
            _:expect(result).is(nil)
        end)

Test passes. Commit: matchPattern output works as record matcher.

Let’s write a restrict test to lead us to get things right. I’ll use the one we have, that uses the old restrict, which converts our lines into XSets, and change it to use a new restrict operator:

        _:test("CSV restrict operator", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            local brighton = XSet()
            brighton:addAt("Brighton", "city")
            local restrictor = XSet()
            restrictor:addAt(brighton,NULL)
            local restricted = csv:csvRestrict(restrictor)
            _:expect(restricted:card()).is(1)
            local record = NULL
            for e,s in restricted:elements() do
                record = e
            end
            _:expect(record:hasAt("Josephine","first_name")).is(true)
        end)

Note that I’m calling csvRestrict above. That doesn’t exist. We’re here to write it. Run the test expecting to find that it doesn’t exist.

21: CSV restrict operator -- Tests:291: attempt to call a nil value (method 'csvRestrict')

I love it when a plan comes together. Code:

function CSVSet:csvRestrict(anXSet)
    local result = XSet()
    for b,_null in anXSet:elements() do
    self:restrictOne(b, result)
    end
    return result
end

We might call this pattern “putting off the inevitable”, but I think of it as “programming by intention”. My intention is that restrictOne is a method that restricts our data against one record from the restricting set, accumulating results into result. (And I intend the result to be an XSet. I think we have the technology to accomplish that; if not, we’ll write it.)

Test will fail looking for restrictOne, I imagine.

21: CSV restrict operator -- CSVset:53: attempt to call a nil value (method 'restrictOne')

function CSVSet:restrictOne(b, result)
    local pattern = matchPattern(b, self.labels)
    for line in self.data:gmatch(pattern) do
        local converted = self:convertToXSet(line)
        result:addAt(converted,NULL)
    end
end

Huh. I rather think that should work. I’m not at all confident, but I don’t see what I’m missing. Test to find out.

21: CSV restrict operator -- Tests:331: nil scope Josephine

function XSet:addAt(element, scope)
    if self.locked then error("set is locked") end
    assert(scope~=nil,"nil scope "..tostring(element))
    if self:hasAt(element,scope) then return end
    if not self.contents[scope] then
        self.contents[scope] = {}
    end
    self.contents[scope][element] = true
    return self
end

That’s the assert above. It says we’re adding Josephine with a nil scope. It’s happening in the convertToXSet, I think.

function CSVSet:convertToXSet(line)
    local result = XSet()
    local nameCount = 1
    local pat = '"(.-)"[,\n]'
    for unquoted in line:gmatch(pat) do
        local label = self.labels[nameCount]
        result:addAt(unquoted, label)
        nameCount = nameCount + 1
    end
    return result:lock()
end

That looks OK to me unless my labels are messed up somehow. I’m going to resort to some printing.

OW. I think what I’m seeing is that we’ve returned not just Josephine’s line, but the line before it. As if our pattern is matching more lines than I intended. I don’t have a test for that.

Grr. Yes, it matched the extra line. I need a stronger pattern. Write a test. But first, a tiny break.

Break

This kind of setback is irritating and can cause stress. I just wanted to lift my head up, acknowledge that, and get back to it.

Stronger Test

        _:test("Restrict match multiple records", function()
            local result
            local names = {"last","first","city","state"}
            local set = XSet()
            set:addAt("Pinckney", "city")
            local testPat = matchPattern(set, names)
            local data    = '"Jeffries","Ron","Pinckney","MI"\n'
            local databad = '"BadLast","Bad","Punknord","NO"\n"Jeffries","Ron","Pinckney","MI"\n'
            result = databad:match(testPat)
            _:expect(result).is(data)
        end)

Here I have two lines of input: one that is not acceptable to my intended pattern, and one that is. If the error is as I expect, we’ll return both. If things work, just the latter.

As I expected (and feared)

21: Restrict match multiple records  -- Actual: "BadLast","Bad","Punknord","NO"
"Jeffries","Ron","Pinckney","MI"
, Expected: "Jeffries","Ron","Pinckney","MI"

OK, now I have to figure out … oh, I know. I’m using [,\n] as the separator for each field, because I didn’t want to know which one was last. That allows any of the field matchers, if all else fails, to expand over multiple lines.

Will it suffice to make them all commas except the last one? No, I think not. However, what if, instead of . in the pattern, we used a pattern meaning … hmm … I can’t say “not a comma”, because there can be a comma in the company name. How about “not a double quote”? Let’s try that. Here’s the current version:

function matchPattern(set,names)
    local fld = '".-"'
    local sep = '[,\n]'
    local matches = {}
    for e,s in set:elements() do
        matches[s] = '"'..e..'"'
    end
    local result = ""
    for _i,name in ipairs(names) do
        result = result..(matches[name] or fld)..sep
    end
    return result
end

Let’s try this:

function matchPattern(set,names)
    local fld = '"[^"]-"'
    local sep = '[,\n]'
    local matches = {}
    for e,s in set:elements() do
        matches[s] = '"'..e..'"'
    end
    local result = ""
    for _i,name in ipairs(names) do
        result = result..(matches[name] or fld)..sep
    end
    return result
end

That definition of fld means

a quote, followed by any number of non-quote characters, non-greedy, followed by a quote.
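A quick standalone demonstration, using a hard-coded two-line pattern of my own (not the generated one), shows why the old fld over-matched:

```lua
-- With '".-"', the lazy .- can cross quote boundaries, so a single
-- field matcher can swallow the rest of one line and part of the next.
-- With '"[^"]-"', a field can never contain a quote, so it cannot.
local data = '"BadLast","Bad","Punknord","NO"\n"Jeffries","Ron","Pinckney","MI"\n'

local loose  = '".-"[,\n]".-"[,\n]"Pinckney"[,\n]".-"[,\n]'
local strict = '"[^"]-"[,\n]"[^"]-"[,\n]"Pinckney"[,\n]"[^"]-"[,\n]'

print(data:match(loose))   -- matches both lines
print(data:match(strict))  -- matches only the Jeffries line
```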

That pattern satisfies all my tests. Including this one:

        _:test("CSV restrict operator", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            local brighton = XSet()
            brighton:addAt("Brighton", "city")
            local restrictor = XSet()
            restrictor:addAt(brighton,NULL)
            local restricted = csv:csvRestrict(restrictor)
            _:expect(restricted:card()).is(1)
            local record = NULL
            for e,s in restricted:elements() do
                record = e
            end
            _:expect(record:hasAt("Josephine","first_name")).is(true)
        end)

In short, the restrict worked. YAY!!! Let’s try the big set now.

        _:test("CSV large restrict operator", function()
            local csv = CSVSet(CSVnames, CSVdata)
            local mi = XSet()
            mi:addAt("MI", "state")
            local restrictor = XSet()
            restrictor:addAt(mi,NULL)
            local restricted = csv:csvRestrict(restrictor)
            _:expect(restricted:card()).is(14)
            local record = NULL
            for e,s in restricted:elements() do
                record = e
            end
            print("restrict", record:display())
            _:expect(record:hasAt("Josephine","first_name")).is(true)
        end)

The test nearly runs. We get the correct cardinality, 14. But sometimes we find Josephine, sometimes we do not. Why? Because XSets are not in any guaranteed order. We need to improve the test a bit.

        _:test("CSV large restrict operator", function()
            local csv = CSVSet(CSVnames, CSVdata)
            local mi = XSet()
            mi:addAt("MI", "state")
            local restrictor = XSet()
            restrictor:addAt(mi,NULL)
            local restricted = csv:csvRestrict(restrictor)
            _:expect(restricted:card()).is(14)
            local foundJosephine = false
            for e,s in restricted:elements() do
                if e:hasAt("Josephine","first_name") then
                    foundJosephine = true
                end
            end
            _:expect(foundJosephine, "no Josephine record").is(true)
        end)

The test runs. The csvRestrict function is working!

Commit: csvRestrict works by pattern match against the whole CSVSet.

We have some cleaning up to do, but let’s do some timing first.

Dave1707 kindly told me how to do timing with resolution finer than a second.

        _:test("Time Restricts", function()
            local s = require("socket")
            local csv = CSVSet(CSVnames, CSVdata)
            local mi = XSet()
            mi:addAt("MI", "state")
            local restrictor = XSet()
            restrictor:addAt(mi,NULL)
            local st = s:gettime()
            local n = 1000
            for i = 1,n do
                csv:csvRestrict(restrictor)
            end
            local et = s:gettime()
            print(n.." csv ", et-st)
            st = s:gettime()
            for i = 1,n do
                csv:restrict(restrictor)
            end
            et = s:gettime()
            print(n.." long ", et-st)
        end)

The results are consistent, and not as good as I had anticipated. The new restrict with the amazing pattern matching takes about 5.7 seconds to run 1000 times, and the older one that reads the individual lines and converts them on the fly to XSets, takes about 7.5.

That’s fascinating, let’s think about it. Commit: csvRestrict works on large CSVdata set.

Reflection

I was really expecting a much greater difference. On reflection, however, the two approaches aren’t so very different.

The short csvRestrict form runs one gmatch over the whole set, returning an iterator that finds 14 lines. Each line is then converted into an XSet record, and stored in the result.

The long restrict runs one gmatch over the whole set, returning an iterator that will return all 500 lines. Each of the 500 lines is converted to an XSet, checked, and saved or not saved.

So it seems like the major difference between the two cases is the time it takes to convert 500 lines to XSets vs only 14. For my sanity, I’ll time that as well.

        _:test("Time CSV conversion", function()
            local csv = CSVSet(CSVnames, CSVoneLine)
            local s = require("socket")
            local st = s:gettime()
            local n = 100
            for i = 1,n do
                csv:convertToXSet(CSVoneLine)
            end
            local et = s:gettime()
            print(n.." convert ", et-st)
        end)

This surprises me again:

100 csv 	0.57949709892273
100 long 	0.75652408599854
100 convert 	0.00072002410888672

There goes my sanity. The conversion time from line to XSet, if this test is correct, is trivial compared to the difference between the two cases.

No, wait. I’m only comparing 100 record conversions. I should be comparing 100 times 500 conversions, because that’s how many get done in the long restrict. Here’s a better test:

        _:test("Time CSV conversion", function()
            local csv = CSVSet(CSVnames, CSVoneLine)
            local s = require("socket")
            local st = s:gettime()
            local n = 100
            for i = 1,500*n do
                csv:convertToXSet(CSVoneLine)
            end
            local et = s:gettime()
            print(n.."*500 convert ", et-st)
        end)

That provides a time that’s closer to explaining the difference:

100 csv 	0.57564401626587
100 long 	0.76237106323242
100*500 convert 	0.37060689926147

OK, that’s in the ballpark of explaining the difference. In fact, it’s larger than the observed difference, so there is some overhead in the fast version that we’re not accounting for, probably the slower matching of our more complex pattern.
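Here’s a back-of-envelope accounting, using the printed numbers above as representative (this is my arithmetic, not a profile):

```lua
-- Rounded numbers from the timing runs above (100 iterations each).
local fastTime   = 0.576  -- 100 csvRestrict calls
local longTime   = 0.762  -- 100 old-style restrict calls
local convert500 = 0.371  -- 100 * 500 conversions

-- Per restrict, the long version converts 500 lines, the fast one 14.
local convertPerRestrict = convert500 / 100
local predictedGap = convertPerRestrict * (500 - 14) / 500 * 100
local observedGap  = longTime - fastTime
print(predictedGap)  -- ~0.36: the gap conversions alone would predict
print(observedGap)   -- ~0.19: the gap we actually saw
-- The shortfall, ~0.17, is overhead in the fast version itself.
```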

We could do more profiling if we wished. I freely grant that I had expected a much greater difference between the two approaches, and it might be of value to dig into the performance question more deeply. Certainly if this were a real information system we’re building here, we’d want to look at the situation to see whether we can devise something that’s faster. As it stands, a 25 percent improvement is OK but not terribly impressive.

Of course what we set out to accomplish was something different. We set out to demonstrate that our same set operations could be implemented against more than one data structure, supporting the XST claim that it can be used uniformly over many data structures. Two isn’t many, but it’s a lot more than one, and I can see, and hope that you can see, that we could do similar things for other formats as well.

Commit: Some timing tests.

Let’s sum up. I’m tired.

Summary

We have a specialized restrict, csvRestrict implemented on our CSVSet class.

(We can make that the real version used either by renaming it, or adding a method restrict to CSVSet that calls it. The latter would leave it available for direct testing. We might even want to adopt an approach like that always, to expose our base methods for test access while using the public names for operational work.)

The csvRestrict uses an entirely different algorithm than our base one, and it is about 25% faster for the common case of one element restricted against. We should test it for more than one, because we know that the csv version will run two gmatch operations against the csv data if there are two records in the matching set. That could change the odds.

I had higher hopes for the performance of this version, based on my thinking that the pattern matching approach would be faster. Of course if the pattern matching library is written in Lua, it might not be all that much faster than other Lua code. Be that as it may, I am surprised and a bit disappointed in the performance.

But I am delighted by the fact that we have two different data structures running in the same information system, and that the code we write to manipulate them can be the same.

That’s pretty cool.

Next time, I think we’ll try a multi-record restricting set, and then look a bit further out. We might:

  • Write one or more new set operations;
  • Think of another data format to handle;
  • Do some entirely different other thing.

Honestly, I don’t know. It’ll be almost as much a surprise to me as it will to you.

See you then, I hope.