XST 13: Ideas Large and Small
Bill Wake was trying to hammer an idea into my head. I must think about that. And I have a small idea of my own.
In a series of exchanges on The Twitter yesterday, Bill Wake was trying to explain what he had thought I would do about CSV restrict, as opposed to what I actually did. I’m sure that I didn’t understand what he was getting at, but this morning I think I hit the trigger word that started to clue me in. So today, after I write this article, I’ll think about that, and see if I can understand his idea and try it.
But today, I had an idea of my own that I want to try.
The current scheme for CSV restrict is that I have a wild string pattern that does a string match to see whether the desired values are in the desired fields. In essence, this avoids unpacking the CSV records into XST records. The idea was that, using gmatch, only matching records would ever be unpacked.
That turns out to be faster than unpacking them all, but not much faster, only about 25 percent. So I want to try one, maybe two other approaches.
The current code is this:
function CSVSet:csvRestrict(anXSet)
    local result = XSet()
    for b,_null in anXSet:elements() do
        self:restrictOne(b, result)
    end
    return result
end

function CSVSet:restrictOne(b, result)
    local pattern = matchPattern(b, self.labels)
    for line in self.data:gmatch(pattern) do
        local converted = self:convertToXSet(line)
        result:addAt(converted,NULL)
    end
end
The function restrictOne is matching the pattern against the whole blob of data. What would happen if, instead, we pulled out the individual lines and matched?
function CSVSet:restrictOne(b, result)
    local linePat = ".-\n"
    local pattern = matchPattern(b, self.labels)
    for line in self.data:gmatch(linePat) do
        if line:find(pattern) then
            local converted = self:convertToXSet(line)
            result:addAt(converted,NULL)
        end
    end
end
I think this is equivalent. The tests will tell me, and the timing is what I’m interested in. The prior timing was 0.57-something seconds; this version comes in at 0.555. Better, but still not great. Revert.
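The timings, for what it’s worth, are simple elapsed-time measurements, not profiler output. A minimal sketch of such a harness, using standard Lua’s os.clock — the names here are invented for illustration, not my actual test code:

-- A minimal timing sketch, not the actual test harness.
-- os.clock and string.format are standard Lua.
local function timeIt(label, fn)
    local start = os.clock()
    fn()
    print(string.format("%s: %.3f seconds", label, os.clock() - start))
end

-- hypothetical usage:
-- timeIt("csvRestrict", function() return csvSet:restrict(restrictor) end)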
I have one more idea. The pattern I’m using has a character set for the separators, one that can match either a comma or a newline. I think that may require the gmatch to consider possible records that span lines. (It occurs to me that in rare cases, we might also find spurious records.)
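Here’s a made-up illustration of that worry. With [,\n] allowed at every separator position, a match can straddle the newline between two records, even when neither line matches on its own:

-- Made-up data, not the real file: neither line has "Pinckney" in the
-- city (third) slot, yet the pattern finds a match that straddles the
-- newline: "B","C","Pinckney"\n"MI",
local fld = '"[^"]-"'
local sep = '[,\n]'
local pat = fld..sep..fld..sep..'"Pinckney"'..sep..fld..sep
local data = '"A","B","C","Pinckney"\n"MI","x","y","z"\n'
print(data:match(pat))

That’s exactly the kind of spurious record I’m worried about.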
If the final separator in my pattern was just a newline, the match might fail sooner, and that would be a good thing.
That said, I’m not sure it’s true. I suspect that I need to get all the separators right, the inner ones set to comma and the final one to newline. Let’s try it.
Better Pattern
The good news is that I have a test for that:
_:test("restrict match pattern", function()
local result
local fld = '"[^"]-"'
local sep = '[,\n]'
local correctPat = fld..sep..fld..sep..'"Pinckney"'..sep..fld..sep
local names = {"last","first","city","state"}
local set = XSet()
set:addAt("Pinckney", "city")
local testPat = matchPattern(set, names)
_:expect(testPat).is(correctPat)
local data = '"Jeffries","Ron","Pinckney","MI"\n'
local databad = '"Jeffries","Ron","Punknord","MI"\n'
result = data:match(testPat)
_:expect(result).is(data)
result = databad:match(testPat)
_:expect(result).is(nil)
end)
This is when you’re really glad you have a test. Building up the pattern is tricky; we’re about to make it even more so, and we’re sure to get it wrong a few times. We can change this test and know quickly how we’re doing. Change the test:
_:test("restrict match pattern", function()
local result
local fld = '"[^"]-"'
local inner = ','
local outer = '\n'
local correctPat = fld..inner..fld..inner..'"Pinckney"'..inner..fld..outer
local names = {"last","first","city","state"}
local set = XSet()
set:addAt("Pinckney", "city")
local testPat = matchPattern(set, names)
_:expect(testPat).is(correctPat)
local data = '"Jeffries","Ron","Pinckney","MI"\n'
local databad = '"Jeffries","Ron","Punknord","MI"\n'
result = data:match(testPat)
_:expect(result).is(data)
result = databad:match(testPat)
_:expect(result).is(nil)
end)
That’ll fail nicely. The message is true but not helpful; the literal newline inside each [,\n] character class wraps the output:
20: restrict match pattern -- Actual: "[^"]-"[,
]"[^"]-"[,
]"Pinckney"[,
]"[^"]-"[,
], Expected: "[^"]-","[^"]-","Pinckney","[^"]-"
Now to change matchPattern. This is one of those fencepost deals, isn’t it? I always get those wrong.
function matchPattern(set,names)
    local fld = '"[^"]-"'
    local sep = '[,\n]'
    local matches = {}
    for e,s in set:elements() do
        matches[s] = '"'..e..'"'
    end
    local result = ""
    for _i,name in ipairs(names) do
        result = result..(matches[name] or fld)..sep
    end
    return result
end
Oh this should be easy. Hold my chai.
function matchPattern(set,names)
    local fld = '"[^"]-"'
    local inner = ','
    local outer = '\n'
    local matches = {}
    for e,s in set:elements() do
        matches[s] = '"'..e..'"'
    end
    local result = ""
    local last = #names
    for i,name in ipairs(names) do
        result = result..(matches[name] or fld)..(i==last and outer) or inner
    end
    return result
end
I am confident in this. Am I right to be? Let’s test and find out.
I was nearly right. Paren in the wrong place:
        result = result..(matches[name] or fld)..(i==last and outer or inner)
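For the record, the issue is precedence: in Lua, .. binds tighter than and/or. With the paren closing after outer, every non-final iteration computes (i==last and outer), which is false, and concatenating a boolean is a runtime error:

-- Tiny illustration of the precedence trap. The first line is the
-- corrected shape; the second, if uncommented, errors.
print("x"..(false and "\n" or ","))     --> x,
-- print("x"..(false and "\n") or ",")  --> attempt to concatenate a boolean value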
Tests all run and the time is now 0.47, down from 0.55. A clear win. I wonder what happens if I put in the other loop as well. First, commit: new matchPattern faster than old.
Now let’s try that other loop. It’s slower now. The reason is, I imagine, that our current match is good at finding individual lines already, so the savings isn’t there any more.
Retrospective
Interesting morning so far. I had an idea that wasn’t worth much: it was supposed to speed up the match a bit. It didn’t do much, but it got me thinking about a pattern with fewer options, one that couldn’t match so readily across a line boundary, and that offered a decent improvement. Since the pattern is actually simpler, and we have a test to make sure it’s created correctly (and to document it), it’s a very righteous change.
And I am super glad that I had that test. It made me much more certain that I was creating the right pattern, even though the actual code to do it was pretty simple.
The current CSV restrict time is about 65 percent of the long form’s, a savings of 35 percent or so. Not bad, though less than I had hoped for.
Let’s make this version official. We just write:
function CSVSet:restrict(anXSet)
    return self:csvRestrict(anXSet)
end
Now while we’re at it, let’s break out the base one so that we can test it for timing:
function XSet:restrict(B)
    return self:baseRestrict(B)
end

function XSet:baseRestrict(B)
    -- return all elements (a) of self such that
    -- there exists a record (b) in B such that
    -- b:isSubset(a)
    return self:select(function(a,s)
        return B:exists(function(b,s)
            return b:isSubset(a)
        end)
    end)
end
After an adjustment to the test, I can still get the timings. Commit: made faster csvRestrict the official version.
Now there is something we haven’t tried. We haven’t tried providing more than one restricting record to the function. I am concerned that, because that will do multiple passes over the large CSV set, it may actually be slower than the base. We can find out easily enough.
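Building the multi-record restrictor uses only calls we’ve already seen: values added at their field names, records added at NULL scope. A sketch, with city values invented for the test:

-- Sketch of a two-record restrictor; the values are invented.
local restrictor = XSet()
local r1 = XSet()
r1:addAt("Pinckney", "city")
local r2 = XSet()
r2:addAt("Saline", "city")
restrictor:addAt(r1, NULL)
restrictor:addAt(r2, NULL)
-- csvRestrict loops over both records, scanning the CSV data once per record.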
With two records, my hot version is 7 percent slower. Let’s try with five.
Bummer. My super-fast, optimized, ever so wonderful csvRestrict is 70 percent more costly if there are five records in the restricting set. The reason, of course, is the multiple passes over the data.
Because I don’t want to give up this marvelous idea (sunk cost fallacy, anyone?) I do this:
function CSVSet:restrict(anXSet)
    if anXSet:card() == 1 then
        return self:csvRestrict(anXSet)
    else
        return self:baseRestrict(anXSet)
    end
end
Is this a total crock, or just a partial crock? Well, the most common restrict is with just one element in the restrictor, so I think it’s just a partial crock.
Commit: use “fast” csvRestrict on singleton restrictors.
Let’s sum up.
Summary
The big lesson about optimizing code remains the same: measure, measure, measure. If I had a profiler, I’d profile this, but I don’t have one. There might be one out there but life’s too short. Measuring told me a lot.
My original scheme for a faster restrict was about 25 percent faster on a single restricting record. Today I tried another scheme that was a tiny bit better, but not enough to be worth the complexity. That sad result led me to consider my match pattern and to optimize it so that it is faster to fail. It’s always difficult to guess what a string pattern’s performance will be, but a more restrictive pattern tends to fail sooner, which is generally what we want. In this case, I got to about 35 percent better than the base restrict. So that was worth doing, since the pattern is simpler, and generating it remained simple as well.
Now I need to go think about and read up on function folding, to see whether I can work out what Bill was thinking. And I’ll probably ask him, as well. That’s usually a good way to find out what people are thinking.
Next time? I don’t know yet. Join me to find out!