There’s no way around it, I’ve got to work on the fast restrict today. Might not finish. We’ll see.

My cunning plan for the fast restrict is “just” to create a pattern on the fly that will find matching records directly in the giant string that is my CSV data. There are some issues. One to remember is something that restrict can do that I may not have mentioned. Since the restricting set is a set of records, you can put multiple records in, so you could select for city Atlanta and also city Pinckney. But since sets don’t care what their records are, you could just as well select for city Atlanta and last name Jeffries.

Now the standard expression and implementation for restrict is this:

function XSet:restrict(B)
    -- return all elements (a) of self such that
    --  there exists a record (b) in B such that
    --      b:subset(a)
    return self:select(function(a,s)
        return B:exists(function(b,s) 
            return b:isSubset(a) 
        end) 
    end)
end

Note that this implementation iterates the set being restricted once, and for each record, iterates the restricting set. In the CSV set we’ll want to do the reverse, because we can find all the matching records for a single restricting record in one go. We’ll have the same set-theoretic definition, but a different implementation.
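
To make the loop order concrete, here is a minimal standalone sketch of the same idea. It assumes plain Lua tables in place of XSet: a record is a table of field name to value, a set is an array of records, and the helper names and sample people are made up for illustration.

```lua
-- Hypothetical stand-ins: none of this is the real XSet API.

-- true if every field of b appears in a with the same value
local function isSubset(b, a)
    for field, value in pairs(b) do
        if a[field] ~= value then return false end
    end
    return true
end

-- keep each record a of A for which some record b of B is a subset of a
local function restrict(A, B)
    local result = {}
    for _, a in ipairs(A) do          -- outer loop: the set being restricted
        for _, b in ipairs(B) do      -- inner loop: the restricting set
            if isSubset(b, a) then
                result[#result + 1] = a
                break                 -- "exists": one matching b is enough
            end
        end
    end
    return result
end

local people = {
    { last = "Jeffries", first = "Ron", city = "Pinckney" },
    { last = "Smith",    first = "Ann", city = "Atlanta"  },
    { last = "Jones",    first = "Pat", city = "Detroit"  },
}
-- select for city Atlanta, and also for last name Jeffries
local wanted = { { city = "Atlanta" }, { last = "Jeffries" } }
local found = restrict(people, wanted)
```

Because the outer loop visits each record of the restricted set exactly once, a record can never be selected twice in this arrangement.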

Or we could write a different set-theoretic description, and if we were fanatics, prove that it is equivalent. Something like:

A:restrict(B):
  for every element (b) of B,
  return all elements (a) of A, such that
  b:subset(a)

Hmm. There’s a trap in here. Done this way, there is the possibility of more than one B record selecting the same A record. Searching for city Pinckney or last name Jeffries would find my record twice.

Now set-theoretically, {a,a} == {a}, so we could sort of allow that, but if we did, it could lead to trouble later. We’ll need to keep this in mind for our tests (and implementation).
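
A small sketch of the trap, and of one possible guard against it. It again assumes plain tables rather than XSet, and the `seen` table is an illustrative choice, not something the real implementation is known to use.

```lua
-- Hypothetical plain-table stand-ins; not the XSet or CSVSet API.
local function isSubset(b, a)
    for field, value in pairs(b) do
        if a[field] ~= value then return false end
    end
    return true
end

-- Outer loop over the restricting records, the order the CSV version wants.
local function restrictReversed(A, B)
    local result, seen = {}, {}
    for _, b in ipairs(B) do
        for _, a in ipairs(A) do
            if isSubset(b, a) and not seen[a] then
                seen[a] = true            -- table identity marks "already selected"
                result[#result + 1] = a
            end
        end
    end
    return result
end

local me    = { last = "Jeffries", city = "Pinckney" }
local other = { last = "Smith",    city = "Atlanta"  }
local found = restrictReversed({ me, other },
                               { { city = "Pinckney" }, { last = "Jeffries" } })
```

Without the `seen` check, `me` would be added once for each restricting record it satisfies. For records found inside a big string, something other than table identity would have to play that role, perhaps the position where the line starts.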

But we’re not close to that yet. Let’s get started.

The Data

I propose to modify my input data for convenience. I’m going to “require” that our CSV input files have all fields enclosed in quote marks. That’s one regex-replace in Sublime Text. With that done, I can change how our line conversion works:

function CSVSet:convertToXSet(line)
    local result = XSet()
    local nameCount = 1
    pat = '"(.-)"[,\n]'
    local betterLine = self:fixLine(line)
    for unquoted in betterLine:gmatch(pat) do
        local label = self.labels[nameCount]
        result:addAt(unquoted, label)
        nameCount = nameCount + 1
    end
    return result:lock()
end

That can be this:

function CSVSet:convertToXSet(line)
    local result = XSet()
    local nameCount = 1
    local pat = '"(.-)"[,\n]'
    for unquoted in line:gmatch(pat) do
        local label = self.labels[nameCount]
        result:addAt(unquoted, label)
        nameCount = nameCount + 1
    end
    return result:lock()
end
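
To see what that pattern does on its own, here is a standalone sketch that assumes a plain table in place of the XSet. The function name `fieldsOf` is made up for illustration.

```lua
-- Split one fully quoted CSV line into a label -> value table.
local function fieldsOf(line, labels)
    local pat = '"(.-)"[,\n]'   -- a quoted field followed by a comma or the newline
    local record, index = {}, 1
    for unquoted in line:gmatch(pat) do
        record[labels[index]] = unquoted
        index = index + 1
    end
    return record
end

local labels = { "last", "first", "company", "zip", "city" }
local line = '"Jeffries","Ronald","XPROGRAMMING, Inc.","48169","Pinckney"\n'
local record = fieldsOf(line, labels)
```

The comma inside the company name survives because a field only ends at a quote mark that is followed by a separator. A field without quotes is simply skipped by the pattern, so every later value would land under the wrong label.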

I’ll run the tests now, and I’m kind of worried, because running a regex on your database is a bit like throwing it into a wood chipper. But we’ll rely on the tests to find any bad lines.

Hm, bad things happened, and it was in the copy-paste. The data tab is messed up. Revert, exit Codea, restart.

The data looks better this time. Tests run. Commit it.

Make the convert change. Test again, and the “CSV to record” test fails. Before I even look I recall that I didn’t change the mini data.

Test still fails. See what it says:

        _:test("Convert CSV to record", function()
            local names = { "last", "first","company","zip","city" }
            local line = '"Jeffries","Ronald","XPROGRAMMING, Inc.",48169,"Pinckney"\n'
            local set = CSVSet(names,line)
            local xset = set:convertToXSet(line)
            _:expect(xset:card()).is(5)
            _:expect(xset:hasAt("Jeffries","last"),"last").is(true)
            _:expect(xset:hasAt("Ronald","first"),"first").is(true)
            _:expect(xset:hasAt("Pinckney","city"),"city").is(true)
            _:expect(xset:hasAt("XPROGRAMMING, Inc.","company"),"company").is(true)
            _:expect(xset:hasAt("48169","zip"),"zip").is(true)
        end)

That test’s input line still has zip without quotes, and the new pattern only sees quoted fields. The test is now allowed to assume that zip is enclosed in quotes. Fix that.

All tests are green. Commit: converted to data with all CSV fields guaranteed in quotes.

Now it’s time to do some work.

The Pattern

Our CSV set has an array of field names 1 to N. It has a giant string of lines, each line containing N fields. My unsophisticated plan is to create a gmatch pattern that will match lines with the desired fields in the desired places.

I’ll start by testing some match operations to let me build up an understanding of the patterns I need.

        _:test("Restrict string pattern", function()
            local data = [["Jeffries","Ron","Pinckney","MI"]]
            local pat = [[".-",".-","Pinckney",".-"]]
            local result = data:match(pat)
            _:expect(result).is(data)
        end)

This test runs. The idea is that we’ll string together ".-" matchers for the fields that do not appear in our restricting record, and the record’s contents (enclosed in quotes) for the fields that do.

This isn’t quite robust enough, because the last element of the pattern can’t have a comma, which I discovered when the test didn’t run. We can either commit to removing the comma, or we could create a more complex pattern.

I’m having the usual trouble with patterns, which is that it’s easy to get it nearly right. Here’s the current running test:

        _:ignore("Restrict string pattern", function()
            local data = [["Jeffries","Ron","Pinckney","MI"\n]]
            local fld = '".-"[,\n]'
            local pat = '".-"[,\n]".-"[,\n]"Pinckney",".-"[,\n]'
            local result = data:match(pat)
            _:expect(result).is(data)
            local assembledPat = fld..fld..'"Pinckney",'..fld
            result = data:match(assembledPat)
            _:expect(result).is(data)
        end)

Even with this advanced pattern that matches comma or newline, to accommodate the last element, we’re not quite there. I think I need to put the separator check into the pattern after the literal.

Let’s break it down further.

        _:ignore("Restrict string pattern", function()
            local data = [["Jeffries","Ron","Pinckney","MI"\n]]
            local fld = '".-"'
            local sep = '[,\n]'
            local pat = '".-"[,\n]".-"[,\n]"Pinckney",".-"[,\n]'
            local result = data:match(pat)
            _:expect(result).is(data)
            local assembledPat = fld..sep..fld..sep..'"Pinckney"'..sep..fld..sep
            result = data:match(assembledPat)
            _:expect(result).is(data)
        end)

That’s pretty atomic, but it works. Now let’s build one from some input data.

Oh no! I had the test up there set to ignore. I’ve been coding along thinking things were working when they weren’t.

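I finally realized that strings in double square brackets (Lua’s long strings) do not interpret escape sequences, so the \n in my data was a literal backslash and n rather than a newline, which was messing me up. Here’s a test that works and looks like it might have a useful structure: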

        _:test("Restrict string pattern", function()
            local result
            local data    = '"Jeffries","Ron","Pinckney","MI"\n'
            local databad = '"Jeffries","Ron","Punknord","MI"\n'
            local fld = '".-"'
            local sep = '[,\n]'
            local assembledPat = fld..sep..fld..sep..'"Pinckney"'..sep..fld..sep
            result = data:match(assembledPat)
            _:expect(result).is(data)
            result = databad:match(assembledPat)
            _:expect(result).is(nil)
        end)
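
The difference between the two string forms can be checked in isolation. This is a small sketch in plain Lua, with variable names chosen only for illustration.

```lua
local long   = [["MI"\n]]   -- long bracket: backslash and n are two literal characters
local quoted = '"MI"\n'     -- quoted string: \n is a single newline character

local sep = '[,\n]'         -- the separator class from the test above
local longMatch   = long:match('"MI"' .. sep)     -- nothing to match after the quote
local quotedMatch = quoted:match('"MI"' .. sep)   -- matches through the newline
```

The long-bracket version is one character longer, and the separator class never sees a newline in it, which is why the earlier patterns could not have matched that data.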

The assembly of the pattern is tedious for a human but should be “easy” for a computer. Let’s do a test.

        _:test("restrict match pattern", function()
            local fld = '".-"'
            local sep = '[,\n]'
            local correctPat = fld..sep..fld..sep..'"Pinckney"'..sep..fld..sep
            local names = {"last","first","city","state"}
            local set = XSet()
            set:addAt("Pinckney", "city")
            local testPat = matchPattern(set, names)
            _:expect(testPat).is(correctPat)
        end)

I plan to write the function right here and then we’ll figure out where it should really belong.

Fail should be missing function.

20: restrict match pattern -- Tests:274: attempt to call a nil value (global 'matchPattern')

Write it empty:

function matchPattern(set,names)
    return ""
end

Should fail with wrong answer:

20: restrict match pattern  -- Actual: , Expected: ".-"[,
]".-"[,
]"Pinckney"[,
]".-"[,
]

Ugly but accurate. CodeaUnit doesn’t love comparing strings with newlines. Or I don’t love how it displays the result.

Anyway the function came together easily, though in the first cut I forgot to insert the quotes.

function matchPattern(set,names)
    local fld = '".-"'
    local sep = '[,\n]'
    local matches = {}
    for e,s in set:elements() do
        matches[s] = e
    end
    local result = ""
    for _i,name in ipairs(names) do
        local val = matches[name]
        if val then
            result = result..'"'..val..'"'..sep
        else
            result = result..fld..sep
        end
    end
    return result
end

This runs green. We can improve it.

function matchPattern(set,names)
    local fld = '".-"'
    local sep = '[,\n]'
    local matches = {}
    for e,s in set:elements() do
        matches[s] = '"'..e..'"'
    end
    local result = ""
    for _i,name in ipairs(names) do
        result = result..(matches[name] or fld)..sep
    end
    return result
end

Nice. I’m reminded that I’ve not committed for ages. Better lock in this good stuff. Commit: running tests for restrict matching patterns.

Yes, that was almost two hours without a commit. Very risky out there on that thin ice. It would have been better to have locked in the various pattern discoveries.

I’ve got two hours in. I think I’ll take a break, so let’s sum up.

Summary

We have a pattern style consisting of a series of fld,sep, where fld is either a general matcher of anything in quotes, or a literal value in quotes. We string these together to select a series like

anything,anything,PINCKNEY,anything
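
One thing that may be worth watching when this pattern meets a multi-line string: in Lua patterns `.` matches any character, including quote marks and newlines, so the lazy `".-"` can stretch across several fields when that is the only way to reach the literal. The sketch below uses two made-up records to show it, and tries a tighter field matcher that is purely an assumption here, not something the tests above use.

```lua
local sep   = '[,\n]'
local lazy  = '".-"'        -- the field matcher used in the tests above
local tight = '"[^"]*"'     -- assumption: an alternative that cannot cross a quote mark

-- Two made-up records; only the second has Pinckney in the third field.
local data = '"Pinckney","Ron","Jeffries","MI"\n' ..
             '"Smith","Ann","Pinckney","GA"\n'

local function patternFor(fld)
    return fld .. sep .. fld .. sep .. '"Pinckney"' .. sep .. fld .. sep
end

local lazyMatch  = data:match(patternFor(lazy))    -- swallows both lines
local tightMatch = data:match(patternFor(tight))   -- only the second line
```

The tighter matcher is not a complete answer either: since sep accepts a newline, a match could still begin in the middle of one line and finish in the next, so anchoring on line boundaries would probably be a separate step.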

To create the string for a given XSet intended as an element of a restrict selection set, we save any values from the selection set under their scope (the field name), and then string together either the generic field matcher or that literal.
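
Here is that assembly as a standalone sketch, assuming a plain table of field name to value in place of the XSet. It also adds one thing the version above does not do: escaping Lua’s pattern magic characters in the literal, since a value like "XPROGRAMMING, Inc." contains a `.` that would otherwise match any character. The helper `escapeMagic` is hypothetical.

```lua
-- Hypothetical helper: put % in front of every punctuation character so the
-- value is matched literally. The outer parentheses drop gsub's count.
local function escapeMagic(text)
    return (text:gsub("%p", "%%%0"))
end

-- `record` is a plain table of field name -> value standing in for the XSet.
local function matchPattern(record, names)
    local fld = '".-"'
    local sep = '[,\n]'
    local result = ""
    for _, name in ipairs(names) do
        local value = record[name]
        local piece = value and ('"' .. escapeMagic(value) .. '"') or fld
        result = result .. piece .. sep
    end
    return result
end

local names = { "last", "first", "company", "city" }
local pat = matchPattern({ company = "XPROGRAMMING, Inc." }, names)
local good = '"Jeffries","Ronald","XPROGRAMMING, Inc.","Pinckney"\n'
local near = '"Jeffries","Ronald","XPROGRAMMING, Incx","Pinckney"\n'
local goodMatch = good:match(pat)   -- the whole line
local nearMatch = near:match(pat)   -- nil, because the final . is now literal
```

For a value with no punctuation, such as Pinckney, the escaped pattern is the same string the test above expects.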

Set Theory

I came up with that scheme with my programmer hat on. If I were a true set theory wizard, I might have defined that string as a set in set theory. If you hold your mouth just right you can almost see it.

For every name (n) in the scope set of the CSV file, 
  return a set consisting of f(n,b), where b is the restrictor,
  and f(n,b) = fld..sep if n is not in the scope set of b,
    and "e"..sep if b:hasAt(e,n).

The result is some magical kind of string-of-characters set, but that’s OK, we could define such a thing.

If we were to do that, and if we were to define suitable set operations on strings, we’d be doing set theory all the way down. We’re not there yet, and we may never be there. But we can perhaps get a glimmer of how it could be that way.

That said, I’m comfortable with what we have here. If I had written the thing out in set theory, I’d have wanted to code up the method about this tight anyway. It would just have a better name and a tighter definition.

What’s Next?

I think we’ll just press on. We’ll figure out a reasonable place to put that matchPattern function, probably on XSet. Then, I hope with some small sensible test-driven steps, we’ll drive out the new magical CSV restrict function. And it will be far faster than the one we have now.

Unless I miss my guess. That could happen.

We’ll see. I hope you’ll tune in.