Getting started with CSV data. And reporting a conversation.

First the conversation. In our Friday Night Zoom Ensemble last night (Tuesday) I was talking about this project and why Extended Set Theory might be a good idea. I found myself waving my hands a lot, partly to show “so there’s this data over here and that data over there”, but also because it seems to take a lot of hand-waving to explain why this could ever be more than an amusing early morning coding project.

When I explain why XST could be a good basis for building a real information system, I wind up referring to the ability to have the information, such as records with LastName, FirstName, MiddleInitial, separate from the data, which might variously be a table of fixed length records or a text file full of CSV lines.

The “idea” is that we can express our query in a formal language, which can come down to set theoretic statements, because we’re doing set theory. This is not new. If you have studied the history of relational databases, the language SQL can come down to expressions in relational algebra, which amounts to a bunch of functions like Project, Join, Rename, whatever. These operations are intended to be “well defined” and founded in a way that allows them to be combined with each other, to produce any desired result.

XST is a different formalism from relational algebra, and it is arguably better founded, in that it can express data structures that relational algebra cannot. For example, you can’t really express, say, a B-tree index in relational algebra. It’s not a relation. But you can express it in set theory, and you can express its relationship with the data that it indexes.

So, in principle, this means that as you build your relational system, you’ll come upon structures where you have to back out of relational thinking, and plug in some perfectly reasonable thing like an index, just by coding. Using set theory, you could continue your thinking at the more abstract level, which would give you leverage in terms of the ease of defining your b-tree, and creating it with confidence that it will behave as intended.

I’ve worked with XST for years, and while I have never even once seen it come to full fruition, I am confident that the benefits it offers–the benefits that Dave Childs has touted for decades–are real and attainable.

They’re just very difficult to attain. I think the main reason is that to attain them you have to combine deep mathematics with deep programming. Few of us, myself included, can do that particularly well.

We, here, are not going to reach the heights of XST, nor are we going to plumb to its depths. I do think we’ll produce some interesting results that set theory makes easier.

I emphasize “easier”. Nothing I’m doing here couldn’t be accomplished by ordinary programming by any good programmer. There is no magic sauce, no set theory equation, no Philosopher’s Stone, without which we cannot program anything we can imagine. XST might make it easier. It doesn’t make anything possible that was formerly impossible.

And it’s really fun to program it. Let’s get to that part.

CSV Data

Looking for sample data, I found this site, which offers some smallish data sets free and larger ones for a small fee. We’ll make do with a free one. I don’t really want ten million records on my iPad.

I’ll download the data on my Mac, and copy-paste it into a Codea tab. This may save me figuring out how to read a file, which may or may not be possible in Codea and certainly isn’t interesting.

I create a tab called CSVdata and paste into it:

CSVnames = { "first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web" }

CSVdata = [["James","Butt","Benton, John B Jr","6649 N Blue Gum St","New Orleans","Orleans","LA",70116,"504-621-8927","504-845-1427","jbutt@gmail.com","http://www.bentonjohnbjr.com"
"Josephine","Darakjy","Chanay, Jeffrey A Esq","4 B Blue Ridge Blvd","Brighton","Livingston","MI",48116,"810-292-9388","810-374-9840","josephine_darakjy@darakjy.org","http://www.chanayjeffreyaesq.com"
...
"Jani","Biddy","Warehouse Office & Paper Prod","61556 W 20th Ave","Seattle","King","WA",98104,"206-711-6498","206-395-6284","jbiddy@yahoo.com","http://www.warehouseofficepaperprod.com"
"Chauncey","Motley","Affiliated With Travelodge","63 E Aurora Dr","Orlando","Orange","FL",32804,"407-413-4842","407-557-8857","chauncey_motley@aol.com","http://www.affiliatedwithtravelodge.com"
]]

I decide to make a smaller set to play with:

CSVminidata = [["James","Butt","Benton, John B Jr","6649 N Blue Gum St","New Orleans","Orleans","LA",70116,"504-621-8927","504-845-1427","jbutt@gmail.com","http://www.bentonjohnbjr.com"
"Josephine","Darakjy","Chanay, Jeffrey A Esq","4 B Blue Ridge Blvd","Brighton","Livingston","MI",48116,"810-292-9388","810-374-9840","josephine_darakjy@darakjy.org","http://www.chanayjeffreyaesq.com"
"Art","Venere","Chemel, James L Cpa","8 W Cerritos Ave #54","Bridgeport","Gloucester","NJ","08014","856-636-8749","856-264-4130","art@venere.org","http://www.chemeljameslcpa.com"
]]

That seems prudent.

Now we have to figure out what to do with this stuff. That deserves a new heading.

What’s The Plan?

My overall plan is to do a restrict operation on this data, retrieving some records based on a set containing one or more partially matching records. (That’s what restrict does, after all.) I plan to do it in two different ways.

We’ll begin with a new subclass of XST to contain the CSV data. We’ll probably do some refactoring there, perhaps to wind up with one abstract XST class and two concrete ones, one with our current data structure and one with the new CSV one.

We’ll implement the iterator elements on the CSV data, which should be sufficient to allow us to do a restrict on it. The iterator will essentially unpack the CSV data and produce records in our standard format, after which the restrict operation should run. (I expect there will be some discoveries here, but it should be essentially straightforward.)

And then (a miracle happens) we’ll try to implement a restrict operator in the CSVset class, one that does the operation more efficiently. I’m hoping that it will be much more efficiently.

If and when this works, and it won’t be today, we’ll have demonstrated, in a small way, that the XST style of thinking lets us process multiple forms of data, and even lets us do so rather efficiently.

If it doesn’t work, well, we’ll have learned something. But I think it will work, so if it doesn’t, we’ll have to wait until then to find out what we’ve learned.

To that end …

Begin With a Test

I wrote that header so that I would be required to start with a test even though I’d really like to start with code.

OK, what would be easy? How about this one:

        _:test("CSVset", function()
            local csv = CSVSet(CSVminidata)
            _:expect(csv:card()).is(3)
        end)
13: CSVset -- Tests:194: attempt to call a nil value (global 'CSVSet')

That demands the class. I’m going to code a bit more than absolutely needed. So sue me.

CSVset = class(XSet)

function CSVset:init(labels,data)
    self.lables = labels
    self.data = data
end

I realized as I was writing it that I need the labels as well as the data, so I’ll change the test:

        _:test("CSVset", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            _:expect(csv:card()).is(3)
        end)

I expect a failure on card. I am surprised twice. First, I need to decide whether to capitalize the S in set or not. Let’s go with yes, as XSet has it.

The other surprise was this:

13: CSVSet -- Tests:225: Tests:249: bad argument #1 to 'for iterator' (table expected, got nil)

Since my CSVSet is a subclass of XSet, it’s trying to do the existing card operator, which counts the records, and of course elements can’t cope with this guy. (That does give me an idea, but probably a bad one. We could provide an expanded contents in the XSet format. Let’s not go there, at least not now.)

Let’s implement card. It seems possible.

function CSVSet:card()
    local pat = ".-\n"
    local rep = "\n"
    local nls = self.data:gsub(pat,rep)
    return nls:len()
end

The test runs. I’m almost surprised. See what I did there? I replaced all the lines “.-\n” with “\n” and then counted them.

Let’s replace with x instead, just so that I can print the answer.

function CSVSet:card()
    local pat = ".-\n"
    local rep = "X"
    local nls = self.data:gsub(pat,rep)
    print(nls)
    return nls:len()
end

That prints XXX. I remain surprised: that worked the first time. Perhaps I am in fact a hairy wizard … but probably not.

Commit: CSVset can respond to ``card`.

Let’s test card further by checking the cardinality of the big set.

        _:test("CSVSet", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            _:expect(csv:card()).is(3)
            csv = CSVSet(CSVnames, CSVdata)
            _:expect(csv:card()).is(500)
        end)

Runs. Commit: card works twice.

Was That a Trick?

You may be asking yourself whether that implementation of card is legitimate or some kind of a trick. I think it is both.

Now I have not done the set theory to show me that the cardinality of a CSV set is equal to the number of lines, but you don’t need a degree in math to see that it is. So counting the lines is legit. It’s a trick, in that we didn’t do the set theory, but it’s not a bad trick. It is clever code, not set theoretic code, but it implements a set theoretic operator correctly.

I have, however, almost by accident, revealed what I plan to do as the final reveal in this phase, implementing restrict directly on the CSVSet. I plan to implement it with some kind of fancy string manipulation, yet to be figured out.

We do need to implement elements on our set, and to drive that out, I think I’ll turn off our local card so that our test will use the top level one.

13: CSVSet -- Tests:227: Tests:251: bad argument #1 to 'for iterator' (table expected, got nil)

That error is in the standard card:

function XSet:card()
    local card = 0
    for e,s in self:elements() do
        card = card+1
    end
    return card
end

We’ll need to implement elements on CSVSet now, but it can be pretty simple. We will, of course, emulate the original:

function XSet:elements()
    -- return an iterator for contents
    return coroutine.wrap(function() 
        for scope,elements in pairs(self.contents) do
            for element,_true in pairs(elements) do
                coroutine.yield(element,scope)
            end
        end
    end)
end

I think I’ll make card work, but leave out all the stuff that makes a legitimate element.

function CSVSet:xxcard()
    local pat = ".-\n"
    local rep = "X"
    local nls = self.data:gsub(pat,rep)
    return nls:len()
end

function CSVSet:elements()
    -- return an iterator for contents
    local pat = ".-\n"
    return coroutine.wrap(function() 
        for line in self.data:gmatch(pat) do
            coroutine.yield(line)
        end
    end)
end

The test still runs. Of course the line isn’t formatted as a set, but we’re just counting in card so no harm done. I want to print just to see it work.

        _:test("CSVSet", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            _:expect(csv:card()).is(3)
            for line in csv:elements() do
                print(line)
            end
            csv = CSVSet(CSVnames, CSVdata)
            _:expect(csv:card()).is(500)
        end)

And we get this in our output:

people

Nice. Not our format, but nice.

Let’s see about creating a legitimate set for this data.

An important fact–at least it seems important to me–is that we don’t have to create every amazing kind of flat and curly set here inside the CSVdata implementation. There is only one kind of thing here, a set of records.

Let’s TDD that creation. We can see what we’d like to say in our elements implementation:

coroutine.yield(self:convertToXSet(line))

So we’ll see about TDDing that:

        _:test("Convert CSV to record", function()
            local names = { "last", "first","city" }
            local line = '"Jeffries","Ronald","Pinckney"\n'
            local set = CSVSet(names,line)
            local xset = set:convertToXSet(line)
            _:expect(xset:card()).is(3)
            _:expect(xset:hasAt("Jeffries","last")).is(true)
            _:expect(xset:hasAt("Ronald","first")).is(true)
            _:expect(xset:hasAt("Pinckney","city")).is(true)
        end)

I’m expecting to fail on the convert.

14: Convert CSV to record -- Tests:204: attempt to call a nil value (method 'convertToXSet')

Now a simple matter of implementing it.

Um, simple. Let’s see. What if we looped over a match on the line, and kept a counter of the index in the names table … yeah, something like that …

A bit of mistake-making with gmatch and:

function CSVSet:convertToXSet(line)
    local result = XSet()
    local nameCount = 1
    pat = ".-[,\n]"
    for fieldInQuotes in line:gmatch(pat) do
        local unquoted = fieldInQuotes:sub(2,-3)
        result:addAt(unquoted, self.labels[nameCount])
        nameCount = nameCount + 1
    end
    return result:lock()
end

That passes the test. Commit: Can convert CSVSet record to XSet.

So, this is nice. I should, in principle, be able to do a restrict on the set now, I think. We’ll try the little one first of course.

I have high hopes for this but (spoiler alert) they are about to be dashed:

        _:test("CSVSet restrict", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            local brighton = XSet()
            brighton:addAt("Brighton", "city")
            local restrictor = XSet()
            restrictor:addAt(brighton,NULL)
            local restricted = csv:restrict(restrictor)
            _:expect(restricted:card()).is(1)
        end)

When I run the test, I get this:

15: CSVSet restrict -- Tests:333: attempt to call a nil value (method 'hasAt')

Best look at that:

function XSet:isSubset(other)
    -- return whether self is a subset of other
    -- for every element e,s of self
    --   is e element-s of other
    return self:every(function(e,s) return other:hasAt(e,s) end)
end

That’s called from here:

function XSet:restrict(B)
    -- return all elements (a) of self such that
    --  there exists a record (b) in B such that
    --      b:subset(a)
    return self:select(function(a,s) 
        return B:exists(function(b,s) 
            return b:isSubset(a) 
        end) 
    end)
end

It seems to me that when we’re in restrict, a is our record, which should be converted to an XSet at this point, since we’re calling the CSVSet’s elements function, via select. And B should be our restrictor, which has one element, our brighton set. It seems to me that both the inputs here should be XSet instances. I need to print. That will require me to ignore a bunch of other tests for a bit, since CodeaUnit won’t let me run just one.

A print in isSubset tells me immediately that I’m passing the line, not the XSet of the line.

I finally realize that I’ve not changed the elements method. It should be this:

function CSVSet:elements()
    -- return an iterator for contents
    local pat = ".-\n"
    return coroutine.wrap(function() 
        for line in self.data:gmatch(pat) do
            coroutine.yield(self:convertToXSet(line))
        end
    end)
end

Run the test again.

13: CSVSet -- Tests:248: Tests:240: table index is nil

That means that the names table ran out of names, I think.

The error arises here:

function XSet:addAt(element, scope)
    if self.locked then error("set is locked") end
    if self:hasAt(element,scope) then return end
    if not self.contents[scope] then
        self.contents[scope] = {} -- <---
    end
    self.contents[scope][element] = true
    return self
end

So we called with a nil scope. Let me print there:

Ah. I see the error. There are commas in the data, which means that I need a smarter matching function to pull the line apart. We can do that.

Arrgh. No, we can’t, quite. It turns out that not all the fields are enclosed in quotes. Zip codes, for example, are not.

Ah, data cleaning, gotta love it.

Should I try a more clever match, or hammer my data to ensure quotes? Either way, my point will be proven, but a smarter pattern match feels a bit more in the spirit of things. I’ll write a test for that.

I was hoping to use Lua’s balanced delimiter pattern but the delimiters have to be distinct. So that won’t do. Let’s see about inserting quotes.

        _:test("Smarter Match", function()
            local line = '"jeffries","ron",49169,"pinckney"'
            local noLeft = ',([^"])'
            local addLeft = ',"%1'
            local fixed = line:gsub(noLeft,addLeft)
            _:expect(fixed).is('"jeffries","ron","49169,"pinckney"')
            local noRight = '([^"]),'
            local addRight = '%1",'
            fixed = fixed:gsub(noRight, addRight)
            _:expect(fixed).is('"jeffries","ron","49169","pinckney"')
        end)

OK this is admittedly nasty, but we’re on a different mission.

Let’s change the test to make us push that code into CSVSet.

        _:test("Smarter Match", function()
            local line = '"jeffries","ron",49169,"pinckney"'
            local fixed = CSVSet:fixLine(line)
            _:expect(fixed).is('"jeffries","ron","49169","pinckney"')
        end)

function CSVSet:fixLine(line)
    local noLeft = ',([^"])'
    local addLeft = ',"%1'
    local noRight = '([^"]),'
    local addRight = '%1",'
    return line:gsub(noLeft,addLeft):gsub(noRight, addRight)
end

That passes. Use it:

function CSVSet:convertToXSet(line)
    local result = XSet()
    local nameCount = 1
    pat = '"(.-)"[,\n]'
    for unquoted in self:fixLine(line):gmatch(pat) do
        --local unquoted = fieldInQuotes:sub(2,-3)
        result:addAt(unquoted, self.labels[nameCount])
        nameCount = nameCount + 1
    end
    return result:lock()
end

Unignore some key tests:

        _:test("CSVSet", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            _:expect(csv:card()).is(3)
            csv = CSVSet(CSVnames, CSVdata)
            _:expect(csv:card()).is(500)
        end)
        
        _:test("Convert CSV to record", function()
            local names = { "last", "first","zip","city" }
            local line = '"Jeffries","Ronald",48169,"Pinckney"\n'
            local set = CSVSet(names,line)
            local xset = set:convertToXSet(line)
            _:expect(xset:card()).is(3)
            _:expect(xset:hasAt("Jeffries","last")).is(true)
            _:expect(xset:hasAt("Ronald","first")).is(true)
            _:expect(xset:hasAt("Pinckney","city")).is(true)
            _:expect(xset:hasAt("48169","zip")).is(true)
            print("CSV to Record ENDED")
        end)
        
        _:test("CSVSet restrict", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            local brighton = XSet()
            brighton:addAt("Brighton", "city")
            local restrictor = XSet()
            restrictor:addAt(brighton,NULL)
            local restricted = csv:restrict(restrictor)
            _:expect(restricted:card()).is(1)
        end)

I enhanced the “Convert CSV to record” test to include an unquoted zip code, as an extra check. Run them, see what explodes.

A few things do explode. This is getting fragile. I’m losing the picture. Too many changes before a commit. But I’ll push on. One test at a time.

13: CSVSet -- Tests:257: Tests:249: table index is nil
function XSet:addAt(element, scope)
    if self.locked then error("set is locked") end
    --print("addAt ", element, scope)
    if self:hasAt(element,scope) then return end
    if not self.contents[scope] then
        self.contents[scope] = {}
    end
    self.contents[scope][element] = true
    return self
end

That’s a nil scope again. Let’s put some prints back.

function CSVSet:elements()
    -- return an iterator for contents
    local pat = ".-\n"
    return coroutine.wrap(function() 
        for line in self.data:gmatch(pat) do
            print("elements ", line)
            local converted = self:convertToXSet(line)
            print("Converted ", converted:display())
            coroutine.yield(converted)
        end
    end)
end

That prints the line, but not “converted”. So we need a print inside the convert.

I discover that my better line isn’t better, because it tries to convert commas without quotes in the middle of quoted strings. I’ll write a failing test for that.

        _:test("Smarter Match", function()
            local line = '"jeffries","ron","XPROGRAMMING, Inc.",49169,"pinckney"'
            local fixed = CSVSet:fixLine(line)
            _:expect(fixed).is('"jeffries","ron","XPROGRAMMING, Inc.","49169","pinckney"')
        end)
16: Smarter Match  -- Actual: "jeffries","ron","XPROGRAMMING"," Inc.","49169","pinckney", Expected: "jeffries","ron","XPROGRAMMING, Inc.","49169","pinckney"

Can I even fix this?

OK, given that I know what he’s not quoting, I think I can. Let’s try this new scheme:

function CSVSet:fixLine(line)
    local noLeft = ',(%d)'
    local addLeft = ',"%1'
    local noRight = '(%d),'
    local addRight = '%1",'
    return line:gsub(noLeft,addLeft):gsub(noRight, addRight)
end

Now I’m looking for comma-digit and digit-comma. That test passes. I’ll unignore some more (again).

This runs:

        _:test("Convert CSV to record", function()
            local names = { "last", "first","company","zip","city" }
            local line = '"Jeffries","Ronald","XPROGRAMMING, Inc.",48169,"Pinckney"\n'
            local set = CSVSet(names,line)
            local xset = set:convertToXSet(line)
            _:expect(xset:card()).is(5)
            _:expect(xset:hasAt("Jeffries","last"),"last").is(true)
            _:expect(xset:hasAt("Ronald","first"),"first").is(true)
            _:expect(xset:hasAt("Pinckney","city"),"city").is(true)
            _:expect(xset:hasAt("XPROGRAMMING, Inc.","company"),"company").is(true)
            _:expect(xset:hasAt("48169","zip"),"zip").is(true)
        end)

That was a fairly comprehensive one. Unignore further:

Everything runs but this:

15: CSVSet restrict -- Tests:249: table index is nil

OK, we’ve still run off the end of the line. Our convertLine is probably still creating an extra field.

I find the bug. The bug is this:

function CSVSet:elements()
    -- return an iterator for contents
    local pat = ".-\n"
    return coroutine.wrap(function() 
        for line in self.data:gmatch(pat) do
            local converted = self:convertToXSet(line)
            coroutine.yield(converted,NULL)
        end
    end)
end

I formerly had just coroutine.yield(converted) with no scope provided. That was defaulting to a nil, which is bad news.

The good news is that all the tests are running, including this one:

        _:test("CSVSet restrict", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            local brighton = XSet()
            brighton:addAt("Brighton", "city")
            local restrictor = XSet()
            restrictor:addAt(brighton,NULL)
            local restricted = csv:restrict(restrictor)
            _:expect(restricted:card()).is(1)
        end)

Let’s beef up that test a bit. I need to get a record to check. Let’s do this:

        _:test("CSVSet restrict", function()
            local csv = CSVSet(CSVnames, CSVminidata)
            local brighton = XSet()
            brighton:addAt("Brighton", "city")
            local restrictor = XSet()
            restrictor:addAt(brighton,NULL)
            local restricted = csv:restrict(restrictor)
            _:expect(restricted:card()).is(1)
            local record = NULL
            for e,s in restricted:elements() do
                record = e
            end
            _:expect(record:hasAt("Josephine","first_name")).is(true)
        end)

That passes. One more test.

        _:test("CSV big restrict", function()
            local csv = CSVSet(CSVnames, CSVdata)
            local mi = XSet()
            mi: addAt("MI","state")
            local restrictor = XSet()
            restrictor:addAt(mi,NULL)
            local restricted = csv:restrict(restrictor)
            _:expect(restricted:card()).is(14)
        end)

That test passes. How did I get 14? I counted them with the help of Sublime Text.

We just executed an XST restrict command on a set of 500 records in CSV format. Commit: restrict works on CSVSets.

It’s 1218 and I’ve been at this since 8:30 or so. Let’s sum up.

Summary (Party!)

The very good news is that we have implemented a new class in our XST program, CSVSet, a subclass of XSet, containing unmodified but error-free CSV data, and we’ve implemented the elements method on it. That method cleans up the lines a bit, then parses them into XSet records, applying the field names from the CSV.

Because aCSVSet:elements() produces XSet instances, we can run operations such as restrict on our CSV data without further ado. I’m confident that other set operations would work if we had use of them: most everything comes down to elements. However, we have not yet implemented hasAt on the set, so there are probably things we couldn’t do without more work, like set equality. We can do it on the individual records, of course, since they appear as ordinary XSets.

This went amazingly smoothly. The only glitches took place when I discovered the need to enclose numeric codes in quotes, which took me a few tries. And I made the usual number of silly errors, like forgetting to yield the NULL scope.

I did feel confused there toward the end, when I really expected things to work and they didn’t, but following my nose, and the tests, turned up the issue quickly enough.

Is this a deep result, and a grand success for Extended Set Theory? I wouldn’t go that far. Certainly there was no set theoretic magic in my conversion of the CSX lines into XST records, or at least it didn’t show up on the screen. So I would not argue that I used set theory to make CSV work. I would, however, point out that in fact we did make CSV files behave like XSets well enough to permit us to query them with standard XST methods. Well, method: restrict.

For our next trick in the next day or two, we’ll do some timing on this implementation of restrict, and we’ll implement a specialized version for CSVSets. (OK, I admit that I don’t quite know how I’m going to do that but that’s literally days in the future.) Then we’ll do some timing and see what we get.

Which reminds me, let’s try timing card. We’re still using the XST one.

        _:test("Card Timing", function()
            csv = CSVSet(CSVnames, CSVdata)
            local time1 = os.time()
            for i = 1,1000 do
                csv:card()
            end
            local time2 = os.time()
            local elapsed = os.difftime(time2,time1)
            print(elapsed)
            local f1 = os.time()
            for i = 1,10000 do
                csv:xxcard()
            end
            local f2 = os.time()
            local fast = os.difftime(f2,f1)
            print(fast/10)
        end)

Notice that I’m doing 1000 of the slow on and 10,000 of the fast one, then dividing out the 10. Lua time is only good to the second. There may be something better, but I’d have to look. This prints numbers like 7.0 and 0.6. So the line-checking card is 10x faster. I’ll remove the xx and use it.

function CSVSet:card()
    local pat = ".-\n"
    local rep = "X"
    local nls = self.data:gsub(pat,rep)
    return nls:len()
end

I remove the timing check, determine tests are green, and commit: CSVSet uses faster card method.

Let’s ship it. An exciting little success today. See you next time!