FAFO on GitHub

If there were a kind of set that could be expressed as a function, we could possibly pipeline operations, reducing memory impact. Is that possible, and is it a good idea? So far, maybe not.

Good morning, friends!

It is 0700 hours here and I am awake and rarin’ to go. I’ve been thinking about sets as functions or function sets or whatever we might call it. As all here know—or do not care about—a function is really just a set of ordered pairs (for values of “really”). The pairs are pairs of elements, one from the “domain” of the function, the set of valid inputs, and one from the “range”, the set of valid outputs. We are generally interested only in the value of a function at a specific point: square root of two sort of thing. In our XST program, the value is a set, the function evaluated at a specific input: intersection of these two sets sort of thing.
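
To make that concrete, here’s a tiny plain-Python sketch—not XST code at all—of a function represented directly as a set of ordered pairs, where “calling” the function is just looking up the pair whose first element matches the input:

    # plain Python, just to illustrate: a function as a set of pairs
    square = {(1, 1), (2, 4), (3, 9), (4, 16)}

    def apply(function_pairs, x):
        # find the pair whose domain element matches the input
        for domain_value, range_value in function_pairs:
            if domain_value == x:
                return range_value
        raise ValueError(f'{x} is not in the domain')

    assert apply(square, 3) == 9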

Our set operations can be thought of as doing two things. They gnaw away at what could be a lot of input values, and they produce some number of output values, which they store in memory in some XImplementation. Sometimes the latter of those two things is a problem.

As a truly awful case, let’s suppose that we decide to rename the “pay” field of the EE set to “compensation”. Unless we do some clever optimization, accomplishing this would read all the records of the EE set, break them up into value-name (element-scope) pairs, and produce all those pairs as output, except that value-pay would turn into value-compensation.

At the end of this operation we would have a complete copy of the EE set, in memory, all broken out into value-name pairs.
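
In code, the naive version might look something like this sketch. None of these names are from the real program, and I’m assuming records iterate as value-scope pairs:

    # hypothetical sketch of the naive rename: read everything, copy everything
    def rename_scope(records, old, new):
        result = []
        for record in records:
            pairs = set()
            for value, scope in record:  # assuming records yield (value, scope)
                pairs.add((value, new if scope == old else scope))
            result.append(frozenset(pairs))
        return result  # a complete in-memory copy of the set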

Now, we might well do something intentional to optimize that. Our setops have the ability to ask the set’s implementation if it would care to do the job, and in the case of the EE set, we could cleverly create a new symbol table, with “compensation” where “pay” used to be, and then create a new XSet pointing to the same file with the different symbol table. That wouldn’t process any records at all, and that would be great.
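
Using only the calls that appear in this post’s tests, the zero-pass version might look about like this. Whether XFlatFile is happy having its field layout swapped out this way is an assumption on my part:

    # sketch: rename 'pay' to 'compensation' by building a new field
    # layout and pointing a new XSet at the same file
    new_layout = ('last', 12, 'first', 12, 'job', 12, 'compensation', 8)
    fields = XFlat.fields(new_layout)
    renamed_ee = XSet(XFlatFile('~/Desktop/job_db', fields))
    # no records were read; the file on disk is untouched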

Channeling Dr Ian Malcolm

I am in that liminal space where I could do something and would be wise to think about whether I should do it.

It would be good to know how to do this thing. Why? Because learning how to make the computer do things is our business, and the more we know, the more capable, productive, and joyful our work is. And it surely wouldn’t take much time to try it. I’m sure I could do one example this morning.

But would the thing be useful? It’s all well and good to think about streaming through a number of operations, passing small amounts of data along until finally some function at the end decides to actually save the final answer, which might be very small.
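
In plain Python, generators already give you that shape. Here’s a sketch of what streaming selects could look like; the names are invented and the records are assumed to be simple dicts:

    # each stage is lazy: records flow through one at a time, and no
    # intermediate result set is ever materialized
    def stream_select(records, predicate):
        for record in records:
            if predicate(record):
                yield record

    def pipeline(records):
        stage1 = stream_select(records, lambda r: r['last'] == 'jeffries')
        stage2 = stream_select(stage1, lambda r: r['first'] == 'ron')
        return stream_select(stage2, lambda r: r['job'] == 'coder')
    # only the records surviving every predicate are ever held in memory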

On the other hand, this tiny laptop that I use to run this code has 16GB of memory. My sample personnel flat file record is 44 bytes long. The one at Chrysler was, I forget, maybe 1000 bytes. My laptop has room for 16 million such records, although that wouldn’t leave much memory for the program and my 37 browser tabs. I might have to upgrade my laptop.

Python, of course, is built to create and destroy objects all day long. If we run a series of set operations now, it will allocate whatever memory is needed and as soon as we let go of those objects, it can reuse that memory. Let’s write a test that uses a bunch of memory.

    def test_waste_memory(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ee = XSet(ff)
        assert ee.cardinality() == 1000
        # four separate selects: each one is a full pass over the file,
        # and each materializes its result set in memory
        jeffries = ee.select(lambda e, s: e.includes('jeffries', 'last'))
        assert jeffries.cardinality() == 200
        ron = ee.select(lambda e, s: e.includes('ron', 'first'))
        assert ron.cardinality() == 100
        coder = ee.select(lambda e, s: e.includes('coder', 'job'))
        assert coder.cardinality() == 200
        high = ee.select(lambda e, s: e.includes('12000', 'pay'))
        assert high.cardinality() == 250
        # the intersections then work on those in-memory results
        ron_jeffries = ron.intersect(jeffries)
        assert ron_jeffries.cardinality() == 20
        high_coder = coder & high
        assert high_coder.cardinality() == 50
        final = ron_jeffries & high_coder
        assert final.cardinality() == 1
        employee, scope = final.pop()
        assert employee.includes('jeffries', 'last')
        assert employee.includes('ron', 'first')
        assert employee.includes('coder', 'job')
        assert employee.includes('12000', 'pay')

So this is probably not an unreasonable example of a thing that could happen, although we’d do better to do it this way:

    def test_do_not_waste_memory(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ee = XSet(ff)

        def sel(e, s):
            return (e.includes('jeffries', 'last') and
                    e.includes('ron', 'first') and
                    e.includes('coder', 'job') and
                    e.includes('12000', 'pay'))

        # a single select: one pass, one result set
        final = ee.select(sel)
        assert final.cardinality() == 1
        employee, scope = final.pop()
        assert employee.includes('jeffries', 'last')
        assert employee.includes('ron', 'first')
        assert employee.includes('coder', 'job')
        assert employee.includes('12000', 'pay')

The second test does it all in one pass over the thousand records, while the wasteful one makes four passes over those same records and holds four intermediate result sets in memory along the way.

The thing is, all 91 of my tests run in 155 milliseconds. If they create the job_db file, 177 ms.

As a learning experiment, I might set up a streaming function “set”. Not this morning: interruptions have occurred, but more importantly, I see no evidence that it’s needed. For some reason, that’s holding me back, at least until things become clearer to me.

Tempting

It is very tempting to do this thing, just to flex. But not yet. Waiting is.

I think what would be interesting would be to make a set of 100,000 records in a file, and do some tests on that. Remember that we open and close the file on every record access, pretty much the slowest possible way to read a file. And it’s still quite quick.

Here’s a test that reads the whole file 100 times, which means 100,000 open/read/close operations.

    def test_100_thousand(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ee = XSet(ff)
        start = datetime.now()
        for _ in range(100):  # 100 passes over the 1000-record file
            jeffries = ee.select(lambda e, s: e.includes('jeffries', 'last'))
        elapsed = datetime.now() - start
        print(elapsed.total_seconds())
        assert elapsed.total_seconds() < 1.5

The test runs in 1.35 seconds. Not thrillingly fast, but hm let’s see, 0.0000135 seconds per read, 13.5 microseconds per open/read/close. Imagine what we could do if we did a little file buffering.
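
For instance, since the records are fixed-length (44 bytes in this sample), a buffering version could open the file once and seek to each record. This BufferedFlatFile is entirely hypothetical, not how XFlatFile actually works:

    import os

    # hypothetical: hold the file open and seek, instead of doing an
    # open/read/close per record
    class BufferedFlatFile:
        def __init__(self, path, record_length=44):
            # opened once, not once per record
            self.file = open(os.path.expanduser(path), 'rb')
            self.record_length = record_length

        def record_at(self, index):
            self.file.seek(index * self.record_length)
            return self.file.read(self.record_length)

        def close(self):
            self.file.close()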

The function idea remains interesting, but for now, the payoff just isn’t there. Maybe when I get a bit more thinking in, it will come up again. One possibility is that expressing a solution might be easier in a more function-oriented fashion. That might be worth exploring, but honestly I think we’d still mostly store intermediate answers in temps, even if they were just functions instead of actual values. I suspect the code would look the same. Unless we went to a LISP style again … which isn’t out of the question, I suppose.

For now, what we really need is a real problem to point this thing at.

Failing that … I’ll think of something. See you then!