Finishing

FAFO on GitHub

The thing about learning is that when we have sucked most of the learning juice out of some fruitful idea, there is often work to be done to get the thing finished. We have at least two of those hanging fruits right now. One of them has an interesting effect!

Hello, friends. I apologize for the mixed fruit salad metaphor there. But we do have at least two things partially done, calculations and our latest invention, the XSelection object. We really should get those into play.

XSelect is a new XImplementation that represents a conditional operation on a set, selecting only elements that meet the condition. It is presently triggered only by an experimental record, but is intended to replace the existing select operator in XSet. Select is used by the restrict operator and can be used directly. XSelection streams its result rather than accumulating its records and then iterating them.

XCalculation is an XImplementation that represents one or more calculated fields applied to a set of sets. It streams its result, producing sets that contain the fields of the source set plus the calculated values. There is presently no XSet operation to create an XCalculation: they are only used in tests.

XSelect

Let’s see about plugging in the XSelect. We have a temporary selection method in XSet:

class XSet:
    def select(self, cond) -> Self:
        tuples = ((e, s) for e, s in self if cond(e, s))
        return XSet.from_tuples(tuples)

    def x_select(self, cond) -> Self:
        from test_x_select import XSelect
        x_sel = XSelect(self, cond)
        return XSet(x_sel)

The plan would be to replace select with what’s in x_select, so that all selections will stream. This seems to me to be entirely harmless, even entirely good.

We need to move XSelect into the src tree: it is currently in the tst tree. That is trivial, using PyCharm, once I remember how to type F6. Commit: Move XSelect to src.

Now we’ll remove the old select, rename the x_select, and see if we have tests to rewire. I think I did that in the right order. We are green. Commit: all select operations use XSelect.

Reflection

This is actually rather a nice result. All our selections and restrictions are now using the XSelect streaming object. We implemented it in just a few steps yesterday or the day before, and plugged it in this morning by removing a method and renaming one, and our tests all go green. I have a lot of confidence in this program. That’s not to say that there may not be any defects, but with 150 tests, and tests for (almost?) every operation and class, I am feeling confident.

Does confidence go before a fall? We’ll find out. I’m just reporting the feeling. Sometimes I do not feel confident and that signals that I could benefit from more tests or from code that is more clear, or from a better allocation of responsibility to objects. Here, now, today … I feel confident.

XCalculation

If XSelect is the right name, is XCalculation the right name? What are the subclasses of XImplementation?

popup showing list of subclasses

XCalculation, XFlat, XFlatFile, XFrozen, XSelect, XTuple. Meh. Close enough to reasonable names.

Let’s review the XCalculation tests to see where we were when we saw the squirrel.

There are two direct tests, one of them marked to skip until later. Perhaps later will turn out to be now:

class TestExpressionSets:
    @pytest.mark.skip("later")
    def test_total_pay(self):
        p1 = XSet.from_tuples((('joe', 'name'),('10000', 'salary'), ('2345', 'bonus')))
        p2 = XSet.from_tuples((('sam', 'name'),('50301', 'salary'), ('4020', 'bonus')))
        personnel = XSet.n_tuple((p1, p2))
        total = 'totalpay = salary + bonus'
        result = personnel.calculate([total])
        r1 = result[1]
        tp1 = r1['totalpay']
        assert tp1 == '12345'
        r2 = result[2]
        tp2 = r2['totalpay']
        assert tp2 == '54321'

    def test_x_calculation_iter(self):
        p1 = XSet.from_tuples((('joe', 'name'),('10000', 'salary'), ('2345', 'bonus')))
        p2 = XSet.from_tuples((('sam', 'name'),('50301', 'salary'), ('4020', 'bonus')))
        personnel = XSet.n_tuple((p1, p2))
        total = 'totalpay = salary + bonus'
        double = 'double = bonus * 2'
        x_calc = XCalculation(personnel, [total, double])
        for e, s in x_calc:
            if s == 1:
                assert e['name'] == 'joe'
                assert e['totalpay'] == '12345'
                assert e['double'] == '4690'
            elif s == 2:
                assert e['name'] == 'sam'
                assert e['totalpay'] == '54321'
                assert e['double'] == '8040'

I think we fixed the lexer to allow underbar in the names, so let’s change totalpay to total_pay throughout. Yum I love me some multi-cursor action. All changed, all green. Commit: Rename calculated field to check underbar.

In addition to these two tests, there are 18 tests of the details of lexing, parsing, and calculating in another file. I think we are ready to move XCalculation to src and make the skipped test run.

Would it be foolish to do it in that order? Wouldn’t it be prudent to make the test work first? I think it would. Let’s look at it in detail:

    @pytest.mark.skip("later")
    def test_total_pay(self):
        p1 = XSet.from_tuples((('joe', 'name'),('10000', 'salary'), ('2345', 'bonus')))
        p2 = XSet.from_tuples((('sam', 'name'),('50301', 'salary'), ('4020', 'bonus')))
        personnel = XSet.n_tuple((p1, p2))
        total = 'total_pay = salary + bonus'
        result = personnel.calculate([total])
        r1 = result[1]
        tp1 = r1['total_pay']
        assert tp1 == '12345'
        r2 = result[2]
        tp2 = r2['total_pay']
        assert tp2 == '54321'

Seems straightforward enough. We “just” need the calculate method on XSet.¹

Reviewing the other test, we see the steps that will need to be taken. Well, we see one thing, the creation of the XCalculation XImplementation. Let’s write a new test that patches that into an XSet and check the result. It’s one step along the way. I’d rather take a few small steps rather than one big one.

    def test_x_calculation_patch_set(self):
        p1 = XSet.from_tuples((('joe', 'name'),('10000', 'salary'), ('2345', 'bonus')))
        p2 = XSet.from_tuples((('sam', 'name'),('50301', 'salary'), ('4020', 'bonus')))
        personnel = XSet.n_tuple((p1, p2))
        total = 'total_pay = salary + bonus'
        double = 'double = bonus * 2'
        x_calc = XCalculation(personnel, [total, double])
        calc_set = XSet(x_calc)
        joe_rec = calc_set[1]
        assert joe_rec['name'] == 'joe'
        assert joe_rec['total_pay'] == '12345'
        assert joe_rec['double'] == '4690'
        sam_rec = calc_set[2]
        assert sam_rec['name'] == 'sam'
        assert sam_rec['total_pay'] == '54321'
        assert sam_rec['double'] == '8040'

That was easy. Now let’s un-skip that other test and make it go. No, wait, commit: additional test.

Removing the skip, we find that the test fails for want of XSet calculate:

    def test_total_pay(self):
        p1 = XSet.from_tuples((('joe', 'name'),('10000', 'salary'), ('2345', 'bonus')))
        p2 = XSet.from_tuples((('sam', 'name'),('50301', 'salary'), ('4020', 'bonus')))
        personnel = XSet.n_tuple((p1, p2))
        total = 'total_pay = salary + bonus'
        result = personnel.calculate([total])
        r1 = result[1]
        tp1 = r1['total_pay']
        assert tp1 == '12345'
        r2 = result[2]
        tp2 = r2['total_pay']
        assert tp2 == '54321'

Let’s make that test look more like the one above, which is more clear and more comprehensive:

    def test_total_pay(self):
        p1 = XSet.from_tuples((('joe', 'name'),('10000', 'salary'), ('2345', 'bonus')))
        p2 = XSet.from_tuples((('sam', 'name'),('50301', 'salary'), ('4020', 'bonus')))
        personnel = XSet.n_tuple((p1, p2))
        total = 'total_pay = salary + bonus'
        result = personnel.calculate([total])
        joe_rec = result[1]
        assert joe_rec['name'] == 'joe'
        assert joe_rec['total_pay'] == '12345'
        assert joe_rec['double'] == '4690'
        sam_rec = result[2]
        assert sam_rec['name'] == 'sam'
        assert sam_rec['total_pay'] == '54321'
        assert sam_rec['double'] == '8040'

Not going to commit that: it’s not green yet.

We need calculate:

I’ve done something wrong. Roll back, do again, differently. This time I’ll move XCalculation to src first. I think I’ll move it into the same file as the parser and token. That works. Green. Commit: Move XCalculation to src tree.

Now then, as I was saying, remove the skip mark and make the test run.

I stumble a bit. First, I seem to have done something odd with my imports, so I have to do this:

    def calculate(self, expressions):
        from x_calculation import XCalculation
        x_calc = XCalculation(self, expressions)
        return XSet(x_calc)

If I put that import at the top of the file, pytest can’t build the tests, which usually means a circular import somewhere. Yes, XCalculation does create an XSet as part of its work:

    def create_calculated_values(self, record):
        from xset import XSet
        calculated = ((calc.result(record), calc.scope()) for calc in self.expressions)
        return XSet.from_tuples(calculated)

In looking at this I realize that somewhere earlier I said that XCalculation streamed its results. We see here that it does not. Make a card to fix that.

I also found that I had managed to put XCalculation into two files. That’s one too many. Fixed that.

Just: Did I mention “just”? I think I did …

Now we have one more issue and it is rather important. My tests used to run in about 150 ms. With the new streaming select, they are taking 3 whole seconds. All the time is spent in this test:

    def test_waste_memory(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ee = XSet(ff)
        assert ee.cardinality() == 1000
        jeffries = ee.select(lambda e, s: e.includes('jeffries', 'last'))
        assert jeffries.cardinality() == 200
        ron = ee.select(lambda e, s: e.includes('ron', 'first'))
        assert ron.cardinality() == 100
        coder = ee.select(lambda e, s: e.includes('coder', 'job'))
        assert coder.cardinality() == 200
        high = ee.select(lambda e, s: e.includes('12000', 'pay'))
        assert high.cardinality() == 250
        ron_jeffries = ron.intersect(jeffries)
        assert ron_jeffries.cardinality() == 20
        high_coder = coder & high
        assert high_coder.cardinality() == 50
        final = ron_jeffries & high_coder
        assert final.cardinality() == 1
        assert len(final) == 1
        employee, scope = final.pop()
        assert employee.includes('jeffries', 'last')
        assert employee.includes('ron', 'first')
        assert employee.includes('coder', 'job')
        assert employee.includes('12000', 'pay')

What is happening? Well, we are selecting sequentially from all the records (which are on a file), going through a series of select operations, each of which winnows down the file a bit, until we have a select that winds up with one record. And then we ask those sets for their cardinality, which will read all the records to count them. And we intersect them in various combinations, again doing all the reads down to the bottom.

Let’s commit, with that test marked to skip. Commit: XSet calculate implemented. Slow test due to streaming select marked skip.

Now let’s settle down and reflect.

Reflection

My tests run automatically, and there is no actual check for the overall timing, so as long as they’re saying green when I think they should, I don’t necessarily notice how long they’re taking. So the change in that one test slipped by me for a bit. I’ve made a not to look into whether there’s a way to test the overall test timing in pytest. I’d add a check for that if there is a reasonable way.

What about the actual issue? In the test in question, we create four different selects against a thousand record file-based set. We ask each one for its cardinality, reading all the records of the main set each time. then we intersect two of them, which will read the underlying file a lot:

    def intersect(self, other):
        return XSet.from_tuples((e, s) for e, s in self if other.includes(e, s))

For every record in the one select, we’ll read all the records of the other. Since the select is reading down to the bottom, I think this is going to do a million reads on each intersection.

Let’s see if we can find out how many reads we do. I add a class variable and increment it:

class XFlatFile(XImplementation):
    read_count = 0

    def get_record(self, index):
        seek_address = index*self.record_length
        with open(self.full_file_path, "r") as f:
            self.__class__.read_count += 1
            f.seek(seek_address)
            rec = f.read(self.record_length)
        return rec

I zero it at the beginning of the slow test and assert it at the end:

        assert XFlatFile.read_count == 261746

Naturally I got that value by running the test. So that one test does 271000 file open, read, close cycles. You can see how that might be slow.

I mark it to skip again and commit.

Now the point of that test was to be wasteful. We have a test that gets the result more like the right way:

    def test_do_not_waste_memory(self):
        XFlatFile.read_count = 0
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ee = XSet(ff)

        def sel(e, s):
            return (e.includes('jeffries', 'last') and
                    e.includes('ron', 'first') and
                    e.includes('coder', 'job') and
                    e.includes('12000', 'pay'))

        final = ee.select(sel)
        assert final.cardinality() == 1
        employee, scope = final.pop()
        has_jeffries = employee.includes('jeffries', 'last')
        assert has_jeffries
        assert employee.includes('ron', 'first')
        assert employee.includes('coder', 'job')
        assert employee.includes('12000', 'pay')
        assert XFlatFile.read_count == 1017

That only does 1017 reads. Interestingly, if we do not ask for the cardinality and just pop the first record, it only does 16 reads. Why? Because there is only one and it is near the top of the file, and XSet.pop() just doe a next and stops. If we do the cardinality, we’ll keep reading the set to see if there are other records meeting the selection. There are not, but the reads are done anyway.

So this is interesting and probably important: if we cascade selects, they will iterate over the file a lot. Similarly if we have two sets representing, say, an intersection, we’ll iterate the second set a lot, because we have a very naive implementation of intersect.

What can we ever do? Are we in big trouble, guys?

I don’t think we’re in big trouble, although it is clear that there’s some learning to be done here. We do have options. We can probably figure out a way to cache results of a select the first time it is run, although we wrote the streaming one explicitly to avoid creating the intermediate set. We can readily have a set operation that explicitly creates a physical copy of an input set. And we can certainly do a better thing with an XFlatFile than read single records with an open/close on each one. I mean, we’ve heard of buffering, right?

Summary

I’ve written four notes:

Select can get slow, especially over files;
Buffer?
Change XCalc to stream
Some say to check overall test time?

We do have a surprising result, which is that misused, a select over a file can generate a cubic meter of reads with no trouble at all. I don’t think we should worry about that much, but it is on the list.

I am pleased with how things went, despite the small issue of the 271,000 file accesses. We’ll figure something out for that. All in all, these powerful changes are going in quite smoothly.

See you next time!

Danger, Will Robinson, danger! Don’t trust any idea that relies on “just”. ↩