FAFO on GitHub

Fancy as my new streaming select is, I think it’s, well, not quite the thing. Easily replaced, but do I need to pivot? I surprise myself.

Aside from the sheer cleverness of doing it, my rationale for the streaming select operation was that it would save memory, because the old version created the set of selected elements as a concrete set, and the new version does not. Two matters make me think this was a mistake.

First, I successfully created a set containing a billion records in memory, leading me to believe that saving memory is perhaps not as high a priority as I had thought.

Second, if the result of a select is used more than once, the entire select is run again, redundantly, wasting execution time. I have a test that does that.

Third—I was mistaken about the “two”—operators like union, intersection, difference, and symmetric difference all need to check whether a given element of one set is or is not in another set. This is slow enough if the set is in memory. Doing such an operation to a selection reruns the entire selection, once for each element in the first set!

This is bad.

Added in Post
I will shortly discover that the above isn’t quite true. Only intersection has this terrible property. The others are just fine.
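
To make the problem concrete, here is a minimal sketch, emphatically not the project's XSelect, of a lazy selection whose includes must re-run the whole scan. The final loop stands in for intersect, calling includes once per element:

class LazySelect:
    def __init__(self, source, cond):
        self.source = source  # the underlying records
        self.cond = cond  # the selection predicate
        self.scans = 0  # count the passes, for illustration

    def __iter__(self):
        self.scans += 1
        return (rec for rec in self.source if self.cond(rec))

    def includes(self, rec):
        # every membership test iterates the selection from scratch
        return any(r == rec for r in self)

evens = LazySelect(range(1000), lambda r: r % 2 == 0)
matches = [n for n in range(0, 1000, 4) if evens.includes(n)]
print(evens.scans)  # 250: one fresh scan per includes() call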

Relatedly, some of our current algorithms for the operations mentioned above are quite slow. Here they are:

class XSet:
    def diff(self, other):
        """
        set difference or relative complement
        :param other: any set
        :return: self - other
        """
        mine = set((e, s) for e, s in self)
        others = set((e, s) for e, s in other)
        remaining = mine - others
        return XSet.from_tuples(remaining)

    def intersect(self, other):
        # asks other.includes for every element of self
        return XSet.from_tuples((e, s) for e, s in self if other.includes(e, s))

    def sym_diff(self, other):
        return (self - other) | (other - self)

    def union(self, other):
        mine = set(self)
        others = set(other)
        both = mine | others
        return XSet.from_tuples(both)

Hm. One of these is not like the others. But Python sets do support intersection, spelled either intersection or &.

Small Aha
In looking to be sure Python did have intersection, I learned something I didn’t know about Python sets. You can express the union of two sets either as a.union(b) or as a|b. But there is a difference: with the method, the parameter can be any iterable and Python will deal with it; with the operator |, both operands must already be sets.
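
For example, at the console:

a = {1, 2}
a.union([2, 3])  # {1, 2, 3}: the method accepts any iterable
a | {2, 3}  # {1, 2, 3}: the operator works set to set
a | [2, 3]  # TypeError: unsupported operand type(s) for |: 'set' and 'list'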

This is why we let the code participate in our design sessions. But I digress.

The horrendously slow test uses intersection. If we did intersection as we did the others, might it be notably faster? Let’s find out. Here’s the test:

    @pytest.mark.skip("timing")
    def test_waste_time(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ee = XSet(ff)
        assert ee.cardinality() == 1000
        jeffries = ee.select(lambda e, s: e.includes('jeffries', 'last'))
        assert jeffries.cardinality() == 200
        ron = ee.select(lambda e, s: e.includes('ron', 'first'))
        assert isinstance(ron.implementation, XSelect)
        assert ron.cardinality() == 100
        coder = ee.select(lambda e, s: e.includes('coder', 'job'))
        assert coder.cardinality() == 200
        high = ee.select(lambda e, s: e.includes('12000', 'pay'))
        assert high.cardinality() == 250
        ron_jeffries = ron.intersect(jeffries)
        assert ron_jeffries.cardinality() == 20
        assert isinstance(ron_jeffries.implementation, XFrozen)
        high_coder = coder & high
        assert high_coder.cardinality() == 50
        final = ron_jeffries & high_coder
        assert final.cardinality() == 1
        assert len(final) == 1
        employee, scope = final.pop()
        assert employee.includes('jeffries', 'last')
        assert employee.includes('ron', 'first')
        assert employee.includes('coder', 'job')
        assert employee.includes('12000', 'pay')

With that one skipped, tests run this morning in 17 milliseconds. The time seems to vary over the course of the day and week. Perhaps it has to do with whatever else is consuming memory. I run them again, 10 ms. Now let’s allow the slow one to run. 417 ms.

Now let’s replace intersection:

    def intersect(self, other):
        mine = set(self)
        others = set(other)
        both = mine & others
        return XSet.from_tuples(both)

Time is 20 ms. I’d say that’s better. Dropped that test from 400 ms to 10 ms. Factor of 40, unless I miss my guess. Not bad. I wonder whether this would be better still:

    def intersect(self, other):
        mine = set(self)
        both = mine.intersection(other)
        return XSet.from_tuples(both)

No, same. I’ll put back the first version, which is more explicit.

Green. Commit: Improvement to intersect improves slow test from 400ms to 10ms.

Reflection

Wow. Interesting discovery. I came in here thinking that I needed to make some major changes, because in the back of my mind, I thought we needed to improve the operations like intersection to have a faster way of determining whether an element was in another set. That was going to push me in the direction of sorting the flat file sets so that we could do a merge operation for such things.

And that was going to create a lot more work, and I was even thinking of a major change to make flat file sets the primary data structure, and I was thinking about how to do that and whether we could retain our current generality but with these high-performance flat sets where possible …

And a four-line change reduces my concern by a factor of 40.

I still think, however, that the streaming select is not a good idea. Let’s see if we can get the old one back.

Ah. Happens that I hadn’t removed it yet:

    def select(self, cond) -> Self:
        from test_x_select import XSelect
        x_sel = XSelect(self, cond)
        return XSet(x_sel)

    def old_select(self, cond) -> Self:
        tuples = ((e, s) for e, s in self if cond(e, s))
        return XSet.from_tuples(tuples)

Let’s just rename those. There might still be a use for the streamer someday.

    def select(self, cond) -> Self:
        tuples = ((e, s) for e, s in self if cond(e, s))
        return XSet.from_tuples(tuples)

    def streaming_select(self, cond) -> Self:
        from test_x_select import XSelect
        x_sel = XSelect(self, cond)
        return XSet(x_sel)

And our time-wasting test fails, because it checks to be sure it’s using an XSelect along the way. I change it to do only streaming_select operations, just to keep the method in use. I doubt we’ll ever choose to use it, but for now we’ll let it live.

Commit: replace streaming select with non-streaming.

Let’s sum up.

Summary

I think that it was in fact a mistake to do the streaming select. I should kill my darlings, and if you were here you’d talk me into it, but I am weak, so I’ll save the method for now. It was a mistake because while we could save a little space and possibly a little time on a single select, cascaded selects repeat the work, resulting in a substantial slowdown in some cases.

The good news is that looking in that area caused us to learn a lot. A couple of simple experiments showed that reading more than one record at a time was better than just reading one. (Of course we knew that, but our tests showed us how important it was.)

Somehow that got us to wondering how much memory we could use, and we created a set of a billion records and Python processed it just fine. That gave us the courage to just read our flat file sets into memory. It leaves open a concern, which is the possibility of a set that is too large to fit.

Oh!

It also comes to me that with the change back to the non-streaming select, a selection on a flat file wouldn’t be all that much slower if we were to read it in chunks, as we did in earlier experiments. It would take a few reads instead of one big one, but it would still be only a few, and they would be rare. So we might want to back away a bit from reading the whole thing in. At a minimum, we can see that there is a fall-back position that we could take.
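
Just to sketch the fall-back position, reading fixed-length records a chunk at a time might look something like this. The names are illustrative, not our actual API:

def records_in_chunks(path, record_length, records_per_chunk=10_000):
    # Yield fixed-length records, reading the file a chunk at a time
    # rather than all at once or one record at a time.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(record_length * records_per_chunk)
            if not chunk:
                break
            for i in range(0, len(chunk), record_length):
                yield chunk[i:i + record_length]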

Another possibility, should it ever matter, would be to create a very large but fixed buffer pool and put that underneath our file-based sets. That could pretty easily be made to work well, without the danger of overflowing memory with too many billion-record sets.
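
Again, only a sketch of the idea, assuming block-oriented reads and a simple least-recently-used policy; nothing like this exists in the project yet:

from collections import OrderedDict

class BufferPool:
    def __init__(self, file, block_size=1_048_576, max_blocks=64):
        self.file = file
        self.block_size = block_size
        self.max_blocks = max_blocks  # fixed memory ceiling
        self.blocks = OrderedDict()  # block number -> bytes

    def read_block(self, n):
        if n in self.blocks:
            self.blocks.move_to_end(n)  # mark most recently used
            return self.blocks[n]
        self.file.seek(n * self.block_size)
        data = self.file.read(self.block_size)
        if len(self.blocks) >= self.max_blocks:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[n] = data
        return data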

Finally, we knew that intersection was a problem. However, we (OK, I; you had nothing to do with the mistake) had forgotten that intersection used a particularly slow approach in its implementation, and deferring the implementation down to Python’s set objects sped things up by a factor of 40.

That made the whole concern go away. I am pleased, because although there might be value in exploring some faster mechanisms, it’s best to be able to rely on existing structures and processes, so that we can focus on capability rather than performance.

Learning

There are some lessons here.

First, and this is a big one: let our code participate in our design sessions. My speculation about what I might have to do about slow intersection was interesting, but the changes would have been somewhere between substantial and profound. A review of a few lines of code suggested an experiment, and the experiment made the concern go away.

Second, relatedly, performance improvements are tricky. No, more tricky than that. No, even more tricky. I recall Kent Beck injecting the phrase “a performance measurement having shown …” into discussions of how we might “optimize” something. I relearn the lesson often: when working to improve performance, I need to provide some tests that allow me to determine the impact of my proposed changes.

The result of having those tests today is a perfect example. Instead of the deep and complicated changes that I was thinking about, a simple change sped things up by such a large factor that the issue is just gone.

An outstanding result, and entirely different from what I expected only an hour and 40 minutes ago!

See you next time!