FAFO on GitHub

OK with that out of my system, what shall we code today? It goes so nicely, until oops.

Hello, friends.

Having expressed my general ennui in this morning’s first article, I plan now to look at the code and see what we might want to work on. There’s always something.

Inefficient Operations

We have a number of operations that, while compact and pretty clear, are not efficient, at least at some scales. Here are a few examples:

    def group_by(self, a_scope):
        projector = XSet.classical_set((a_scope,))
        projected = self.project(projector)
        group = []
        for e, s in projected:
            name = e[a_scope]
            values = self.select(lambda detail, _s: detail[a_scope] == name)
            obj = GroupHolder(name, values)
            group.append(obj)
        return sorted(group, key=lambda x: x.name)

This may be the worst. For every unique key value in the scope being grouped upon, this operation does a complete scan of the input set, selecting all the elements containing that value. If you have 20 departments in your division, we’ll scan the entire employee database 20 times to create the group holders.

    def diff(self, other):
        """
        set difference or relative complement
        :param other: any set
        :return: self - other
        """
        mine = set((e, s) for e, s in self)
        others = set((e, s) for e, s in other)
        remaining = mine - others
        return XSet.from_tuples(remaining)

Here we do an interesting thing. We create two Python sets with the tuples of the two input sets, and rely on Python set difference to do the job. Issues include:

  • Some XSets are already stores in frozenset form. We need not convert them in that case, we could just difference the frozen sets.

  • We need not explode the set into (e, s), we could just leave them as tuples. We’ll do that below in just a moment.

  • The from_tuples method converts to a frozen set, which may be OK, and then checks all the tuples for being 2-tuples, which is surely wasteful in this case.

  • The sets in question could be too large for Python to handle this way. If they were, the program would crash.

Let’s fix the tuples concern:

    def diff(self, other):
        """
        set difference or relative complement
        :param other: any set
        :return: self - other
        """
        mine = set(t for t in self)
        others = set(t for t in other)
        remaining = mine - others
        return XSet.from_tuples(remaining)

Commit: tiny optimization.

We do have a nascent convention for allowing implementations to provide a more specific implementation of an XSet method if they wish. For example:

    def rename_contents(self, re_scoping_set: Self):
        try:
            return self.implementation.rename_contents(re_scoping_set)
        except AttributeError:
            new_tuples = ((e.rename(re_scoping_set), s) for e, s in self)
            return XSet.from_tuples(new_tuples)

So let’s see about providing a diff operation for XFrozen implementations.

We only have one test for diff:

    def test_diff(self):
        a = XSet.from_tuples((('first', 'first_name'), ('last', 'last_name')))
        b = XSet.from_tuples((('last', 'last_name'), ))
        diff = a.diff(b)
        assert diff.includes('first', 'first_name')
        assert diff.excludes('last', 'last_name')

Happens that both these sets will be frozen set. Let’s create a diff test that is not frozen-set based.

    def test_diff_not_frozen(self):
        a = XSet.n_tuple((10, 2, 30))
        b = XSet.from_tuples(((2, 2), ))
        diff = a.diff(b)
        assert diff.includes(10, 1)
        assert diff.includes(30, 3)
        assert diff.excludes(2, 2)

That’s running green. Commit: new test.

Now we refactor. Is this refactoring, you ask? Well, yes, because when we’re done, all the behavior will be the same. Of course it’s all new code, so it isn’t QUITE like your usual extract this sort of thing. In the end, the question is moot. What we’re going to do is change the implementation and keep the tests running.

More accurately, get the tests running again, but let’s see how well we can do keeping them green throughout.

We propose to add diff to XFrozen. Let’s first change XSet’s diff once and for all:

    def diff(self, other):
        """
        set difference or relative complement
        :param other: any set
        :return: self - other
        """
        try:
            return self.implementation.diff(other)
        except AttributeError:
            mine = set(t for t in self)
            others = set(t for t in other)
            remaining = mine - others
            return XSet.from_tuples(remaining)

Still green. Commit: diff now tries implementation diff.

Now in XFrozen:

    def diff(self, other):
        raise AttributeError

Green. Commit: XFrozen diff always raises AttributeError.

OK, now I have to get serious here.

    def diff(self, other):
        if True:
            raise AttributeError
        else:
            diff = self.data - other.data
            return XSet(frozenset(diff))

Green. Commit: moving toward frozen diff.

    def diff(self, other):
        from xset import XSet
        if not isinstance(other.implementation, self.__class__):
            raise AttributeError
        else:
            diff = self.data - other.data
            return XSet(frozenset(diff))

Green. Commit: XFrozen has own implementation of diff.

Of course we have no proof that our implementation is ever being used. We need a test for the other side anyway:

    def test_diff_not_frozen_other_way(self):
        a = XSet.from_tuples(((10, 10), (2, 2), (30, 3)))
        b = XSet.n_tuple((10, 2))
        diff = a.diff(b)
        assert diff.excludes(10, 1)
        assert diff.excludes(2, 2)
        assert diff.includes(30, 3)

They are all passing. How can I be sure my code is ever running?

Arrgh! In trying to find a clever way to make sure my code is running, I have found that it is failing, every time, with AttributeError, so that the original XSet version runs. The good news is that it recovers and gets the right answer. The bad news is that my XFrozen implementation is borked and I don’t see why.

A bit more thinking than it should have needed gives me this:

    def diff(self, other):
        from xset import XSet
        if not isinstance(other.implementation, self.__class__):
            raise AttributeError
        else:
            diff = self.data - other.implementation.data
            impl = self.__class__(diff)
            return XSet(impl)

I was passing the frozenset that comes back from the diff directly to XSet, and it has to be wrapped in an XFrozen.

But we still have no real proof that we’re ever using the frozen implementation. We can change the implementation to make it fail:

    def diff(self, other):
        from xset import XSet
        if not isinstance(other.implementation, self.__class__):
            raise AttributeError
        else:
            diff = self.data + other.implementation.data
            impl = self.__class__(diff)
            return XSet(impl)

Lots of things fail. Setting the + back to - fixes them. We’d need something weird to directly test that we actually reach that code. Even if we pull it out as a separate method, it would be difficult without a monkey patch or something equally weird.

For now, I’m convinced that it does in fact work.

Commit: XFrozen implements diff. Really, this time.

Reflection

It was going very nicely, green to green to green, right up until I realized that somehow my code was falling back to the fallback implementation in every case. Then I went ragged for a bit. The error was subtle, because it happened that my mistake threw the exception that I was looking for. Hey! What we could do is to make a custom exception, NotHandled or something, and then any other error would look like a test failure.

That’s such a good idea that I’ll do it next time I come to bat.

A larger question is whether this was worth doing at all. Well, from a performance viewpoint, we had no measurement to show that we needed the optimization, so making the system more complicated to gain this improvement isn’t what you could call a clear win. On the other hand, we did learn a few things, and a good idea has come out. So that’s nice.

Possibly what we “really” should do is to come up with a design where all the set operations automatically try their implementations and only fall back when they must. I think that even with only as many implementations as we have now, that might get cumbersome if we tried to deal with the combinations. That said, with everything currently coming down to the XFrozen implementation, we might do well to at least handle that case, as we did here, here both operands are frozen.

Viewing this morning’s exercise as an experiment, I’d say that the team should sit down and talk about what has been learned, and decide how to proceed. Options include only improving things known to need it, or devising a standard way to do what we did today, on the grounds that it could be made quite easy and would pay off a bit if not a lot.

That said, while we have a tiny gain here, we have the group_by possibility hanging right there, where we can save multiple passes over the entire input set. We can double, quadruple, mumbletuple the speed of group_by if we can come up with a one-pass way of doing it. That would more clearly be worth it.

And what we might really do is to devise a standard way to write timing tests so that we can focus our efforts on things that really matter, and then work up a candidate list, do a few tests, and go from there.

Summary

I was doing so nicely right up until that misstep that caused a fall back to the base method. I am so tempted to change the article to make it look as if I did it right all the way. But no: my point here is to show you that try as we might, we do mess up sometimes, and we always manage to recover. I just wish I were not quite so good at teaching that particular lesson.

See you next time!