FAFO on GitHub

The statistics code is much better now. But a little widget might be just the thing. Turns into a medium-sized widget, but I think it’s an improvement.

Hello, friends!

The XSet statistics method goes like this:

    def statistics(self, fields):
        makers = []
        for field in fields:
            makers.append(StatsMaker(field))
        for record, _s in self:
            for maker in makers:
                maker.record(record)
        key, _scope = self.pop()
        for maker in makers:
            stats = maker.statistics()
            key = key | stats
        return key

Compared to what we started with this morning, that’s quite the improvement. But I think we can do better.

My plan is to change this code in place, and then make it work, with no additional tests. Why? Because I think it’ll go just fine.

I think we may want to rename the StatsMaker class but let’s wait and see. Here’s my new code:

    def statistics(self, fields):
        maker = StatisticsMaker(fields)
        for record, _s in self:
            maker.record(record)
        return maker.statistics()

The plan is that the new StatisticsMaker holds the StatsMakers (soon to be renamed to StatsAccumulator) and passes all the records by them, and interrogates them to get their results.

And let’s just code this baby right up:

class StatisticsMaker:
    def __init__(self, fields):
        self._accumulators = [StatsMaker(field) for field in fields]
        self._key = None

    def record(self, xSet):
        if not self._key:
            self._key = xSet
        for accumulator in self._accumulators:
            accumulator.record(xSet)

    def statistics(self):
        result = self._key
        for accumulator in self._accumulators:
            result = result | accumulator.statistics()
        return result

And we are green. That’s more code, but I like it better. We’ll talk about that below. Let’s do rename the StatsMaker, to StatsAccumulator.

class StatisticsMaker:
    def __init__(self, fields):
        self._accumulators = [StatsAccumulator(field) for field in fields]
        self._key = None

That’s that. Commit: Replace list with StatisticsMaker.

I think we could do with a few more tests, not least getting rid of the detailed values from the key record, which are presently included in the result.

Let’s add a test to the TestStatsObject class, which has room for it.

    def test_statistics_maker_removes_fields(self):
        rec1 =  XSet.from_tuples((('it', 'department'), ('serf', 'job'), (1234, 'pay')))
        rec2 =  XSet.from_tuples((('it', 'department'), ('serf', 'job'), (4321, 'pay')))
        maker = StatisticsMaker(['pay'])
        maker.record(rec1)
        maker.record(rec2)
        stats = maker.statistics()
        assert stats['pay_mean'] == 2777.5
        assert stats['pay'] is None

This fails with pay == 1234, just as we all expected.

Expected :None
Actual   :1234

Se need to remove all the fields we are recording from the key record. How can we do that?

We can re-scope the key record with only the keys that are not in our list. Like this. (Look in record for the actual re-scoping.)

class StatisticsMaker:
    def __init__(self, fields):
        self._accumulators = [StatsAccumulator(field) for field in fields]
        self._scopes = self.make_scopes(fields)
        self._key = None

    def make_scopes(self, scopes):
        from xset import XSet
        tuples = [(scope, scope) for scope in scopes]
        return XSet.from_tuples(tuples)

    def record(self, xSet):
        if not self._key:
            all_scopes = xSet.scope_set()
            desired_scopes = all_scopes - self._scopes
            self._key = xSet.re_scope(desired_scopes)
        for accumulator in self._accumulators:
            accumulator.record(xSet)

That’s green. Commit: statistics set no longer includes statistics fields.

That’s nearly good, but I think we should extract a method to make the key set:

    def record(self, xSet):
        if not self._key:
            self.make_key_set(xSet)
        for accumulator in self._accumulators:
            accumulator.record(xSet)

    def make_key_set(self, xSet):
        all_scopes = xSet.scope_set()
        desired_scopes = all_scopes - self._scopes
        self._key = xSet.re_scope(desired_scopes)

Green. Commit: refactoring.

Cross that item off the card and throw the card away. Woot!

I think we should have a test that shows what happens if we do not supply any detail records. I think we return None and that may not be quite the thing.

    def test_statistics_maker_returns_null_with_no_input(self):
        maker = StatisticsMaker(['pay'])
        stats = maker.statistics()
        assert stats == XSet.null

Fails, but not as I had anticipated:

    def mean(self):
>       return self._sum / self._count
E       ZeroDivisionError: division by zero

We asked a question we should not have asked, so we check here:

    def statistics(self):
        from xset import XSet
        if not self._key:
            return XSet.null
        result = self._key
        for accumulator in self._accumulators:
            result = result | accumulator.statistics()
        return result

Commit: return null set if no records provided.

I don’t love that. I think we can do better. Let’s look at the whole class:

class StatisticsMaker:
    def __init__(self, fields):
        self._accumulators = [StatsAccumulator(field) for field in fields]
        self._scopes = self.make_scopes(fields)
        self._key = None

    def make_scopes(self, scopes):
        from xset import XSet
        tuples = [(scope, scope) for scope in scopes]
        return XSet.from_tuples(tuples)

    def record(self, xSet):
        if not self._key:
            self.make_key_set(xSet)
        for accumulator in self._accumulators:
            accumulator.record(xSet)

    def make_key_set(self, xSet):
        all_scopes = xSet.scope_set()
        desired_scopes = all_scopes - self._scopes
        self._key = xSet.re_scope(desired_scopes)

    def statistics(self):
        from xset import XSet
        if not self._key:
            return XSet.null
        result = self._key
        for accumulator in self._accumulators:
            result = result | accumulator.statistics()
        return result

How can we clean that up? We could change the StatsAccumulator so that it doesn’t divide by zero if it has seen no records. We want that in any case. I make a card to write the test. Then, in a fit of being a good person, I write the test:

    def test_stats_given_record_without_field(self):
        stats = StatsAccumulator("pay")
        record = XSet.from_tuples(((1000, 'pork'),))
        stats.record(record)
        assert stats._count == 0
        assert stats._sum == 0
        stats = stats.statistics()
        assert stats == XSet.null

Fails, until we do this:

Well, first, I think further. Do we really want a null set if we see no values. Or should we return zeros for all the stats? I think we should. Change the test. I’m glad I wrote it.

    def test_stats_given_record_without_field(self):
        stats = StatsAccumulator("pay")
        record = XSet.from_tuples(((1000, 'pork'),))
        stats.record(record)
        assert stats._count == 0
        assert stats._sum == 0
        assert stats.mean() == 0

I am moving too fast. I have something I want to do after this and I am rushing. Calm down.

class StatsAccumulator:
    def mean(self):
        return self._sum / self._count if self._count else 0.0

That test is good. Now let’s get back to the null set one.

    def test_statistics_maker_returns_null_with_no_input(self):
        maker = StatisticsMaker(['pay'])
        stats = maker.statistics()
        assert stats == XSet.null

We have no key values, but we do know that we are supposed to be doing stats on pay. Should we perhaps return the zeros? No, we should not. If we have seen no data our accumulator should return no results. I think.

class StatsAccumulator:
    def tuples(self):
        tups = []
        if self._count:
            tups.append((self._count, self._name+'_count'))
            tups.append((self._sum, self._name+'_sum'))
            tups.append((self.mean(), self._name+'_mean'))
        return tups

Now we return a null set if we have seen no records in a given accumulator. And therefore in the maker:

    def statistics(self):
        result = self._key
        for accumulator in self._accumulators:
            result = result | accumulator.statistics()
        return result

If we see no records, self._key is still null. And all the accumulators will return null. So we get null.

Now what about one accumulator getting information and another not?

Let’s have a test.

    def test_one_accumulator_gets_no_data(self):
        rec1 =  XSet.from_tuples((('it', 'department'), ('serf', 'job'), (1234, 'pay')))
        maker = StatisticsMaker(['pay', 'bonus'])
        maker.record(rec1)
        stats = maker.statistics()
        assert stats['pay_sum'] == 1234
        assert stats['bonus_sum'] is None

That passes, as I am sure we all expected. I think that’s a wrap. Commit: improvements to statistics edge cases.

Let’s have a little break and then sum up.

Summary

Best of all, I was really righteous about thinking of a test and then writing it immediately rather than deferring it until the code already worked. I’d do well to lean on that habit a bit more.

We have notably more code than we came in with, but we have a simpler basic stats method, and much better handling of cases where there is no data provided, either no records to accumulate, or specific records missing certain fields. (We have no intention of blessing that idea, but it might be interesting if only some people had a bonus or something.)

Changes all went smoothly, even the slightly more risky step of using the existing tests to test the initial implementation of the new StatisticsMaker class.

We do have 32 lines of code in the StatisticsMaker, reducing XSet.statistics from a dozen lines down to five. Not exactly a savings, but putting the edge cases into the original method … probably would have gotten pretty grubby and close to what we have here. Now, I think each class is pretty simple.

A thought comes to mind.
Instead of checking to see if we have a key set, we could unconditionally replace the key set every time we see a record, having initialized to null. That might simplify the code a bit but at the cost of doing more work. No, I think that would be less clear, though it would work. Belay that idea.

Overall, I think this is an improvement. Do you agree? Feel free to toot at me over on mastodon dot social.

See you next time!