FAFO on GitHub

The statistics code runs, but is written in 1960s style—yes, I DO know what 1960s code looks like—and needs improvement. An interesting series of events with a fine result.

Hello, friends!

Here’s the statistics method as you last saw it:

    def statistics(self, fields):
        statistics = {}
        for field in fields:
            statistics[field] = { field+'_count': 0, field+'_sum': 0}
        for e, s in self:
            for field in fields:
                value = e[field]
                count_name = field+'_count'
                sum_name = field+'_sum'
                entry = statistics[field]
                entry[count_name] = entry[count_name] +1
                entry[sum_name] = entry[sum_name] + value
        new_tuples = []
        for field_name, field_values in statistics.items():
            for name, value in field_values.items():
                new_tuples.append((value, name))
            sum_name = field_name+"_sum"
            count_name = field_name+"_count"
            mean_name = field_name+"_mean"
            new_tuples.append((field_values[sum_name] / field_values[count_name], mean_name))
        new_items = XSet.from_tuples(new_tuples)
        key, _scope = self.pop()
        result = key | new_items
        return result

Later yesterday, while not writing a contemporaneous article, I tried a few refactoring and tidying steps, and now it looks like this:

    def statistics(self, fields):
        def count_field(field_name):
            return field_name+'_count'

        def sum_field(field_name):
            return field_name+'_sum'

        def mean_field(field_name):
            return field_name+'_mean'

        statistics = {}
        for field in fields:
            statistics[field] = {count_field(field): 0, sum_field(field): 0}
        for record, _s in self:
            for field in fields:
                entry = statistics[field]
                entry[count_field(field)] += 1
                entry[sum_field(field)] = entry[sum_field(field)] + record[field]
        new_tuples = []
        for field, values in statistics.items():
            new_tuples.append((values[count_field(field)], count_field(field)))
            new_tuples.append((values[sum_field(field)], sum_field(field)))
            new_tuples.append((values[sum_field(field)] / values[count_field(field)], mean_field(field)))
        new_items = XSet.from_tuples(new_tuples)
        key, _scope = self.pop()
        result = key | new_items
        return result

There’s nothing terribly different there. I removed the loop inside the final loop, because it only created the two items count and sum, and inlining it seemed to make more sense. Otherwise, I just made those little helper functions for creating the field names, and renamed some things. Still much the same.

Reflection

What’s not to like about this? Well, a lot of dictionary access: we have a dictionary of dictionaries. It’s all in open code: the dictionaries aren’t really helping us; they’re just containing our results. And a lot of naming, over and over, a total of about a dozen calls to the little naming functions. Our objects are not helping us and we are suffering from “primitive obsession”, given that a dictionary is a primitive in Python.
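
Concretely, by the end of the accumulation loops, the structure we are hauling around looks something like this (a sketch with invented values, supposing fields named pay and bonus):

    statistics = {
        'pay':   {'pay_count': 4, 'pay_sum': 10000},
        'bonus': {'bonus_count': 4, 'bonus_sum': 2000},
    }

Every access burrows through two levels, building a key string on the way.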

What if there was a helper object that did statistics when fed a series of values? It would accumulate count and sum and someday sums of squares and whatnot, and return the stats on request. From the look of the code we have now, it might even return a set of tuples, or a small XSet of scoped stats with the proper name.

Let’s write some tests for that thing, to get a sense of how it might work. I have a lot of tests in the group file; let’s create a new test file for this little guy.

class TestStatsObject:
    def test_stats(self):
        stats = StatsMaker("pay")

This drives out the class. I think it should know the name of the field it is doing stats for, because we intend that it will return them in some useful form, and the names have that clever prefix thing going.

class StatsMaker:
    def __init__(self, name):
        self._name = name
        self._count = 0
        self._sum = 0

Green. Commit: initial StatsMaker.

Continue that initial test:

class TestStatsObject:
    def test_stats(self):
        stats = StatsMaker("pay")
        stats.value(1000)
        stats.value(2000)
        stats.value(3000)
        stats.value(4000)
        assert stats._count == 4

Fails, of course, for want of value. We code—OK, just to show you I can do it if I want to—the absolute minimum:

    def value(self, number):
        self._count += 1

Green. Commit: StatsMaker can count.

My point is, I often take steps that are larger than the smallest possible one. I try to take steps that are small enough, but also large enough. I may be mistaken in doing that. It’s quite possible that taking the smallest step I can think of is better. I don’t know: I don’t do it often enough to find out. My thoughts are automatically a bit larger than minimal, as are most everyone’s, and I don’t always put in the effort to trim them down. Perhaps I’d do better if I did. Anyway … let me ask for the sum now, and I’m going to do it in the same test.

    def test_stats(self):
        stats = StatsMaker("pay")
        stats.value(1000)
        stats.value(2000)
        stats.value(3000)
        stats.value(4000)
        assert stats._count == 4
        assert stats._sum == 10000

And of course:

    def value(self, number):
        self._count += 1
        self._sum += number

Let’s take a big step now and have the thing create the set we want. Well, since we were thinking of smaller steps we might take, let’s do that test but then perhaps belay it and see if we can do a simpler one.

    def test_stats_result_set(self):
        stats = StatsMaker("pay")
        stats.value(1000)
        stats.value(2000)
        stats.value(3000)
        stats.value(4000)
        result = stats.statistics()
        assert result['pay_count'] == 4
        assert result['pay_sum'] == 10000
        assert result['pay_mean'] == 5000
Added in Post
The average is not 5000. This will soon nip us gently from behind.

Seriously, I think that’s small enough, although inside the maker we will produce some tuples to turn into a set. Let’s just do it.

    def statistics(self):
        return XSet.from_tuples(self.tuples())

Now that I’ve written that, is it better to test the tuples method directly? I think not. If you were here you might differ, and if you did, I’d go with your thinking. As things stand, I’ve only got my thinking, so here goes:

    def tuples(self):
        tups = []
        tups.append((self._count, self._name+'_count'))
        tups.append((self._sum, self._name+'_sum'))
        tups.append((self.mean(), self._name+'_mean'))
        return tups

    def mean(self):
        return self._sum / self._count

The test is not green, but I think I know why. 5000 does not equal 5000.0. Let’s see:

LOL!

Expected :5000
Actual   :2500.0

Nice work on my hand-calculating the mean. I divided 10000 by 2, not by 4. Good thing I have a computer here. Fix the test:

    def test_stats_result_set(self):
        stats = StatsMaker("pay")
        stats.value(1000)
        stats.value(2000)
        stats.value(3000)
        stats.value(4000)
        result = stats.statistics()
        assert result['pay_count'] == 4
        assert result['pay_sum'] == 10000
        assert result['pay_mean'] == 2500.0

Commit: StatsMaker makes XSet containing count, sum, and mean, properly named.
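
Earlier I wondered whether to test the tuples method directly. If you’d have gone that way, a sketch of such a test might look like this (hypothetical; I didn’t write it, and it reuses the values above):

    def test_tuples_directly(self):
        stats = StatsMaker("pay")
        for amount in (1000, 2000, 3000, 4000):
            stats.value(amount)
        # tuples() returns (value, name) pairs for count, sum, and mean.
        tuples = stats.tuples()
        assert (4, 'pay_count') in tuples
        assert (10000, 'pay_sum') in tuples
        assert (2500.0, 'pay_mean') in tuples

Testing through statistics covers the same ground one level up, which is why I let it be.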

Let’s move StatsMaker into the XSet file, as it is just a helper class. This could be a bad idea but I don’t think it is. We can move it again if we need to. No, wait: it’s already 20 lines long. It deserves its own file.

Commit: move StatsMaker to own file.

Now we should be able to use StatsMaker in our statistics method. Let’s review it:

    def statistics(self, fields):
        def count_field(field_name):
            return field_name+'_count'

        def sum_field(field_name):
            return field_name+'_sum'

        def mean_field(field_name):
            return field_name+'_mean'

        statistics = {}
        for field in fields:
            statistics[field] = {count_field(field): 0, sum_field(field): 0}
        for record, _s in self:
            for field in fields:
                entry = statistics[field]
                entry[count_field(field)] += 1
                entry[sum_field(field)] = entry[sum_field(field)] + record[field]
        new_tuples = []
        for field, values in statistics.items():
            new_tuples.append((values[count_field(field)], count_field(field)))
            new_tuples.append((values[sum_field(field)], sum_field(field)))
            new_tuples.append((values[sum_field(field)] / values[count_field(field)], mean_field(field)))
        new_items = XSet.from_tuples(new_tuples)
        key, _scope = self.pop()
        result = key | new_items
        return result

I don’t see a simple way to do this in a small series of steps. I rather hate to replace this many lines at once, but at least right now, I don’t see a smaller path. Maybe I’ll find one before I get it right.

Here goes.

I start with this much:

    def statistics(self, fields):
        statistics = {}
        for field in fields:
            statistics[field] = StatsMaker(field)

And I immediately see that I don’t like how StatsMaker works. It expects to be handed the value, but what we have here is a set of records. Roll back and enhance StatsMaker.

    def test_stats_given_record(self):
        stats = StatsMaker("pay")
        record = XSet.from_tuples(((1000, 'pay'),))
        stats.record(record)
        assert stats._count == 1
        assert stats._sum == 1000

Now we need a record method:

    def record(self, xset):
        self.value(xset[self._name])

That’s all there is to it, but I’m glad I did that test. Commit: StatsMaker fetches value from record.

I make a note: “StatsMaker: what if no such field?” We’ll come back to that, I fondly hope.

Should we fix it now? What should we do? OK, we have nothing else hanging; write another test:

    def test_stats_given_record_without_field(self):
        stats = StatsMaker("pay")
        record = XSet.from_tuples(((1000, 'pork'),))
        stats.record(record)
        assert stats._count == 0
        assert stats._sum == 0

This requires two steps. First, extract a variable:

    def record(self, xset):
        value = xset[self._name]
        self.value(value)

And then make the behavior conditional:

    def record(self, xset):
        value = xset[self._name]
        if value:
            self.value(value)

Green. Commit: StatsMaker ignores records that don’t carry its field. (I notice in passing that “if value:” would also skip a legitimate value of zero; if zero pay were possible, “if value is not None:” would be safer. Perhaps another one for the card.)

Cross item off card. I love when that happens.

OK, now that we feel all morally upright, back to statistics:

    def statistics(self, fields):
        statistics = {}
        for field in fields:
            statistics[field] = StatsMaker(field)
        for record, _s in self:
            for field in fields:
                entry = statistics[field]
                entry[count_field(field)] += 1
                entry[sum_field(field)] = entry[sum_field(field)] + record[field]
        new_tuples = []
        for field, values in statistics.items():
            new_tuples.append((values[count_field(field)], count_field(field)))
            new_tuples.append((values[sum_field(field)], sum_field(field)))
            new_tuples.append((values[sum_field(field)] / values[count_field(field)], mean_field(field)))
        new_items = XSet.from_tuples(new_tuples)
        key, _scope = self.pop()
        result = key | new_items
        return result

PyCharm is displaying a lot of red, because I removed those helper functions and such. I’ll just go thru and change each section.

Part way through, I decide I’ve lost the thread and that I should just make a list, not a dictionary, so I roll back again and start anew.

On the new pass, StatsMaker has difficulty importing XSet: the xset file imports StatsMaker, so a top-level import of XSet back into StatsMaker’s file would be circular. I’ll patch that by deferring the import until the method is called:

    def statistics(self):
        from xset import XSet
        return XSet.from_tuples(self.tuples())

And with that patched, the new statistics method:

    def statistics(self, fields):
        makers = []
        for field in fields:
            makers.append(StatsMaker(field))
        for record, _s in self:
            for maker in makers:
                maker.record(record)
        key, _scope = self.pop()
        for maker in makers:
            stats = maker.statistics()
            key = key | stats
        return key

We are green. Commit: statistics method uses StatsMaker.
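
Just to make the shape concrete, here’s a usage sketch (hypothetical data; the records are built the same way as in the tests above, and the scopes 1 and 2 are arbitrary):

    # Two records, each carrying a 'pay' field.
    record_1 = XSet.from_tuples(((1000, 'pay'),))
    record_2 = XSet.from_tuples(((3000, 'pay'),))
    records = XSet.from_tuples(((record_1, 1), (record_2, 2)))
    result = records.statistics(['pay'])
    assert result['pay_count'] == 2
    assert result['pay_sum'] == 4000
    assert result['pay_mean'] == 2000.0

Note the pop-and-union at the end of the method: the result also carries whatever fields the popped record had, of which more below.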

Reflection

We can ship this: it’s passing all our tests and it’s discernibly better. But …

If we were to inspect the result set carefully, we would find that, in addition to pay_count, pay_sum, and pay_mean, it also has pay, with some random value in it. We would probably prefer that not be the case.

I am irritated, somehow, by that list of makers. I think code roughly like that needs to be somewhere, but I am not comfortable with it here in the set operation. I think what we have in hand is not a StatsMaker so much as a StatisticsAccumulator, and that we want another object that manages a batch of StatisticsAccumulators.

We will save those for another time. They’ll require a few more tests and a bit more work. For now, the problem is tamed: the statistics generation is in the StatsMaker, and the field list management is in the statistics method. Close enough to ship. Which we have already done.
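
To pin the idea down without actually building it, here is a speculative sketch (all names invented, nothing here written or tested):

class StatisticsBatch:
    def __init__(self, fields):
        # One accumulator per requested field.
        # (Sketch assumes at least one field.)
        self._makers = [StatsMaker(field) for field in fields]

    def record(self, record):
        # Offer the record to every accumulator.
        for maker in self._makers:
            maker.record(record)

    def statistics(self):
        # Union all the per-field stat sets into one XSet.
        result = self._makers[0].statistics()
        for maker in self._makers[1:]:
            result = result | maker.statistics()
        return result

With something like that in hand, the set-level statistics method might shrink to a loop feeding records to one batch, plus the final pop-and-union.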

I am tempted, so tempted, to do the new object right here. But no, I’ve been at it for two hours of thinking, coding, and writing, and I haven’t even had my chai yet. Let’s sum up:

Summary

We had a complicated statistics method that did the job, using nested dictionaries and a lot of dictionary manipulation. We TDD’d a small object to handle accumulating statistics, started to plug it in, realized that we wanted it to be a bit smarter, rolled back and made it so.

In doing that, we realized that we need to deal with absent data, so we wrote another test and adjusted the StatsMaker.

We began again to plug in StatsMaker and lost the thread part way through. Another rollback. I am quite proud of myself for calling for two rollbacks in this situation. It’s so often the thing to do.

Aside
My colleagues have two different practices in this situation. Hill will git reset --hard, rolling back everything. But Bryan uses git stash, which, I gather, gets you back to the last commit but saves the existing work so that you can use it if you want to. I’ve not tried that. I notice that PyCharm has “shelve”, which is like stash but (I think) doesn’t touch Git at all. Bryan also uses PyCharm’s local history a lot.

Freely granting that I’ve not tried Bryan’s way, I think I prefer the rollback, as it seems to me to be less tentative, more of a decision. I’d probably be wise to try alternatives to get a better sense of what’s best. One possibility is that I might be more willing to stash. Sometimes I don’t want to roll back because I think that what I’ve got is almost good. I am almost certainly wrong about that, but still, trying things is how we learn.

Anyway, after the second rollback, we ran into yet another two-way import, fixed that, and our new version of statistics worked as intended.

A decent morning’s work, and it’s not even 0900 yet. I’ll go enjoy my chai, finish my morning banana, and perhaps even have a granola bar. What a luxurious life I lead, innit?

See you next time!