Group Summary

FAFO on GitHub

Commonly, when we group data by some criteria, we desire summary information about the groups: counts, totals, averages, and so on. We take a couple of nice steps before my brain is fried.

Hello, friends!

I only have a few minutes before I must head out for an appointment in the “real world”, but let’s at least get started on thinking about, and doing, summary information. Since we have a tiny personnel database, with department and job, leading to individuals with pay amounts, we’ll use that to get the information we need.

Conveniently, I hope, our new group_by method, given an input set and a “key” name, returns a series of subsets of the input set, each with one value of the key. All the records from department A, then B, and so on. At least I hope it’s convenient. If it isn’t, well, we will surely have to do something similar.

Note that group_by does not produce a set: it produces an object containing a specific key value, called ‘name’ for some reason, and a set of records, ‘values’, which are all the records from the input set which contain that key value. In our tests so far, we iterate that set, producing reports like this:

        expected = ('Dept: it\n'
                    '    Job: sdet\n'
                    '        3: ole sdet: 10000\n'
                    '        4: sam sdet: 11000\n'
                    '    Job: serf\n'
                    '        1: joe serf: 1000\n'
                    '        2: bob serf: 1100\n'
                    'Dept: sales\n'
                    '    Job: closer\n'
                    '        5: ben closer: 1000\n'
                    '        6: cee closer: 1100\n'
                    '    Job: prospector\n'
                    '        7: don prospector: 10000\n'
                    '        8: fay prospector: 11000')

Let’s see what we might do to create a report that includes, oh, let’s say average pay, as a final line in each department. We’ll write a new test.

    def test_average_by(self):
        report_lines = []
        personnel = self.build_peeps()
        for dept in personnel.group_by('department'):
            report_lines.append(dept.name)
            for job in dept.values.group_by('job'):
                report_lines.append("    " + job.name)
                count = 0
                sum = 0
                for pay in sorted([worker['pay'] for worker, scope in job.values]):
                    count += 1
                    sum += pay
                    report_lines.append("        " + str(pay))
                report_lines.append(f"        avg: {sum/count}")
        report = '\n'.join(report_lines)
        expected = ("it\n"
                    "    sdet\n"
                    "        10000\n"
                    "        11000\n"
                    "    serf\n"
                    "        1000\n"
                    "        1100\n"
                    "sales\n"
                    "    closer\n"
                    "        1000\n"
                    "        1100\n"
                    "    prospector\n"
                    "        10000\n"
                    "        11000")
        assert report == expected

I haven’t changed the expected yet, and I get this result:

('it\n'
 '    sdet\n'
 '        10000\n'
 '        11000\n'
 '        avg: 10500.0\n'
 '    serf\n'
 '        1000\n'
 '        1100\n'
 '        avg: 1050.0\n'
 'sales\n'
 '    closer\n'
 '        1000\n'
 '        1100\n'
 '        avg: 1050.0\n'
 '    prospector\n'
 '        10000\n'
 '        11000\n'
 '        avg: 10500.0') != ('it\n'
 '    sdet\n'
 '        10000\n'
 '        11000\n'
 '    serf\n'
 '        1000\n'
 '        1100\n'
 'sales\n'
 '    closer\n'
 '        1000\n'
 '        1100\n'
 '    prospector\n'
 '        10000\n'
 '        11000')

So that worked just as I had hoped.

Reflection

I was concerned about the summing and dividing, because I didn’t force pay to numeric, although I happen to know that, in the particular set we have, it is numeric. Even so, it is an integer. So we’ll need some kind of gentle coercion for our summary fields, since they may well be stored as strings, as they are in flat files, just for one example.

Another concern is that we almost certainly would like a summary set, not just the ability to create a report. I’m envisioning that our control input will be a list of scopes (field names) and that, given say, ‘pay’, we’ll create a set with ‘pay_count’, ‘pay_sum’, ‘pay_mean’, and so on.

But I must go. See you later …

And I’m back …

I’ve modified the test to have the right expectation, so we are green. Commit: test experiment computes average pay.

Now What?

What if we had a set operation, something like statistics, that, sent to a set of elements containing the mentioned scopes, and a list of scopes (should it be a set? Seems inconvenient, but possibly). The operation returns a set containing a single element containing all the element-scope pairs of the input set, plus additional elements, scoped name_count, name_mean, and so on, for each name in the input scope list.

Let’s write a new test assuming that operation exists:

    def test_statistics(self):
        report_lines = []
        personnel = self.build_peeps()
        for dept in personnel.group_by('department'):
            report_lines.append(dept.name)
            for job in dept.values.group_by('job'):
                report_lines.append("    " + job.name)
                for pay in sorted([worker['pay'] for worker, scope in job.values]):
                    report_lines.append("        " + str(pay))
                stats = job.values.statistics(['pay'])  # just returns the record
                report_lines.append(f"        avg: {stats['pay_mean']}")
        report = '\n'.join(report_lines)
        ...

Writing the test tells me that we probably want the record back, not a set containing the single record. We’ll go with that for now. Test fails, for want of statistics.

I think we’ll try “fake it til you make it” here, to make passing this test easier.

    def statistics(self, fields):
        pay_count = 0
        pay_sum = 0
        for e, s in self:
            pay_count += 1
            pay_sum += e['pay']
        pay_mean = pay_sum/pay_count
        new_items = XSet.from_tuples(((pay_count, 'pay_count'), (pay_sum, 'pay_sum'), (pay_mean, 'pay_mean')))
        key, _scope = self.pop()
        result = key | new_items
        return result

That works. It took a little longer than it might have, for two reasons:

First, I forgot that pop returns a tuple, and it took me a long time to recognize that I was getting a tuple because I had asked for it. Second, I initialized the count to 1, which resulted in dividing by 3 instead of 2. It took me a while to spot that as well.

But we are green. Commit: fake it version of statistics. Now we’d better reflect.

Reflection

My plan is straightforward: we’ll make a dictionary for each name in the name list, summarize into the dictionary, and unwind it to produce the output record.

My concern is whether the fumbling above was telling me anything. If it’s just that I am prone to small mistakes, well, that’s true. I suspect also that 11 lines of code is a lot to be writing, and if twenty percent of them are slightly wrong that’s about average. (Wow, but could be true. That’s why we write tests. We all inject errors at some non-zero rate. I may be particularly good at it, is all.)

I think I’ll just be careful, but I’ll stay alert for more indications that I may need a rest or something. Let’s go for the dictionary. It would be nice to have a test that got another statistic. Let’s add a bonus field to each record. Then I’ll output those stats as well, and the test will break, showing us whether we got it right or not.

I gave the it department ten percent, and sales twenty. Darn sales team, always talk management into things. We’ll see the result in the output.

I plan to write the code out longhand, but it will need refactoring. If I get confused, I’ll do some extracts right away.

    def statistics(self, fields):
        pay_count = 0
        pay_sum = 0
        dict = {}
        for field in fields:
            dict[field] = { field+'_count': 0, field+'_sum': 0}
        for e, s in self:
            for field in fields:
                value = e[field]
                count_name = field+'_count'
                sum_name = field+'_sum'
                entry = dict[field]
                entry[count_name] = entry[count_name] +1
                entry[sum_name] = entry[sum_name] + value
        pay_mean = dict['pay']['pay_sum'] / dict['pay']['pay_count']
        new_items = XSet.from_tuples(((pay_count, 'pay_count'), (pay_sum, 'pay_sum'), (pay_mean, 'pay_mean')))
        key, _scope = self.pop()
        result = key | new_items
        return result

The test is green. My dictionary, hideous though it is, is working. I should have changed the test but I’ll do that in a moment. Let’s unwind the dictionary into the record. Just a bit more coding, I’m sure …

OK, I am fried. But I have this test:

    def test_statistics(self):
        report_lines = []
        personnel = self.build_peeps()
        for dept in personnel.group_by('department'):
            report_lines.append(dept.name)
            for job in dept.values.group_by('job'):
                report_lines.append("    " + job.name)
                for pay in sorted([worker['pay'] for worker, scope in job.values]):
                    report_lines.append("        " + str(pay))
                stats = job.values.statistics(['pay', 'bonus'])  # just returns the record
                report_lines.append(f"        pay_mean: {stats['pay_mean']}  bonus_mean: {stats['bonus_mean']}")
        report = '\n'.join(report_lines)
        ...

And this code (ptui!):

    def statistics(self, fields):
        statistics = {}
        for field in fields:
            statistics[field] = { field+'_count': 0, field+'_sum': 0}
        for e, s in self:
            for field in fields:
                value = e[field]
                count_name = field+'_count'
                sum_name = field+'_sum'
                entry = statistics[field]
                entry[count_name] = entry[count_name] +1
                entry[sum_name] = entry[sum_name] + value
        new_tuples = []
        for field_name, field_values in statistics.items():
            for name, value in field_values.items():
                new_tuples.append((value, name))
            sum_name = field_name+"_sum"
            count_name = field_name+"_count"
            mean_name = field_name+"_mean"
            new_tuples.append((field_values[sum_name] / field_values[count_name], mean_name))
        new_items = XSet.from_tuples(new_tuples)
        key, _scope = self.pop()
        result = key | new_items
        return result

And this result:

('it\n'
 '    sdet\n'
 '        10000\n'
 '        11000\n'
 '        pay_mean: 10500.0  bonus_mean: 1050.0\n'
 '    serf\n'
 '        1000\n'
 '        1100\n'
 '        pay_mean: 1050.0  bonus_mean: 105.0\n'
 'sales\n'
 '    closer\n'
 '        1000\n'
 '        1100\n'
 '        pay_mean: 1050.0  bonus_mean: 210.0\n'
 '    prospector\n'
 '        10000\n'
 '        11000\n'
 '        pay_mean: 10500.0  bonus_mean: 2100.0')

And that is what I intended. Time for a break, so I’ll wrap the article here, and clean up the broken test with the correct answer later.

Summary

The dictionary of dictionaries thing works, but it’s not convenient. I suspect that we would benefit from a little object of our own to contain statistics and return vales and names in a convenient form, like tuples or a set. StatisticsSetBuilder, perhaps? Don’t know, will deal with it when fresh.

The code is horrible, and possibly I should just toss it and do again, but I’ll decide about that later. I can probably refactor it from here. Anyway, time for a rest just now, not for hard decisions.

This has gone well. The statistics method looks to be useful in generating a report. I’m not sure if there is some kind of an overall set of records containing keys and statistics, but it might be a good thing to have. The main issue that I have in thinking about it is that I don’t know what someone might want. Another issue is that a typical report has subtotals, sub-subtotals, totals, and so on. That seems to imply a very nested set if we were to build a set, and having just burned my mental fingers on trying to think about nested sets, I’m not gonna go there.

For now, a break. See you next time!