Regroup

FAFO on GitHub

I’m here for another step on the way to the grouping operator. I have at least one new realization going in. More than one coming out. Some code progress as well.

Hello, friends!

I’ve been trying to build a new kind of XSet implementation, XGroup. I arrive here realizing that the set I’m building is far more complicated—I call it curly—than those that have come before. And as I write this, I begin to question whether a new XImplementation is really what’s needed. Next thing you know I’ll be questioning my own existence.

The basic idea, I keep telling myself, is simple. When we have a large set including some key values, like, say, “department”, and some data values like “pay”, we often want to view or summarize the data based on the keys. Show me everyone by department. Show me total pay by department. That sort of thing.

I’ve had in mind an operation that produces a moderately complicated set. Whatever the keys are, the result set will include a set for each existing combination of keys. That set will contain two sets, named “keys” and “values”. The keys set will be the actual values of the keys, scoped by the name of the key. The values set will be all the elements of the source set that have those specific key values, each element scoped by its scope in the source set.

The result set groups the records of the input set into groups with the same keys.

It seems simple enough. However, I am coming to suspect that it is not, in fact, simple enough. If it were simple enough, I would be further along than I am. Reading the last few articles about grouping, I note that I spent rather more time confused than I’d like to spend. Let’s belay the XGroup idea for a bit and work with some sets and operations that might be useful.

The ‘project’ operator can get the key combinations that we need. Let me demonstrate using a new test.

    def test_group_keys_via_project(self):
        peeps = self.build_peeps()
        fields = XSet.classical_set(("department", "job"))
        keys = peeps.project(fields)
        print()
        lines = []
        for key_set, scope in keys:
            record = []
            for key,value in key_set:
                record.append(f'{value}^{key}')
            lines.append(','.join(sorted(record)))
        print('\n'.join(sorted(lines)))
        assert False

Often with these things, I like to print a result to be sure I’m getting what I need: it’s not easy, and certainly not convenient to write tests for the result of a set operation. The above test prints this:

department^it,job^sdet
department^it,job^serf
department^sales,job^closer
department^sales,job^prospector

It took me a few tweaks to get that little report to look the way I wanted it with two (2) sorts, one to get the individual key lines in the right order, and one to get all the combinations in the right order. Why? Because sets do not have any inherent order.

We’ve learned something important here: at least one user, name of Ron, wants the results of grouping to come out in a particular order, in this case all the departments in alpha order and then all the jobs within the departments, also in alpha order. And, of course, we have also verified what we expected, namely that the project operation actually creates the sets of key values that we need.

Now it also turns out that we can get the records that go with a given set of those key values. The ‘restrict’ operator can do that. Let’s do another test right now, to get both ideas concrete.

Some grumbling occurs. Here’s a test and its result:

    def test_restrict_returns_group_records(self):
        peeps = self.build_peeps()
        fields = XSet.classical_set(("department", "job"))
        keys = peeps.project(fields)
        key_set, _scope = next(iter(keys))
        key_set_set = XSet.classical_set([key_set])
        print(key_set)
        folx = peeps.restrict(key_set_set)
        print(folx)
        assert False

Result:

XSet(frozenset({('closer', 'job'), ('sales', 'department')}))
XSet(frozenset({
    (XSet(frozenset({(1100, 'pay'), ('closer', 'job'), ('sales', 'department')})), 6), 
    (XSet(frozenset({(1000, 'pay'), ('closer', 'job'), ('sales', 'department')})), 5)}))

The good news is that, yes, we can do a restrict with one of the elements of our projected set and indeed fetch the records we need.

There is bad news as well. Let’s make a list.

Set printouts are not easy to read. We might do well to come up with better printing. If we do, think about large sets.
Restrict properly expects a set of restriction “records”, each with some keys and values. In our situation, we only have one such record at a time in mind, so we had to wrap that record with that second classical_set line, which was inconvenient and far from obvious. (It took me three or four tries to work out what it wanted.)
The test doesn’t test anything.

I’ll try to make it a test.

    def test_restrict_returns_group_records(self):
        peeps = self.build_peeps()
        fields = XSet.classical_set(("department", "job"))
        keys = peeps.project(fields)
        key_set, _scope = next(iter(keys))
        key_set_set = XSet.classical_set([key_set])
        folx = peeps.restrict(key_set_set)
        assert len(folx) == 2
        department = key_set['department']
        job = key_set['job']
        for peep, _scope in folx:
            assert peep['department'] == department
            assert peep['job'] == job

That’s not bad. We check to ensure that whatever the department and job were in the key_set, we got the same values in the result set. And, of course, by construction we know that we should have two records in each group.

Let me do some renaming there.

    def test_restrict_returns_group_records(self):
        personnel = self.build_peeps()
        fields = XSet.classical_set(("department", "job"))
        groups = personnel.project(fields)
        group_keys, _scope = groups.pop()
        search_set = XSet.classical_set([group_keys])
        selected_people = personnel.restrict(search_set)
        assert len(selected_people) == 2
        department = group_keys['department']
        job = group_keys['job']
        for person, _scope in selected_people:
            assert person['department'] == department
            assert person['job'] == job

Some better names, and I used the pop method to return an arbitrary element. Let’s see if we can make the other test into an actual test, so we can commit these tests and see where we stand. Here’s what we have:

    def test_group_keys_via_project(self):
        peeps = self.build_peeps()
        fields = XSet.classical_set(("department", "job"))
        keys = peeps.project(fields)
        print()
        lines = []
        for key_set, scope in keys:
            record = []
            for key,value in key_set:
                record.append(f'{value}^{key}')
            lines.append(','.join(sorted(record)))
        print('\n'.join(sorted(lines)))
        assert False

Heck, why not make an approval test of this?

    def test_group_keys_via_project(self):
        peeps = self.build_peeps()
        fields = XSet.classical_set(("department", "job"))
        keys = peeps.project(fields)
        print()
        lines = []
        for key_set, scope in keys:
            record = []
            for key,value in key_set:
                record.append(f'{value}^{key}')
            lines.append(','.join(sorted(record)))
        report = '\n'.join(sorted(lines))
        expected = \
"""department^it,job^sdet
department^it,job^serf
department^sales,job^closer
department^sales,job^prospector"""
        assert report == expected

That passes. Commit: adding tests moving toward grouping.

Reflection

OK, solid result here: the key-value sets we need are those provided by projecting the input set on the field names we want in the groups, and the individual groups of records are just the restriction of the input set by the key-value set we want. Now the idea of th XGroup was to create all the things we needed in one scan of the set, putting the key-value sets into a dictionary, with values being the corresponding records. So far, my attempts to produce that result have failed.

I think the failures are all similar to the issue in the tests above: the items coming back from one part of the operation aren’t quite what the other parts expect. Mostly a matter of expecting another layer of set wrapping.

And there are the remaining issues of the inscrutability of the set prints, and the general difficulty of writing tests that are easy to write and understand.

Imagine what a full test would be in place of what we have here:

    def test_restrict_returns_group_records(self):
        personnel = self.build_peeps()
        fields = XSet.classical_set(("department", "job"))
        groups = personnel.project(fields)
        group_keys, _scope = groups.pop()
        search_set = XSet.classical_set([group_keys])
        selected_people = personnel.restrict(search_set)
        assert len(selected_people) == 2
        department = group_keys['department']
        job = group_keys['job']
        for person, _scope in selected_people:
            assert person['department'] == department
            assert person['job'] == job

A test might want to say something like this:

There should be four groups in the set groups.
Each group should include a different pair of department and job, for department sales or it and job either sdet or serf or closer or producer, with sdet and serf paired with it and the others with sales.
For each of those groups, the restrict should produce records that match the key field in the group.
Every record should appear in exactly one restrict.

I think we can at least do the check for all the records in the groups set. Let’s do that.

    def test_restrict_returns_group_records(self):
        personnel = self.build_peeps()
        fields = XSet.classical_set(("department", "job"))
        groups = personnel.project(fields)
        for group_keys, _scope in groups:
            search_set = XSet.classical_set([group_keys])
            selected_people = personnel.restrict(search_set)
            assert len(selected_people) == 2
            department = group_keys['department']
            job = group_keys['job']
            for person, _scope in selected_people:
                assert person['department'] == department
                assert person['job'] == job

Basically same test, just looping over all the groups. Commit: improve test.

Reflect More Generally

Where are we, where do we want to go next? Let’s think about where we want to go and then assess where we are in the light of that.

Whap upside my head: Huh. You think you know what you want and then realize that you don’t really know. Let me mumble a bit here.

We have some input set with a bunch of key fields and value fields. We want to “group” the set by some of the key fields and then process it, perhaps printing it in that order, perhaps processing it further to produce summary information.

Is it possible that what we want is something very nested, like this:

{
    { department=it 
        { job=sdet {
            {pay=10000}
            {pay = 11000}
            }
        },
        { job=serf {
            {pay=1000}
            {pay=1100}
            }
        }
    } ...
}

I hope not: I’m not even sure I got the braces right there. But seriously, we generally do produce indented reports like that, so we’ll at least want the groups sorted together. Nesting them guarantees that the right things are together but it is a bear to unwrap. We begin to see why report generators are so complicated.

Let’s write a test that produces a report, given what we have now for operations. It might not be “efficient”, but we won’t worry about that just yet.

I have good news and bad news, bits of each. The main good news is that I have this report:

Dept: it
    Job: sdet
        3: sdet: 10000
        4: sdet: 11000
    Job: serf
        2: serf: 1100
        1: serf: 1000
Dept: sales
    Job: closer
        6: closer: 1100
        5: closer: 1000
    Job: prospector
        7: prospector: 10000
        8: prospector: 11000

The main bad news is that this is the test that produces it:

    def test_build_report(self):
        def details(rec, scope):
            return f"        {scope}: {rec['job']}: {rec['pay']}"

        print()
        personnel = self.build_peeps()
        get_departments = XSet.classical_set(("department",))
        departments = personnel.project(get_departments)
        department_names = []
        for dept_rec, _dept in departments:
            department_names.append(dept_rec['department'])
        department_names = sorted(department_names)
        report_lines = []
        for dept in department_names:
            report_lines.append("Dept: " + dept)
            dept_rec = XSet.from_tuples([(dept, 'department')])
            dept_set = XSet.classical_set([dept_rec])
            dept_recs = personnel.restrict(dept_set)
            job_names = set()
            for rec, scope in dept_recs:
                job = rec['job']
                job_names.add(job)
            job_names = sorted(job_names)
            for job in job_names:
                report_lines.append("    Job: " + job)
                job_rec = XSet.from_tuples([(job, 'job')])
                job_set = XSet.classical_set([job_rec])
                job_recs = dept_recs.restrict(job_set)
                for rec, scope in job_recs:
                    report_lines.append(details(rec, scope))
        print()
        print('\n'.join(report_lines))
        assert False

More good news is that that’s not quite as awful as it may appear at first glance. in particular, the code for each summary level (department and job) is basically the same, so we can clearly build something not quite so nasty with a bit of refactoring.

A bit of bad news is that we had to extract the names to sort them and then loop over the sorted names, to get the report in a sensible order.

Further bad news is the odd two-step process to get from a summary record to a suitable restriction set, where first we make a set of tuples and then make a classical set of that. We need to look into that, to see whether restrict is expecting something too nested, or whether we need some better support for making useful restriction sets.

Let me state clearly that I cannot visualize what’s going on there and that it took me far too many tries to get it right. The set operations are supposed to make things easier. When they don’t, something probably needs improvement.

I just tried to convert that report to an approval test but it turns out that the final report lines do not come out in alpha order. A SMOP later:

    def test_build_report(self):
        def details(rec, scope):
            return f"        {scope}: {rec['job']}: {rec['pay']}"

        print()
        personnel = self.build_peeps()
        get_departments = XSet.classical_set(("department",))
        departments = personnel.project(get_departments)
        department_names = []
        for dept_rec, _dept in departments:
            department_names.append(dept_rec['department'])
        department_names = sorted(department_names)
        report_lines = []
        for dept in department_names:
            report_lines.append("Dept: " + dept)
            dept_rec = XSet.from_tuples([(dept, 'department')])
            dept_set = XSet.classical_set([dept_rec])
            dept_recs = personnel.restrict(dept_set)
            job_names = set()
            for rec, scope in dept_recs:
                job = rec['job']
                job_names.add(job)
            job_names = sorted(job_names)
            for job in job_names:
                report_lines.append("    Job: " + job)
                job_rec = XSet.from_tuples([(job, 'job')])
                job_set = XSet.classical_set([job_rec])
                job_recs = dept_recs.restrict(job_set)
                local_lines = sorted([details(rec, scope) for rec, scope in job_recs])
                report_lines.extend(local_lines)
        report = '\n'.join(report_lines)
        expected = """Dept: it
    Job: sdet
        3: sdet: 10000
        4: sdet: 11000
    Job: serf
        1: serf: 1000
        2: serf: 1100
Dept: sales
    Job: closer
        5: closer: 1000
        6: closer: 1100
    Job: prospector
        7: prospector: 10000
        8: prospector: 11000"""
        assert report == expected

We are green. Time to commit this test and reflect, and probably sum up. Commit: tests include a generated report to find out how to do it.

Reflection

This idea seemed simple a few days ago, but the real issues that the idea is supposed to address are much more complex than I imagined. The result of a mis-match between the problem and a supposed solution is quite often confusion in the mind of the developer, reflected in slow or zero progress, as well as a general look of confusion. I think this is particularly problematical when one is working alone. If I were working with someone else, it would almost certainly have turned into a conversation about what the heck we’re doing here, and an earlier realization that we had to step back.

Not that it took a terribly long time for me even alone, but a reasonably impatient pair would have stopped us, asking to be brought up to speed on my thinking … and that would have quickly demonstrated that my thinking was lacking in important detail.

We have some small discoveries here that may add up to something:

Grouping is not enough: we’ll really need sorting as well.
Sorting is not a solved problem in our system: we have no way of representing a sorted set, certainly not one that matches up with set theory.
The weird step between a perfectly good element of the projected set and the set we need to restrict out the corresponding records is far from obvious.
There might be room for a convenience operation that, given a set, returns a (sorted?) list of a particular field’s values. (I’m thinking of that two-step process where I got the department or job names just to sort and loop over them.)
I wonder what we could do about reporting. One possibility is a little report generator, but a more useful thing might be a general nested or recursive looping facility. Needs thinking.

Summary

I believe that all the thinking I might have done before these few days of coding would not have given me the understanding that I now have. I think I’d have had more concrete notions of what should be done … and those notions would not very well fit the real needs of users of our operations. I think we’d have encountered similar difficulties, and we’d have been less prepared to change direction because of all the commitment we’d feel to our thinking.

I find it better to turn to code early in my design process, because it keeps me in touch with reality, whether that’s something small like the two-steps needed to get from what I have to what I want, or the larger realizations about the nature of the problem and therefore the nature of the solution we need.

We have a few tests running, including one that produces a rudimentary report such as we might need. We’re still in the woods, but we are coming closer to being out of them, and when we emerge, I think it’ll be with something even better than what we came in here to find.

Learning is good. Today is good.

Next time, maybe we’ll play further with what a reporting program might want to look like. I’m thinking something like:

for dept in personnel.grouped_by('department')
    print(dept.name)
    for job in dept.values.grouped_by('job')
        print("    ", job.name)
        for rec in job.values.sorted_by('job')
            print("        ", rec.name, rec.pay)

I have no idea—well only a small idea—how that might be done. But we might be able to get close.

See you next time!