FAFO on GitHub

Yesterday’s experiment with group_by seemed very successful. What does it tell us about what we’re doing and what we should be doing? Bit of a retrospective.

Hello, friends!

Yesterday, we got this simple but evocative report:

it
     sdet
         10000
         11000
     serf
         1000
         1100
sales
     closer
         1000
         1100
     prospector
         10000
         11000

The code to create the report looks like this:

class XSet:
    def test_group_by(self):
        print()
        personnel = self.build_peeps()
        for dept in personnel.group_by('department'):
            print(dept.name)
            for job in dept.values.group_by('job'):
                print("    ", job.name)
                for pay in sorted([worker['pay'] for worker, scope in job.values]):
                    print("        ", pay)

We’ll convert that to a real test in a moment. YT was “done working” yesterday by the time that worked.

Our group_by method is not a set operation in the strict sense. It is a method on the XSet class, but it does not produce a set as its result. Instead it returns a list of instances of this tiny class:

class GroupHolder:
    def __init__(self, name, values):
        self.name = name
        self.values = values

The name value in the return is the value of the group_by parameter, which is a scope, such as department. We create and return a GroupHolder for each unique element with that scope in the input set. The values member in the GroupHolder is a set, namely the set of all the elements of the source set that have that particular element under the input scope.

For every department x, we get a GroupHolder with name x and values being the set of all elements of the set with department x.

I think it very likely that this method would do strange things if we were to give it a source set some of whose elements had more than one department. In the fullness of time, we should test that and perhaps do something about it. It might almost work, although it would surely not show more than one name in the name field.

Our first mission this morning, is to reflect on this result, in the light of a few sessions trying to build a set type, XGroup, that could do this sort of thing.

Reflection

Let me just put it right out there: in my attempts to build a set containing much the same information as we have in our list of GroupHolders, I confused myself regularly. It was one ball more than I could juggle, thinking about a thing that had to be a set of sets, each inner set containing two other sets, one with names and the other with values, consisting itself of a set or records each of which was a set …

I found it very difficult to write tests for the thing, for two reasons. First, as mentioned above, it was hard to figure out what I wanted. But second, creating tests with the desired characteristics was very difficult. We do not have facilities for creating nested sets. Based on my difficulty understanding and visualizing them, that may be a good thing.

I think this experience is pointing to a big chunk of learning. Something like this:

Possibly, while simple set operations are easy to understand, we should strictly limit our use of complicated set structures. The operations need to work on any set they encounter, but we need not, and perhaps should not, provide operations that facilitate building highly nested sets.

The XFlat and XFlatFile sets include a symbol table, consisting of a field name, and a starting and ending character index. The XFlat set delivers element-scope pairs by iterating the symbol table and slicing out the characters.

A strict view of things might specify that the symbol table would itself be a set, and that the XFlat would work by iterating that set, pulling out its name, start, end fields,and doing the slice. We do not do that. Instead, a class method on XFlat creates a simple list of tuples, and the record is iterated by processing that list and those tuples. No set operations are involved.

What set operations might one use to do that job? We don’t really want to go there, but we could imagine a two-level design with relations at the top and sets of flat records at the bottom. Then slicing would look like some kind of scope-restrict down at the byte level. We could imagine a system that translated the relational notion of a field value into a series of set operations at the byte level that did the slicing and returned the result to the relational level. Probably not as good an idea as I once thought it was.

Let’s see if we can define what we’re doing here, beyond having fun.

We are building a library in Python, based in Extended Set Theory, that provides a collection of set storage mechanisms, set operations, and other operations, that would be useful for building relational-like and similar information systems.

As such, I think it makes sense that we would provide operations like the group_by, which does use a couple of set operations internally, but which presents an object, not a set, that is useful for hierarchically processing a flat set. The object does contain a set, curated to contain just the records that are just those in the “sub-tree” selected by the object.

What about XGroup?

I think XGroup turns out to be a dead end. It provided some learning, some of it painful, about set and test creation, but the object itself, I think, is not going to be useful. Therefore, our upcoming tasks should include removing it.

Let’s do a little work.

First, I’ll convert the final report test into an actual test:

    def test_group_by(self):
        print()
        personnel = self.build_peeps()
        for dept in personnel.group_by('department'):
            print(dept.name)
            for job in dept.values.group_by('job'):
                print("    ", job.name)
                for pay in sorted([worker['pay'] for worker, scope in job.values]):
                    print("        ", pay)
        assert False

At the cost of a red-bar, that lets me view the result of creating the report, but it’s not really on to save a red-bar test. So we’ll change it, as we have others, to an approval test:

    def test_group_by(self):
        report_lines = []
        personnel = self.build_peeps()
        for dept in personnel.group_by('department'):
            report_lines.append(dept.name)
            for job in dept.values.group_by('job'):
                report_lines.append("    " + job.name)
                for pay in sorted([worker['pay'] for worker, scope in job.values]):
                    report_lines.append("        " + str(pay))
        report = '\n'.join(report_lines)
        expected = """it
    sdet
        10000
        11000
    serf
        1000
        1100
sales
    closer
        1000
        1100
    prospector
        10000
        11000"""
        assert report == expected

We are green. Commit: convert manual check to test.

There are a number of tests in the test_group file that are building up to creating the test shown here. I’ll preserve those but remove all the ones that create XGroup instances.

There was only a test or two that actually created the XGroup. The others are various attempts and manipulating sets to get reports, and I think they’re worth saving for history. Maybe. Anyway we’ll save them today. I’ve removed the XGroup ones and the class, which was still inside the test file. Green. Commit: clean up test, remove XGroup experimental class, thanking it for its service.

The file is even free of PyCharm’s irritating little squiggles about this and that nicety. One of my few: I freely grant that I am not a fanatic about PyCharm’s warnings. So far my choice has not harmed me as far as I know. The usual complaints are failing to space after a comma, and occasionally shadowing a name.

Summary

So, where are we? I think we have adjusted our path just a bit. We are going to allow the creation of useful methods and objects that are not sets, but that are valuable to an applications programmer. We are going to try to be more sensitive to places where the sets are not helping us. This may lead to new operations like group_by, or perhaps to additional set operations that makes things more convenient.

I need to be ever more sensitive to things seeming difficult: it is almost always a sign that I’m on the wrong path. I suffer a bit from target fascination and would do well to hold on a bit more loosely to a starting idea.

We would do well to note that the new group_by operation was tested by using it to create a report, and then comparing the string to what it should be. It would have been nice to have had smaller tests. I think that the new report was my focus, since I had sketched it the session before I started on it, but some micro-tests would have probably been better. So we need to focus on smaller tests.

Finally, it would really help to have some kind of application in mind for this thing. At the moment, I do not have one, which means that my choices of operations to work on is very speculative.

We’ve made some interesting discoveries and, in particular, allowing non-set operations as part of our work should pay off in delivering useful results sooner.

See you next time!