FAFO on GitHub

This morning I’m working toward a join operator, by way of a grouping operator. I have vague ideas, and need some help from my code.

Hello, friends!

Naively, we join two sets on some common scope values, with the resulting set consisting of all the matching pairs. We’ll refine the notion later on if all goes well, but almost certainly not today.

We would like to do this operation at least somewhat efficiently, so it would be good to have some rapidly accessible way, given the key from a record of one set, to quickly find the matching records from the other set. One famous way to do that is to sort both sets on the key values and do a merge. For some reason, I don’t want to do that, and I want to try a different notion, starting from grouping.

We have occasions to “group” a set based on some key scopes. We might group personnel by department, to get count and total pay by department. That sort of thing. One famous way to do that operation is also to sort the set on the keys and then stream over it, summing up.

The sort-merge style of operation has the advantage that there is no need to ever have the contents of a large set in memory. File sorts are “well known”, and the merge operation can stream from the input files to an output file while only considering one record at a time.

Ever since I put a billion record set in memory on my laptop, I am less interested in the sort-merge style than I might otherwise be. So I’m thinking of a grouping operation based on a Python dictionary.

The rough idea will be to build a group operator that produces a set, based on an input set, that iterates the set in key order, producing all the elements with one key value, then the element with the next, and so on

As I think about it now in the clear dark of sun not up yet, I think we might come up with another “virtual” XImplementation class, an XGroup, with the desired behavior. From the viewpoint of someone reading that set, it will just look as if the original set is sorted by the key. Which, now that we think of it, would mean that it would be amenable to a merge kind of operation to get a join. (We’ll have to refresh ourselves on the various types of joins in due time.)

Let’s create a new test class and start some experimenting.

OK, I’m not proud of this but in my defense, it’s 6 AM and I am not a clever man. It does run:


class TestGroup:
    def test_group(self):
        j1 = SetBuilder()\
            .put("jeffries","last")\
            .put("ron", "first")\
            .set()
        j2 = SetBuilder()\
            .put("jeffries","last")\
            .put("tom", "first")\
            .set()
        h1 = SetBuilder()\
            .put("hendrickson","last")\
            .put("chet", "first")\
            .set()
        h2 = SetBuilder()\
            .put("hendrickson","last")\
            .put("sue", "first")\
            .set()
        peeps = XSet.n_tuple((j1, j2, h1, h2))
        group = dict()
        for e,s in peeps:
            last = e["last"]
            folx = group.get(last, list())
            folx.append((e,s))
            group[last] = folx
        report = "\n"
        for last in group:
            report += last + "\n"
            for peep, s in group[last]:
                report += "    " + peep["first"] + "\n"
        expected = """
jeffries
    ron
    tom
hendrickson
    chet
    sue
"""
        assert report == expected

The dictionary in question, group, is a plain vanilla dictionary, with string keys (not tuples) pointing to a list of records (not a set of records). I did that test to get a sense of how to build up the dictionary.

Now let’s think about what we might really want. In general, we will group by a small number of scopes. Since we’re doing sets, that should probably be a set of scopes, although internally we can do what we like. But we will need the key and value, not just the value as I used above.

The dictionary needs its keys (which will be a set of key-value pairs for us) to be hashable. We can use a set of tuples for that, or we could use an actual XSet containing just those key-value (scope-element) tuples.

Even writing the simple test above, I stumbled over the question of whether I had a set in hand or something else, so we’ll do well to try to be consistent and I think that means using sets throughout. Let’s devise another test and work in that direction.

I honestly do not see a good way to test this. Here’s the next bad way:

    def test_two_keys(self):
        e1 = SetBuilder()\
            .put("it", "department")\
            .put("serf", "job")\
            .put(1000, "pay")\
            .set()
        e2 = SetBuilder()\
            .put("it", "department")\
            .put("serf", "job")\
            .put(1100, "pay")\
            .set()
        e3 = SetBuilder()\
            .put("it", "department")\
            .put("sdet", "job")\
            .put(10000, "pay")\
            .set()
        e4 = SetBuilder()\
            .put("it", "department")\
            .put("sdet", "job")\
            .put(11000, "pay")\
            .set()
        e5 = SetBuilder()\
            .put("sales", "department")\
            .put("closer", "job")\
            .put(1000, "pay")\
            .set()
        e6 = SetBuilder()\
            .put("sales", "department")\
            .put("closer", "job")\
            .put(1100, "pay")\
            .set()
        e7 = SetBuilder()\
            .put("sales", "department")\
            .put("prospector", "job")\
            .put(10000, "pay")\
            .set()
        e8 = SetBuilder()\
            .put("sales", "department")\
            .put("prospector", "job")\
            .put(11000, "pay")\
            .set()
        peeps = XSet.n_tuple((e1,e2,e3,e4,e5,e6,e7,e8))
        scopes = SetBuilder()\
            .put("department", "department")\
            .put("job", "job")\
            .set()
        groups = {}
        for person, s in peeps:
            keys = person.re_scope(scopes)
            records = groups.get(keys, list())
            records.append((person, s))
            groups[keys] = records
        print()
        for k in groups:
            print("keys", k)
            items = groups[k]
            for e,s in items:
                print("    rec", s, e)
        assert False

When I look at the test failure, I see what I expect:

keys XSet(frozenset({('serf', 'job'), ('it', 'department')}))
    rec 1 XSet(frozenset({(1000, 'pay'), ('serf', 'job'), ('it', 'department')}))
    rec 2 XSet(frozenset({(1100, 'pay'), ('serf', 'job'), ('it', 'department')}))
keys XSet(frozenset({('sdet', 'job'), ('it', 'department')}))
    rec 3 XSet(frozenset({(10000, 'pay'), ('sdet', 'job'), ('it', 'department')}))
    rec 4 XSet(frozenset({(11000, 'pay'), ('sdet', 'job'), ('it', 'department')}))
keys XSet(frozenset({('sales', 'department'), ('closer', 'job')}))
    rec 5 XSet(frozenset({('sales', 'department'), (1000, 'pay'), ('closer', 'job')}))
    rec 6 XSet(frozenset({(1100, 'pay'), ('sales', 'department'), ('closer', 'job')}))
keys XSet(frozenset({('prospector', 'job'), ('sales', 'department')}))
    rec 7 XSet(frozenset({('prospector', 'job'), ('sales', 'department'), (10000, 'pay')}))
    rec 8 XSet(frozenset({(11000, 'pay'), ('prospector', 'job'), ('sales', 'department')}))

The good parts of that include making the key set with re-scope, which is just what that operation is good for, and the fact that the loop that builds the groups dictionary is straightforward.

The bad part is that I don’t really see a good way to test the result. I suppose I could do an approval test, just copying the repr for the set but that seems pretty weak.

And setup was pretty awful with those 8 SetBuilder things. There’s a lot of duplication in there that we could get rid of if we were smarter than we (I) are (am).

But in terms of learning what to do, we’re doing well. Yes, our airplane is built out of an orange crate but it does fly.

When I built that test, I had in mind someone who wanted to get total salary by job within department. The set we have isn’t really the right one for that, is it? We really want to group by department and then by job within that. And for that query we want summing, not record accumulation.

But right now, we’re just thinking about grouping, because we think it might lead us to join, as well as being useful on its own.

Suppose we did want to be grouped by department and job within department, something like this:

it
    serf 
        1000
        1100
    sdet 
        10000
        11000

sales
    closer 
        1000
        1100
    prospector 
        10000
        11000

Hm, closers should make more than prospectors. We definitely have a problem in sales. But I digress. How would we build such a thing?

Proceeding outside to in, we could get a group for it with all the it people in it and a group for sales with all the sales people. Then pass over that, grouping the inner sets? and if there were three levels, or six?

I think we sort of have to do that. We’ll burn that boat when we come to it. For now, I’ve got an idea and I plan to go back to bed.

Later that day …
I’m back after a lovely surprisingly long additional sleep. Unfortunately I don’t have a brilliant idea for how to put reasonable assertions into that last test.

I’ll start by cleaning up the test a bit, extracting the construction of the peeps, and using defaultdict:

    def test_two_keys(self):
        peeps = self.build_peeps()
        scopes = SetBuilder()\
            .put("department", "department")\
            .put("job", "job")\
            .set()
        group_dictionary = defaultdict(list)
        for person, scope in peeps:
            keys = person.re_scope(scopes)
            group_dictionary[keys].append((person, scope))
        print()
        for k in group_dictionary:
            print("keys", k)
            items = group_dictionary[k]
            for e,scope in items:
                print("    rec", scope, e)
        assert False

I think what I need to do is to write down what I wish I could say and then see if Python and I can come to some compromise.

    def test_two_keys(self):
        peeps = self.build_peeps()
        scopes = SetBuilder()\
            .put("department", "department")\
            .put("job", "job")\
            .set()
        group_dictionary = defaultdict(list)
        for person, scope in peeps:
            keys = person.re_scope(scopes)
            group_dictionary[keys].append((person, scope))
        it_serfs = find(group_dictionary, 'it', 'serf')
        assert len(it_serfs) == 2
        for peep, s in it_serfs:
            assert peep['job'] == 'serf'

That might do the job. We need find:

def find(group_dictionary, department, job):
    for key_set in group_dictionary:
        d = key_set['department']
        j = key_set['job']
        if d == department and j == job:
            return group_dictionary[key_set]
    return None

The test passes. I am not surprised. Let’s commit these two tests, then improve them a bit more. Commit: initial tests toward group operator.

PyCharm made an external function for me. I’ll bring it inside the class: I just prefer that.

    def test_two_keys(self):
        peeps = self.build_peeps()
        scopes = SetBuilder()\
            .put("department", "department")\
            .put("job", "job")\
            .set()
        group_dictionary = defaultdict(list)
        for person, scope in peeps:
            keys = person.re_scope(scopes)
            group_dictionary[keys].append((person, scope))
        it_serfs = self.find(group_dictionary, 'it', 'serf')
        assert len(it_serfs) == 2
        for peep, s in it_serfs:
            assert peep['job'] == 'serf'

    def find(self, group_dictionary, department, job):
        for key_set in group_dictionary:
            d = key_set['department']
            j = key_set['job']
            if d == department and j == job:
                return group_dictionary[key_set]
        return None

PyCharm objects to the name find. Oh. Could be static. Yes, it could. I don’t care.

Reflection

I’m assuming that we’ll build a new kind of XImplementation, an XGroup, that contains a dictionary like the one we’ve built here, with its keys being XSets of group key scopes and elements, and the values of the dictionary being a list of records with those key scopes and values.

A good question, however, is “if XGroup represents an XSet, what is the form of that set?” I can think of two reasonable possibilities:

  1. It might just be the original set with the curious property that it produces its elements grouped.

  2. It might actually be a grouped set, with each set of key values associated with an inner XSet containing the original set elements with those key values.

That’s hard to describe. Ignoring scopes, it might look like this:

{
    { it, serf {
            { it, serf, 1000 },
            { it, serf, 1100 }
        }
    },
    { it, sdet {
            { it, sdet, 10000 },
            { it, sdet, 11000 }
        }
    },
    { sales, closer {
            { sales, closer, 1000 },
            { sales, closer, 1100 }
        }
    },
    { sales, prospector {
            { sales, prospector, 10000 },
            { sales, prospector, 11000}
        }
    }  
}

There are two reasons to prefer the second idea. First, there is generally no provision in the system (yet) for producing a set in a predictable order, and we should provide that thoughtfully. Second, if we implement the first idea, code that uses the group set would have to keep track of the current key set and notice when it changes.

In the second idea, the keys still come out in arbitrary order and then you can fetch the records and iterate them as you wish. The main objections to #2 are that it’s the first time we’ve created a set that’s nested that way, and it’s complicated to think about.

I think we’re going to go with the nested set idea. And we should sum up here: the article is too long to read already.

Summary

With a few tests and some refactoring and some thinking, we have what seems like a good steep toward dealing with grouping sets. We’ll probably have to use it a bit to be sure we like how it works.

We’ll start building the XGroup itself in the next article, unless I spot a squirrel.

See you then!