FAFO on GitHub

The swirl that is my life leads me to think about duplicates and such. Very speculative musings, no conclusion. Sort of Captain’s Log Stardate 77691.1.

You really don’t want to read this. Scan it if you’re really low on things to do.

Ingredients:

  1. A long email that I wrote explaining what I find important in programming Extended Set Theory.
  2. A long Slack exchange about whether MBA students are all evil.
  3. A test that gets the “right” answer but some excess info.
  4. An observation about what would happen if I fixed #3.
  5. Some odd stuff …

Let’s start with the easy parts. I wrote this new test for the statistics method:

    def test_whole_set(self):
        peeps = self.build_peeps()
        stats = peeps.statistics(['pay', 'bonus'])
        assert stats['pay_count'] == 8
        assert stats['pay_sum'] == 46200
        assert stats['pay_mean'] == 46200/8
        assert stats['bonus_count'] == 8
        assert stats['bonus_sum'] == 6930
        assert stats['bonus_mean'] == 6930/8

And yes, those are the right totals and means for our two-department, four-job, eight-person “database”.

The “interesting” thing about that test is what else is in the resulting stats set. You may recall that when we create our result set, we use an example record from the input, and we remove the detail items for the scopes for which we are creating statistics, in this case ‘pay’ and ‘bonus’. But there are other fields in the input, of course. The input records have ‘department’, ‘name’, ‘job’, ‘pay’, and ‘bonus’. So our output record includes a value for each of department, name, and job: the values in the first input record.
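Roughly, in plain Python dicts rather than the real XSet code, the mechanism looks something like this. The specific values in the example record are just illustrative:

    # A rough sketch of the mechanism, in plain dicts, not the real XSet code.
    # Field names match the test; the example record's values are illustrative.
    example = {'department': 'it', 'name': 'joe', 'job': 'serf', 'pay': 1000, 'bonus': 100}
    stats = {k: v for k, v in example.items() if k not in ('pay', 'bonus')}
    stats.update(pay_count=8, pay_sum=46200, pay_mean=46200 / 8,
                 bonus_count=8, bonus_sum=6930, bonus_mean=6930 / 8)
    # stats still carries 'department', 'name', and 'job' from the example record.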

I suppose these are harmless, but if someone were just to print the statistics record, they’d see some odd stuff:

    pay_sum: 46200
    name: joe
    job: serf
    pay_mean: 5775.0
    pay_count: 8
    bonus_mean: 866.25
    department: it
    bonus_sum: 6930
    bonus_count: 8

Now, in general, we can’t just remove everything but the statistics scopes, because, in general, we use the statistics on a set that has been grouped by a key field. Statistics by department, or by job.

After I realized those spurious values were in there, I wanted to have the test show how one “should” do this, but it turns out that I don’t know how to do that. I thought of projecting out the fields that we don’t want, so that the input set would only have pay and bonus. But that, too, is fraught. Why? Because projection removes duplicate records, by the relational definition of the operation. If we projected out everything but pay, for example, we’d only get four records, not eight:

    def test_project_whole_set(self):
        peeps = self.build_peeps()
        projector = XSet.classical_set(('pay',))
        pay_only = peeps.project(projector)
        assert len(pay_only) == 4

Why? Because there are only four unique values of pay. There happen to be eight combinations of pay and bonus, but in a larger database, those two would likely compress out.
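To make that concrete, here are the eight pay values, read off the expected report in the test further down; plain Python deduplication gives the same four:

    # The eight pay values, as they appear in the expected report below.
    pays = [10000, 11000, 1000, 1100, 1000, 1100, 10000, 11000]
    assert len(set(pays)) == 4  # only four distinct values survive projection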

Now we have an issue that needs resolving, namely whether and how to get rid of fields in statistics records that don’t really belong there, since clearly name and job don’t belong in the example above. We could say nothing about them, sort of don’t ask don’t tell, but that doesn’t sit well with me. However, my mind took me to a different place.

Allegedly interesting set theory fact

In set theory, { a } is equal to { a, a }. Why? Well, the real answer is almost certainly “because we defined it that way”, and the answer to “why” on that one is something like “it works out better that way”. Formally, it comes down to the fact that in set theory you can get an answer to “is a in the set”, but not to “how many copies of a are in the set”.
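Python’s built-in sets behave the same way, which makes the point easy to see:

    # Multiplicity is gone as soon as the set is built; only membership remains.
    assert {'a'} == {'a', 'a'}
    assert frozenset(['a']) == frozenset(['a', 'a'])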

That leads to some interesting notions. For example, consider the union of these two sets:

{ a, b, c } ∪ { c, d, e }

We expect that the result will be { a, b, c, d, e } but we get a perfectly good answer by just appending the two sets to each other:

{ a, b, c, c, d, e }

Now as it happens, our current code almost always creates a frozen set implementation of a result set, and that implementation happens to remove duplicates. It was part of what I liked about the implementation, in fact. But what if we did not, in general, remove duplicates? Set-theoretically it shouldn’t matter. I suspect that as one worked one would get more and more duplicates, but it might still be faster than removing them. And then, we might provide ways to force the removal of duplicates, at the edges of the system where a user application or a human user might expect the compact form.
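Here’s a rough sketch of the idea in plain Python, not the XSet code: union by concatenation, with duplicate removal deferred to the edge. The names are made up for illustration.

    from itertools import chain

    def lazy_union(xs, ys):
        # Just append; duplicates stay in, which is set-theoretically harmless.
        return tuple(chain(xs, ys))

    def compact(xs):
        # Crunch out duplicates only when someone at the edge asks for it.
        return frozenset(xs)

    both = lazy_union(('a', 'b', 'c'), ('c', 'd', 'e'))  # ('a', 'b', 'c', 'c', 'd', 'e')
    assert compact(both) == {'a', 'b', 'c', 'd', 'e'}
    # Membership questions get the same answer either way:
    assert all((x in both) == (x in compact(both)) for x in 'abcdef')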

And you may ask yourself, “Well, how did I get here?”, but if we didn’t have to crunch out the duplicates so often, we could probably write some super fast operations.

I think that topic deserves consideration. It doesn’t directly address our statistics question, because it’s pretty clear that if we have a set of people with age fields, we cannot get the average age by projecting out all the ages and then averaging them. We need to consider how many there are of each age. So my statistics concern remains unaddressed. I’ve put it on the list of things to think about.
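A tiny arithmetic example of why: with ages 20, 30, and 30, the true mean is 80/3 ≈ 26.7, but projection collapses the ages to {20, 30}, whose mean is 25.

    ages = [20, 30, 30]
    assert sum(ages) / len(ages) != sum(set(ages)) / len(set(ages))  # ~26.7 vs 25.0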

What about statistics?

Yes, well, we don’t have to answer all the questions. We have a very interesting group_by operation that partitions a set, and our supposed practice with statistics will be to do a suitable grouping and then apply the stats to each of the groups. Since the groups contain complete records from the original set, there is no loss of records due to set compression.

But we don’t really have very many tests for statistics yet, and I’m not willing to put a lot of money down saying that I am certain exactly what it does in all cases.

I think statistics is usable, but the main purpose of both statistics and group_by was to explore whether and how we can implement those capabilities. If we had not been able to figure out a way, we wouldn’t have a viable set library on our hands. As things stand, we know that we can very likely do what is needed. Whether what we have is truly convenient is another question … but it’s not bad.

Blue sky

One of our two tests for statistics is one that produces a report:

    def test_statistics(self):
        report_lines = []
        personnel = self.build_peeps()
        for dept in personnel.group_by('department'):
            report_lines.append(dept.name)
            for job in dept.values.group_by('job'):
                report_lines.append("    " + job.name)
                for pay in sorted([worker['pay'] for worker, scope in job.values]):
                    report_lines.append("        " + str(pay))
                stats = job.values.statistics(['pay', 'bonus'])  # just returns the record
                report_lines.append(f"        pay_mean: {stats['pay_mean']}  bonus_mean: {stats['bonus_mean']}")
        report = '\n'.join(report_lines)
        expected = ('it\n'
                    '    sdet\n'
                    '        10000\n'
                    '        11000\n'
                    '        pay_mean: 10500.0  bonus_mean: 1050.0\n'
                    '    serf\n'
                    '        1000\n'
                    '        1100\n'
                    '        pay_mean: 1050.0  bonus_mean: 105.0\n'
                    'sales\n'
                    '    closer\n'
                    '        1000\n'
                    '        1100\n'
                    '        pay_mean: 1050.0  bonus_mean: 210.0\n'
                    '    prospector\n'
                    '        10000\n'
                    '        11000\n'
                    '        pay_mean: 10500.0  bonus_mean: 2100.0')
        assert report == expected

There’s a general pattern to generating reports like this, and it seems to come down to loops within loops, one loop per level in the report. An amusing problem would be to develop a thing — maybe we would call it a report generator — that allows us to specify format information and summary requests for each level of a report, and then this report generator thing would just spin through the data, doing the looping or recursion or whatever it takes to get the report.

To do a good one would be tedious and full of detailed junk, at least if we base our expectations on all the so-called report generator programs that have ever existed. It would, however, almost certainly drive out some discoveries about the kinds of operators and structures we need in our XST library.
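Just to make the shape visible, here’s a very rough sketch, with everything in it invented for the purpose: a list of level specs, each naming a grouping key and an optional summary, and a recursive walk that indents as it descends. Plain Python dicts stand in for the real sets; none of this is the XSet API.

    # Toy report-generator sketch. The level specs and the group_records
    # helper are stand-ins for illustration, not the real XSet operations.
    def group_records(records, key):
        groups = {}
        for rec in records:
            groups.setdefault(rec[key], []).append(rec)
        return sorted(groups.items())

    def report(records, levels, indent=0):
        lines = []
        if not levels:
            return lines
        key, summarize = levels[0]
        for value, group in group_records(records, key):
            lines.append('    ' * indent + str(value))
            lines.extend(report(group, levels[1:], indent + 1))
            if summarize:
                lines.append('    ' * (indent + 1) + summarize(group))
        return lines

    # One spec per level; the innermost level carries the summary request.
    levels = [
        ('department', None),
        ('job', lambda g: f"pay_mean: {sum(r['pay'] for r in g) / len(g)}"),
    ]
    # print('\n'.join(report(people_as_dicts, levels)))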

Even bluer sky

Much of my focus here has been on offering fully-capable extended sets and operations that can support any extended set, no matter how nested and curly it might be. I think we actually have that.

But there is a lesson lurking in group_by. I worked for a few days trying to work out a nested set structure for grouping, which I envisioned as a set with two elements, the first one being a group key and value, and the second being all the records from the original set with that particular key value. A grouping of an input set would produce a set of those two-element sets, with an element for each key value. And, at least sometimes, I was thinking that the sets inside might be sub-groups …

I tried multiple times to write tests that expressed how this was going to work and it was Just Too Damn Hard. I couldn’t keep it in my head and my existing set creation methods weren’t up to creating anything that complicated.

I got the clever idea of producing a Python object with two fields, the name, containing the particular key value of this grouping, and the values, containing all the records with that key value.

And we create that grouping in an easy way: we pass over the data once to get all the key values. Then, for each key value, we pass over the data again, selecting the records that match that value. So if you were some giant corporation with hundreds of departments, our group_by operator is going to make hundreds of passes over all the tens of thousands of employee records you have.
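In plain Python, with stand-ins for the real classes, the shape of it is roughly this:

    from dataclasses import dataclass

    # Stand-ins for illustration, not the real XSet / GroupHolder code.
    @dataclass
    class GroupHolder:
        name: object   # the key value for this group, e.g. 'it'
        values: list   # all the records whose key field has that value

    def group_by(records, key):
        # Pass one: collect the distinct key values.
        keys = []
        for rec in records:
            if rec[key] not in keys:
                keys.append(rec[key])
        # Then one more full pass over the data for each key value.
        return [GroupHolder(k, [rec for rec in records if rec[key] == k])
                for k in keys]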

One might argue that that’s terribly inefficient, and one would be right. Surely the right way to do the thing would be more like we do the statistics, where we do one pass over the data and get everything we need from a record the one and only time we see it.
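For contrast, a single pass is easy in plain Python with a dict; what’s not obvious is how to say it with set operations, which is the real question:

    # Single-pass grouping sketch, plain Python, not a set-theoretic answer.
    # Reuses the GroupHolder stand-in from the sketch above.
    def group_by_one_pass(records, key):
        groups = {}
        for rec in records:                      # each record is seen exactly once
            groups.setdefault(rec[key], []).append(rec)
        return [GroupHolder(k, v) for k, v in groups.items()]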

But my point, and I do have one, is that the multiple-pass way of doing the grouping makes sense to me as a developer, because I could encompass the problem and solution, and it provides an object, not a set, that is easier to use than the set would be.

Decades back, Dave Childs wrote a program, STDS, that presented a very simple data model, implemented in a very simple and very fast way, essentially treating all data as an array of arrays of characters. The thing was very hard to use, because you had to figure out how to express the problem in those terms, but it flew.

What we have in hand at the moment is some very capable set operations, but they are not always easy to use, and they are more capable than we really need. Now that’s not necessarily bad: mathematical generality can lead to very simple expression of solutions. But here … that’s not always happening. Some solutions are really hard to express … and some, given the limitations of my brain … are nearly impossible.

Maybe … maybe … maybe there’s a better balance than what we have here. Maybe we need more task-specific objects like the GroupHolder, that make it easier to use these operations, with the actual set operations hidden inside those objects. Is that part of the job of the team implementing the set operations? Or is it the job of the team implementing some application that uses set operations? Well, it’s a joint effort, I think.

It will take collaboration between the two layers. The group_by is a perfect example, in that it provides a view of the data that is easy to understand and easy to create … but that requires multiple passes over the input sets. Imagine that I were to create a big data set and do some grouping on it. It might be slow.

I would come to myself with my applications hat on and say, self, this is slow. And with my set theory hat on, I’d look at the code and say “look, you’re selecting from the same set a thousand times, what did you expect?” That would not be helpful, because with my application programmer hat on, first of all this is a very reasonable way to express what I want, and second, I don’t know a better way to do it. (And neither does the guy with the set theory hat, we happen to know, because he tried and failed to do it in one pass.)

Working together, we might devise a new approach, and that approach might well require some new set operations that would do the job in the single pass that is possible. Or, it might just require a combination of set and application thinking, but no new operations.

No real conclusion. I think we’re exposing too much set theory in some cases, and I think that we might do better with a different underlying model. Even if we do allow for arbitrary sets, maybe we could keep simple sets in a simple form and only create our slower forms when absolutely needed.

But truth is … we don’t even know for sure that those forms are too slow. After all, I did create and process a billion-record set in memory. Didn’t even hurt that badly.

I hope you didn’t read this. Some of these are just me, thinking and designing and scheming.