FAFO on GitHub

Our work on calculations raises some longer-term concerns about large data sets and statistics. Are we in big trouble?

Hello, friends!

With field calculations pretty much in hand, my thoughts this morning while trying not to get up went to large data sets, timing, and, inevitably, statistics like “which serfs are paid the least” and “what is the average salary of a grocer”. This leads to at least three problems that we’ve not solved, namely:

  1. How can we do a running calculation like the sum of the salaries;
  2. How can we then do a calculation on that sum to get things like average salary;
  3. How can we handle grouping a large set by one or more fields like job;

I said “at least”, and there are certainly some smaller unsolved issues inside these, and some that are peripheral but that we’ll probably bump against.

A set operation comes to mind. Not one whose name we know, but one that we can imagine. I’ll try to describe it in words; the math is surely too hard to type offhand:

Given a tuple of records (a set of records with scopes 1 through some N) and a subset of the scopes in each of those records (like ‘city’ and ‘state’), which we’ll call “the grouping scopes”, produce a new set with one record for each combination of grouping-scope values occurring in the original set; each new record contains that combination plus one additional element: the scopes (record numbers) of the original records that have that combination.

In short, for each group combination, the record numbers of the records with that group combination.
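A rough sketch in plain Python may make that concrete. Everything here is invented for illustration: the little people list, the grouping scope names, and the dictionary shape have nothing to do with our actual set classes.

    # Sketch only: group a few records by ('city', 'state'), collecting
    # the record numbers for each combination. Data and names are invented.
    people = [
        {"name": "hild", "city": "york", "state": "ns", "salary": 100},
        {"name": "aelf", "city": "york", "state": "ns", "salary": 120},
        {"name": "ceol", "city": "jorvik", "state": "ns", "salary": 90},
    ]
    grouping_scopes = ("city", "state")

    groups = {}  # each grouping combination -> set of record numbers having it
    for record_number, record in enumerate(people, start=1):
        key = tuple(record[scope] for scope in grouping_scopes)
        groups.setdefault(key, set()).add(record_number)

    # groups == {('york', 'ns'): {1, 2}, ('jorvik', 'ns'): {3}}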

It seems to me that it would be “relatively easy” to write a set operation that, given a set of records, produced elementary statistics on those records. It would just have to keep some sums of fields, sums of squares, whatever all goes into elementary statistics, and after finishing the sums, complete the stats calculation. The only example I can do without looking something up is that the average salary would be the sum of the salary across all records, divided by the number of records.

Strictly speaking, it should divide by the number of records that actually have a salary field. In our typical application, they’ll all have the same fields, but formally that’s not guaranteed. And even in a relational system some people might have a MISSING value for salary. But those are just details.
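For the record, here’s about all I mean by “elementary statistics”, again sketched in plain Python with invented data: keep a count, a sum, and a sum of squares for the scope, counting only the records that actually have it.

    # Sketch only: count, sum, and sum of squares for 'salary',
    # skipping any record that has no salary scope. Data is invented.
    records = [
        {"job": "serf", "salary": 100},
        {"job": "grocer", "salary": 200},
        {"job": "bard"},  # no salary at all
    ]

    count = total = total_of_squares = 0
    for record in records:
        if "salary" in record:
            value = record["salary"]
            count += 1
            total += value
            total_of_squares += value * value

    mean = total / count                                # 150.0
    variance = total_of_squares / count - mean * mean   # 2500.0 (population variance)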

If we were to use the grouping operation described above, we could presumably produce the group set, somehow, in one pass over the data. But to get the statistics for all groups, we would have to randomly read all the records of the large set. We’d only read each record once, but since the reads would bounce around, we’d probably page in the entire set for each statistic.

So we think of a more specialized statistics operation:

Given a tuple of records (a set of records with scopes 1 through some N) and a subset of the scopes in each of those records (like ‘city’ and ‘state’), which we’ll call “the grouping scopes”, produce a new set with one record for each combination of grouping-scope values occurring in the original set, containing all the necessary sums and counts to produce statistics for other scopes in the set.

In short, for each group combination, accumulate all the group statistics called for in some control object.

This operation could proceed in one pass through the large data set, and, however the grouping operation managed its groups, it would manage its statistics groups similarly. The resulting set would be much smaller than the grouping operation’s result as well.
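I imagine the inside of that operation would look something like this sketch, once more in plain Python with invented names: the “control object” here is nothing more than a tuple naming the scopes to be summarized, and the real thing would surely look different.

    # Sketch only: one pass over the data, accumulating a count and sum
    # per grouping combination, for each scope named in stat_scopes.
    people = [
        {"name": "hild", "city": "york", "state": "ns", "salary": 100},
        {"name": "aelf", "city": "york", "state": "ns", "salary": 120},
        {"name": "ceol", "city": "jorvik", "state": "ns", "salary": 90},
    ]
    grouping_scopes = ("city", "state")
    stat_scopes = ("salary",)  # the stand-in for the control object

    accumulators = {}  # grouping combination -> {scope: {"count": n, "sum": s}}
    for record in people:
        key = tuple(record[scope] for scope in grouping_scopes)
        stats = accumulators.setdefault(
            key, {scope: {"count": 0, "sum": 0} for scope in stat_scopes}
        )
        for scope in stat_scopes:
            if scope in record:
                stats[scope]["count"] += 1
                stats[scope]["sum"] += record[scope]

    for key, stats in accumulators.items():
        mean = stats["salary"]["sum"] / stats["salary"]["count"]
        print(key, mean)  # ('york', 'ns') 110.0 then ('jorvik', 'ns') 90.0

The point is just that the result has one tiny record per group, however many records the big set has.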

Tentative Conclusion

As much as I like the idea of grouping by just collecting record numbers, it seems likely that using similar grouping logic but accumulating actual statistics would be better in many cases, perhaps most cases. I don’t know, maybe all cases. I just work here, I don’t have all the answers.

Formally speaking, I am sure we could write out the membership condition for the grouped statistics set. Implementing its creation, however, might not be so direct. We can see roughly what’s needed. In the case of a small amount of data, we’d create a sort of dictionary whose keys are the grouping keys, and whose values are a record containing the various accumulated values.

And we could do that. To date, we create all our sets in memory, and if we were to run out of memory, the program would crash. And we can almost certainly process any reasonable amount of data in memory. I plan to do some experiments along those lines at some point. But we should, in principle, be working toward a system that is limited only by available file space, not internal memory. We would need that for persistent data in any case, presumably.

Architecture and Set Theory

My System Architecture Sorting Hat and my Set Theory Sorting Hat place me in the same house: Separate Storage Layer. We should have a layer, roughly the one we have now, that manipulates structures or sets that it can treat as if they are in memory, with all the speed and convenience that entails, and somehow, behind the scenes, we move things to and from file storage as needed, or as the upper level may occasionally request.

Childs would have it, of course, that all the layers of the system should be set-theoretic, and that the interface between them should be set-theoretic as well. And I’m not here to say he’s wrong, but I am definitely here to tell you that I don’t know quite how to do that. I am fairly confident, however, that we could build something sensible that paged large data structures in and out as needed.

Fairly confident. I’m not sure how we’d do that with a hashed sort of thing that would let us do our grouping, under conditions where the key table for the groups exceeded memory. We would probably have to build some kind of B-tree or something similar. There are probably other structures that we’d consider.

Enough Speculation

The real need for a storage layer of the kind we’ve talked about here is a long way off. We may never get there. I have no doubt whatsoever that we could process all the useful queries for a million-record database right here in memory on my laptop. If we ran into trouble, I’d have an excuse to upgrade to an M3 Studio when and if they come out. Probably we’ll try some larger files as time goes on. And we can leave processing the NSA database of every person on earth to them. They probably already have a secret M3 Studio that they use to process all the phone calls in the known universe, in real time. Maybe even two M3 Studios.

Summary

What we have here is an example, as closely as I can write it up, of the design thinking that a team does along the way as it produces software. We won’t spend days up front devising solutions to these problems that may never arise, but we will often pause for a little while (minutes to an hour or so) to think about the future and to see whether there are likely to be solutions for the issues that we think might come up. Once in a while, we might be concerned enough to do an experiment. More often, a bit of thinking will convince us that we’re not heading for the wall.

Then we get back to work. We’ll do that in the next article. This one was about thinking.

See you next time!