FAFO on GitHub

The Powers That Be have invited me to discuss the risks around this Python XST effort, on the assumption that it is anything other than play.

Hello, friends!

We’ve been working on XST in Python now for about 50 or 60 articles, since about February 4th. It’s March 12 right now, according to my computer, so not all that long. Long enough, though, to ask ourselves, “Selves, if we were exploring doing a database kind of product with this XST stuff, do we think we could do it, and what major risks remain?”

And I think we would answer, “Well, selves, we think the biggest remaining risk is large datasets. We can stream large data in from files, but after that all our data is in memory. So we could, in principle, exceed memory. Right now the system would just crash.”

The conversation would go on, and in fact we here in my head have had a lot of the conversation already, and we have some general and somewhat random thoughts about mitigating the concern.

  • Flat files are the only potentially large input we currently have. They’re basically the only external input we have at all.

  • Generally, the operations we perform make the amount of data we’re processing smaller, but there are potential exceptions, such as joining sets, that we should explore.

  • We do not know how large a set we could process in memory. Some experimentation may be needed.

  • While the system is capable, in principle, of creating arbitrarily nested and quite weird sets, in practice we will not be using it to do much of that, and in no case would we be likely to create a set that is both very large and very weird.

  • We could “easily” arrange to write large flat sets out to files. In principle, any set could be pickled and written to a file, but that is often slow and it’s never clear what’s going on.

  • Given a large set and some set operation to be done to it, it is probably always possible to split the large set into the disjoint union of a number of smaller sets, process the smaller sets, and then put the results back together. This might require a sort-merge kind of operation.

  • It would be straightforward to write a flat set out to a file at any time. And, of course, to process it later, like any other flat file. (There’s a small sketch of that just after this list.)
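
Just to make that last idea concrete, here’s a tiny sketch of what writing a flat set and streaming it back might look like. This isn’t the XSet code from our articles: the records here are plain Python dicts standing in for sets of records, and the function names are made up for the sketch.

```python
# A minimal sketch, not the project's actual XSet API. Records are plain
# dicts (field name -> value), standing in for flat sets of records.
import csv
from typing import Iterable, Iterator

def write_flat_set(records: Iterable[dict], path: str, fields: list[str]) -> None:
    """Write a flat set of records out as a CSV file, one record per row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for record in records:
            writer.writerow(record)

def read_flat_set(path: str) -> Iterator[dict]:
    """Stream the records back in one at a time, like any other flat file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)
```

Because the reader yields one record at a time, a flat set saved this way can be re-processed later without ever holding the whole thing in memory.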

Tentative Conclusions

  • We can probably (p > 0.9) do a reasonable relational-like application without concern over handling large sets, as far as capacity is concerned. We are less certain about speed, but it doesn’t seem very concerning either.

  • If we limit our large sets to flat sets where all the records have the same fixed-length fields, we can probably devise a way to partition the work on a large set into work on smaller sets, and then reconstitute the results.

  • We can and should limit the sets we create as part of the application to be sets of records (i.e., sets of sets, all of which have the same scopes) wherever possible, because such sets can surely be saved to files as necessary.

  • We should explore the limits of memory in Python, and learn how to measure memory usage and detect issues before it is too late. (A first sketch of that kind of measurement follows these conclusions.)

  • We should explore some sort, merge, union, split, recombine kinds of solutions, working out the set theory and operations needed to allow us to partition work if need be. (A second sketch below shows the rough shape of that.)

  • We should seriously consider picking an application to implement, so that we can show progress more visibly to whatever wealthy person or enterprise is backing this effort.
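
Here’s the kind of tiny experiment I have in mind for the memory question, using the standard-library tracemalloc module. The build_big_flat_set function is just a stand-in for whatever set-building we’d actually measure; it isn’t part of our code.

```python
# A sketch of a memory experiment using the standard-library tracemalloc
# module. build_big_flat_set is a placeholder, not part of the XST code.
import tracemalloc

def build_big_flat_set(n: int) -> list[dict]:
    """Stand-in for building a large flat set of records in memory."""
    return [{"id": i, "pay": i * 10, "dept": i % 7} for i in range(n)]

tracemalloc.start()
records = build_big_flat_set(1_000_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 1_000_000:.1f} MB, peak: {peak / 1_000_000:.1f} MB")
```

Something along these lines would also let us set a threshold and warn, or spill to a file, before memory actually runs out.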
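And here is roughly the shape I imagine for the split-process-recombine idea, at least for an operation that works record by record, like a selection. The names here (chunked, select_high_paid, process_in_chunks) are invented for the sketch; they’re not in our code.

```python
# A sketch of partitioning work on a large flat set into work on smaller
# disjoint chunks. The per-record operation (a selection) is deliberately
# simple; all names here are invented for the sketch.
from itertools import islice
from typing import Iterable, Iterator

def chunked(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Split a stream of records into disjoint chunks of at most `size` records."""
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk

def select_high_paid(records: list[dict]) -> list[dict]:
    """Stand-in for some record-by-record set operation."""
    return [r for r in records if r["pay"] > 1000]

def process_in_chunks(records: Iterable[dict], size: int = 10_000) -> list[dict]:
    """Process each small chunk, then put the results back together.
    A plain union suffices for a selection; an order-sensitive operation
    would want each chunk sorted and then merged, e.g. via heapq.merge."""
    result: list[dict] = []
    for chunk in chunked(records, size):
        result.extend(select_high_paid(chunk))
    return result
```

The code is the easy half. The interesting part is convincing ourselves that the operations we care about really do distribute over a disjoint union, and working out the sort-merge step for the ones that don’t.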

We’ll integrate these concerns into our work. At this writing, I think we’ll want to complete some partially done work, notably calculations and selections, before layering in anything additional. We’ll continue that later today in a separate article.