FAFO on GitHub

In which your intrepid author considers times now and times then, assessing whether and how to adjust what we do with XST given the new reality.

Hello, friends! I hope you are enjoying the new summer time as much as I am, or that you will if and when your clocks change or the actual season arrives. Here, we have snow. Last week it hit 70°F (21°C). At least the climate isn’t changing: it’d be scary if it were.

Over the course of these Extended Set Theory articles, I’ve been in occasional email touch with Dave Childs, the creator of the theory. It was invented with rapid access to data in mind, with a requirement that the access methods be capable of formal definition via a re-based definition of set theory. I do not personally believe that that much formality was necessary, but would freely grant that it does provide a very elegant and powerful way to express structures and operations that goes far beyond “relational”, which was the fashion at the time.

In those days, and when you try an’ tell the young people this today, they won’t believe you, computers only had 32,000 or 64,000 words of memory, maybe 256K bytes. Really big ones might have 8 megabytes. Today, my very old wristwatch has 768 megabytes of main memory and 16 gigabytes of solid state file storage. And it can make phone calls.

My laptop, one of the smallest that Apple makes, and two generations behind, has 16 gigabytes of main memory and just shy of one terabyte of file storage.

It seems to me fair to say that whatever XST might have been good for back in the last century, the game has changed enough that its utility now is surely different, if it is of value at all. I do think it has value, and in this article I propose to review that value and to see whether we should take a direction different from the one we’re on and from the one I’ve been contemplating.

From what I can glean from Dave and my reading, much of the speed of the XST systems of old came from their ability to rapidly process data at the byte level, combined with use of set theory to determine (mostly mentally) which bytes to thrash around. To over-simplify, it’s much faster to restrict a set given a byte string to compare, rather than fetching fields from records and comparing them.
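
Just to give the flavor of that, here is a tiny sketch using Python bytes. It is purely my own invention, not anything from the original systems, and the record layout is hypothetical: with fixed-width records packed into one buffer, a restriction can compare a raw byte slice in place, with no field decoding at all.

    # Illustrative sketch only: hypothetical fixed-width layout, name[6] dept[4] pay[5].
    records = b"ALICE DEV 01000BOBBY OPS 01500CAROL DEV 01200"
    record_len, dept_at = 15, 6

    # Restrict to the DEV records by comparing bytes directly.
    matches = [
        records[i:i + record_len]
        for i in range(0, len(records), record_len)
        if records[i + dept_at:i + dept_at + 4] == b"DEV "
    ]
    print(matches)  # the two DEV records, found by raw byte comparison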

I believe, but have not been able to verify, that the XST systems of the day were also very clever about paging data in and out from files, and, unless I miss my guess, did some very clever merging operations as well. The details of all these arcane manipulations are lost now, in the mists of time.

But it doesn’t matter.

I have been thinking of working out how to do fancy manipulations at the byte level, exploring Python capabilities like bytes and bytearray, and I grant that it might be fun to do, but I am typing this on a computer that would have absolutely no qualms about reading a million records into memory and sorting them. We’ll probably do that one of these days just to see how long it takes. You can be sure that back in ‘62 when I did it in a top secret room, deep under ground, full of magnetic tape drives, it took longer. Damn impressive to watch, though, I’ll tell you that. But I digress.

Purpose of the XST Exercise

Well. I’m made a bit nervous by such a challenging question. But there are some things I’d like to (continue to) get out of this exercise:

Software Development Practices
Overall, I do all these articles simply to show how I work, how I evolve a program rather than design it all up front and then somehow execute that plan, and to show the reader what happens when I work that way. Along the way, we learn and relearn and relearn the lessons of small steps, up-front testing, frequently running our tests, small commits, and so on.
Expression of Problem and Solution
I am interested in XST in particular because its formal nature lets us define powerful operations for transforming data, with each operation being quite simply expressed, and with the operations capable of being combined in a large number of ways to produce the results we desire.

Just as numbers can be combined to calculate any value, sets can be combined to produce answers to the questions we may have of our data. I enjoy discovering how nicely this works.

Opportunities for Discovery
As we do this, we encounter opportunities to learn and explore things about Python, and about programming in general. We’ve worked with generators, co-routines, and the deep innards of Python, learning what to do—and what not to do—as we go.
Keeps My Brain Alive
I freely grant that I have this daily ritual because it keeps my mind fresh and keeps the demons from taking up residence in my head. I have always loved programming, ever since it discovered me, and I find it just as fascinating now as I did over six decades ago.

I do this because it is fun.

Goals?

Some specific goals that we might explore do come to mind.

More Data Flow
The operators we have currently generally create actual sets. If we restrict one set, we get another set. Its elements and scopes are actually produced and are resident in the computer. When we solve a real problem with XST, the solution is likely to take a number of set operations. We’ll select these elements, restrict against that set, calculate these values, join with these, and so on.

Most of these operations could be streamed, processed one or a few elements at a time, so that intermediate result sets might never need to be represented in memory. Instead, individual elements would be passed along a sort of “pipeline” of operations. Such an approach would consume less memory, allowing our program to process larger sets, and probably to do it faster than we could if we materialized all the intermediates.
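
To make the flow idea a bit more concrete, here is a minimal sketch using plain dictionaries and ordinary generator functions. The select and project below are hypothetical stand-ins, not the real XSet operations, and the data is made up.

    # Each stage is a generator yielding one record at a time, so no
    # intermediate collection is ever built.
    def select(records, predicate):
        for record in records:
            if predicate(record):
                yield record

    def project(records, fields):
        for record in records:
            yield {f: record[f] for f in fields}

    employees = [
        {"name": "ron", "dept": "dev", "pay": 1000},
        {"name": "chet", "dept": "dev", "pay": 1200},
        {"name": "sue", "dept": "ops", "pay": 1500},
    ]

    # Chained generators form a pipeline; records flow through one by one.
    pipeline = project(select(employees, lambda r: r["dept"] == "dev"), ["name", "pay"])
    for row in pipeline:
        print(row)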

Optimization
Suppose we were to join two sets and then group them by some fields and then calculate some statistics. It would likely be more efficient to project out only the necessary fields from the two input sets, join those, then group and calculate. It might even be the case that grouping before the join would be useful: it would depend on the problem and the arrangement of the input sets.

It would be interesting to find a representation of a series of desired operations that our software could analyze and optimize. I freely grant that I have almost no idea how this might be done, although I do know some buzzwords that I could use if it would help. This might be great fun.
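
If I had to guess at a shape for such a representation, it might be a little tree of operation nodes that an optimizer could inspect and rewrite, pushing a projection below a join, say. All the names below are hypothetical, just a sketch of the idea, not a design:

    from dataclasses import dataclass

    # Each operation is a node holding its inputs, so a whole query is a
    # small tree that analysis code could walk and rearrange.
    @dataclass
    class Source:
        name: str

    @dataclass
    class Join:
        left: object
        right: object
        on: str

    @dataclass
    class Project:
        fields: list
        input: object

    # "join employees to departments, then keep only two fields"
    plan = Project(["name", "dept_name"],
                   Join(Source("employees"), Source("departments"), on="dept_no"))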

Higher-Level Interface
As the program stands, I could use it to write a specialized program to solve some problem. I’d have to write a main program, I suppose, that did whatever was necessary. It would be interesting to provide a user interface of some kind that could be used to create solutions to a wider range of problems.

Options include, but are not limited to, Query By Example, Relational Algebra, SQL, and surely others. These would require me to learn how to build user interfaces with Python, something I have been carefully avoiding.

Byte-Oriented Operations
Even though the need for these seems to me to be much smaller, given today’s computers, the problem itself is fascinating: given a set of well-defined byte-thrashing operations, take a set-theoretic or even higher-level expression of a solution, and compile it down to the byte-thrashers.

Could be quite interesting, and it would even be fun to find out what, if any, effect it had on performance.

Immediate Direction

I think we’ll look around at our list of things to do, to see which ones may really be desirable, and then move as quickly as we can toward an implementation that works more around data flow, essentially processing small numbers of elements rather than creating entire intermediate sets. So, roughly this:

  1. Find simple clean-up or highly desirable changes that are needed now.
  2. Work toward data flow.
  3. … through 99. Then: PROFIT!

Some Code

I have a bit of time left this morning, so let’s take a glance at the list and see what we might program.

Here is a surprisingly tricky one:

  • Lex underbar ala total_pay

Our expressions do not allow field names like total_pay, because this is our lex:

    def lex(self):
        rx = '([^a-zA-Z0-9.])'
        split = re.split(rx, self._expr)
        return [item for item in split if item and item != ' ']

We treat everything that is not alphanumeric or . as an operator. We can add underbar to this list, which will allow names like total_pay. It will also allow numbers like 123_456. Of course our lexer already allows names like total.pay, which will work but isn’t what we had in mind. I think we’ll make the change, make sure nothing breaks, and then make a new note saying we might want a better lexer at some point.
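
The change itself should be just one more character in the character class, something like this (the exact committed code may differ slightly):

    def lex(self):
        rx = '([^a-zA-Z0-9._])'  # underbar no longer splits, so it stays inside names
        split = re.split(rx, self._expr)
        return [item for item in split if item and item != ' ']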

Done. I change a test to use an underbar in one of its names. Green. Commit: lexer allows underbar in names (and numbers). New note made: Need improved Lexer someday.

Here’s one:

  • re_scope use comprehension? (Uses SetBuilder. Scope_Set to dictionary?)

Here’s the method:

class XSet:
    def re_scope(self, other) -> Self:
        try:
            return self.implementation.re_scope(other)
        except AttributeError:
            return self.generic_re_scope(other)

    def generic_re_scope(self, other):
        sb = SetBuilder()
        for e, s in self:
            for old, new in other:
                if old == s:
                    sb.put(e, new)
        return sb.set()

OK, first of all, no, we cannot readily use a dictionary to speed up this operation: the scope set can have more than one re-scoping for the same old scope. We could re-scope x to x1 and to x2, so a simple dictionary is off the table. (A more complex one, leading from old name to a list of new names, might be ok.) I think we’ll defer the dictionary idea for now.
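
For the record, the deferred idea might look something like the sketch below: map each old scope to a list of new scopes, so that we look up each element’s scope once instead of scanning the re-scoping set every time. The method name is hypothetical, and it assumes scopes are hashable.

    def re_scope_via_dict(self, other):
        # Collect all the re-scopings: old scope -> list of new scopes.
        mapping = {}
        for old, new in other:
            mapping.setdefault(old, []).append(new)
        # One dictionary lookup per element instead of a scan of `other`.
        tuples = ((e, new) for e, s in self for new in mapping.get(s, ()))
        return XSet.from_tuples(tuples)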

The idea of SetBuilder came up a while back because creating tests was such a pain. The above code is the only place in production where it is used and it’s not really helping much there.

SetBuilder does build up the entire set as a collection, so if we instead used a comprehension there would be no additional memory hit. Can we write such a complicated one, and if we do, could we ever understand it again?

    def generic_re_scope(self, other):
        tuples = ((e, new) for e, s in self for old, new in other if old == s)
        return XSet.from_tuples(tuples)

Commit: drop use of SetBuilder in generic rescope, use comprehension.

I think that’s better. Now we need a new note: Should SetBuilder be a maker tool only?

I think it’s time to wrap up: I hear breakfast noises in the kitchen.

Summary

We’ve reviewed our situation, revised our direction just a bit, and improved a bit of code. A pleasant morning.

I find myself wishing for a better sense of when to use a generator comprehension and when to use a list comprehension. I am not as clear on the differences and issues as I would like to be.
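
For what it’s worth, here is the nub of the difference as I understand it, in a tiny sketch of my own: a list comprehension builds the whole list right away and can be traversed repeatedly, while a generator expression produces its items lazily and is exhausted after one pass. That matters in code like generic_re_scope above, where other is iterated once per element of self, so it presumably needs to be something re-iterable, not a spent generator.

    squares_list = [n * n for n in range(5)]   # built immediately, reusable
    squares_gen = (n * n for n in range(5))    # built lazily, one pass only

    print(sum(squares_list), sum(squares_list))  # 30 30
    print(sum(squares_gen), sum(squares_gen))    # 30 0 -- the second pass finds nothing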

See you next time!