XST 3: Why, and What?

Last night I understood how to do something with XST that I’ve not in the past been able to do. So let’s talk about why XST is interesting and what one might do with it.

Extended Set Theory (XST) is a formal mathematical definition of set theory that makes it more capable of modeling data information systems. Since people have been writing data information systems for decades, one might wonder why we need XST.

Truth is, we don’t need it, any more than we need Python. There’s always FORTRAN if we didn’t have Python. But Python is a better tool, for some things, than is FORTRAN. Similarly, XST purports to be a better tool for defining database kinds of systems.

For example, if we have a huge number of large records, we’re going to want to build some indexes. We do that in a careful but ad-hoc fashion, say by extracting the index field into a table containing the field and the corresponding record number in the big table. Then, when we need some small number of records, we search the index, get the record numbers, and use those to fetch just the records we need.

It’s all more or less straightforward and it has been done for decades. XST allows us to make a formal model of that operation, one that can often be written in just a few lines of math. Now we have a formal specification for the notion of indexing, one that is easy to check for correctness, and one that can often be implemented in code in a way that makes it easier to see that the code is correct.

And … I hope we’ll see this in a later article … if we implement set operations at the base of our system, we can often take the formal description of the operations to solve a problem, and execute them directly.

The above are all little more than claims. They are claims that Dave Childs has made, claims that I have made, claims that other folks who have learned about XST have made. The big question has always been “OK, how can we actually do that?” and the answer has never been provided by Dave. Dave does not claim to have the answer. He is a mathematician and he will teach you more than you want to know about this model, and the implementation, if any, is up to you.

That’s entirely fair, although I have to admit that it is frustrating to hear that this formalism makes these wonderful things possible, and to get very little help in how to do it. That’s not a knock on Dave: code is my realm, not his, and by my lights, his claims are correct: if you implement these ideas in code, you can get some amazing and powerful capability, with the greater reliability that comes from having a formal basis.

I’ve used up three jobs and two companies trying to do this, with varied success. One company got a really powerful product that they never understood, but ran out of patience with having me around. One fairly large company folded because the entrepreneur ran out of money. And Lee Johnson and I sold six copies of our little PC product.

And yet, every decade or so, I come back to this idea, because it’s fascinating, and it’s powerful, and it’s fun to implement the bits that an individual can implement. This is the 2020s’ turn.

Let’s talk about what XST might be able to do.

What Could XST Do?

We currently have a couple of operations implemented, that can do unions and intersections of sets that are internally represented by a table of scopes containing a table of elements. (An alternate word for the primitive scopes and elements is “atom”. We are using strings and numbers as our atoms.)

When we execute this code:

            local A = XSet()
            A:addAt("Jeffries", "last")
            A:addAt("Ron", "first")
            A:addAt("Hughes", "last")

The result is an arcane table containing those elements. And we can process it with a few operations, and it’s clear that we can write more. Each of those operations will be programmed to adhere as perfectly as we can to the math of “intersect” or “union” or whatever.

But in a database we would probably have some kind of fixed-length records, with fields like “first” and “last” and “age” allocated in fixed-length fields.

We might have internal tables to represent the layout, something like

name	start	end
last	1	20
first	21	40
age	41	43

When we want to retrieve all the records with “first” equal to “Ron”, we can go at least two ways. One way would be to read all the records, reconstitute them into our neat little XSet format, do the “restrict” operation (not yet defined here), and voila, there are the records we want. We could pack them back into the table format if needed.

Another way, arguably better, would be to read the records in their flat form, and compare characters 21-40 with “Ron” (expanded out to 20 characters), and if the compare matches, move all the characters of that record to an output buffer.

It’s “easy” to see how to do that, and it’s not terribly difficult to program. And we can of course see how to extend that to allow us to find all the records with first name “Ron” and age greater than 21, plus all the records with last name “Hoover”.

After a while it stops being “easy” and starts getting “messy”. But this has been done for decades. We could do it again.

Now let’s think about our retrieval of first name = “Ron”. It turns out that there is an set operation called “restrict”, which, given two sets, returns all the elements of the first set whose intersection with one of the records of the second set is equal to the second set.

If we were to “restrict” our database table with the set

{ { Jeffries@last, Ron@first} }

That would select all the records where last is Jeffries and first is Ron, but not records where last is Jeffries and first is Mike, nor records where first is Ron but last is Perlman.

OK, fine, so what? Well, it is possible to define our flat table as being made up of sets (records) somewhat like this:

{ J@1, e@2, f@3, f@4, r@5, i@6, e@7, s@8, M@21, i@22, k@23, e@24, 8@42, 1@43}

And our restricting set would look like this:

{ J@1, e@2, f@3, f@4, r@5, i@6, e@7, s@8, R@21,o@22, n@23 }

And by the existing definition of restrict, it would select just the characters of all the matching records.

Fine, but that would be slow, iterating over all the one-character elements of each record. Sure. But suppose that we had another implementation of restrict, focused directly on comparing byte strings, based on manipulation of the symbol tables (sets) that define the records, and we dispatched down to a custom-made “restrict” operator with the same formal definition but a fast byte-streaming implementation.

Then our retrieval would have a short phase of parsing the input, looking up values in tables, and executing high-speed byte-oriented set operations. Formal all the way down, much easier to create the formulae and easier to see that they were correct.

So. The promise of XST is that we can transform the data into any convenient form and process it there, using the same–or much the same–operations as we use to define it at the highest level of abstraction.

I say the promise. As far as I know, there are no running commercial implementations of XST in the world, and the few that have ever existed fell short of accomplishing the whole promise, although the Comshare implementation did some pretty amazing things.

Now, I’ve studied this subject long enough, and worked on implementing it enough times that I feel that I can explain why no one uses it if it’s so good.

The reasons are becoming visible in what I’ve written above. It’s pretty easy for us to sit down, draw out our flat records, decide how to allocate them across disk sectors (do they still have those?), and how to write code that will search those sectors efficiently. Another day or so and we can draw out how we’re going to create a sector index or a record index. We know how to program that sort of thing. We do it all day.

To do the XST treatment of the same thing, we first need to understand the math. And the math is full of notation, there exists, for every, element-of, scope transform, intersect restrict … it takes a cubic bunload of math to really define the relationship between a simple query and the bytes in the computer, especially if we want to include the disk storage formatting in there.

So to get XST right, you have to understand the math, you have to be able to apply it to solve your problems, in a general fashion, and you then need to implement all the operations necessary to carry out your tasks.

An effort that seems to take only a few days in the ad hoc world seems to take weeks and months of learning. Maybe at the end we’ll have something better, but given that no one seems to have done it, it’s a Career Limiting Decision to decide to implement a set engine.

If you get to the end, you’ve got something very fine. If time, patience, or money run out, you’re on the street.

Why The Hell …

Why the hell am I doing this? Well, the decade has come around, and it’s fun to do. Even better, I think I see a way to build something I don’t recall personally accomplishing in the past. I think I see how to make an abstract “restrict” operation work down to a very string-oriented restrict operation.

If I’m right … well, neat. If I’m wrong, that’s OK too, because here in the privacy of my living room, if I fail miserably, we all get a good laugh out out it.

Let me sketch what I’m thinking. It’s probably not quite what we’ll do, but it’s a stake in the ground.

I propose to implement the restrict operator, as roughly defined above. It comes down to creating a set containing all the records of A whose intersection with any record of B is equal to B. That will be easy. We’ll do it in the next couple of days, I expect.

Then I propose to define another kind of XSet, where the whole set is a single string, whose structure will be a series of fixed-length fixed-layout records packed together. This FlatSet will have an associated set that defines the XSet field-name scopes by the character offset in the string, like in the table above.

It seems obvious that we could write FlatSet:iterate to produce the scopes and elements expected by all the various operators who use iterate to do their job. We would essentially reconstitute records on the fly.

Easy.

But we could also implement restrict directly on the FlatSet class, and code it with some rack of substr operations so that it can fly over the table comparing bytes and copying whole records into a new FlatSet, which we can reconstitute if and when we want to.

In a small but significant way, we’ll be doing what XST promises, processing information without needing to know its internal format. (Of course the objects will know, but at the top, it’ll just be a restrict like all restricts.)

So that’s my plan. Why am I doing it? Because it’s there. Because I can. (Or at least think I can.)

Follow along. It’ll be interesting code and it’s always fun to see me mess up.