FAFO on GitHub

A bit of research into the past of Extended Set Theory suggests an interesting possibility for the future, and provides the joy of working something out on my own. Includes some possibly amusing history about how primitive man created such documents.

The approach to XST that I’m taking here, and have taken in prior exercises, is quite different from the original approaches taken decades ago. To the best of my knowledge, the earliest systems had a very simple data model, essentially fixed-length tuples of characters. A given set was just a flat file of characters, arranged in records of equal length, and treated by the underlying set operations as nothing more than that.

I emphasize “to the best of my knowledge”. The creator of Extended Set Theory has always been quite reticent to discuss actual implementation of the theory in software, despite claiming at the same time that implementations of the theory can provide great benefits in performance, as well as other benefits that come from a compact, consistent, formal way of describing information, data, and storage. That reticence continues to this day: the creator will not discuss how his software did what it did. He has every right to do that, of course.

It is entirely possible that I have spent more thought and effort on Extended Set Theory than any living software developer, and I will freely grant that my results have been less than compelling. I would also say that I’ve found them interesting and evocative. I have personally never taken my experiments all the way to a real product, and the closest a team of mine ever got was a very fine program that was never marketed. A downer for everyone involved, I am sure. I myself had been driven out of the company before all that happened.

I have managed to find a couple of old old old documents that describe a bit about how the original program, “STDS”, did its work, and I’ll include links to those below.[1][2][3] I’ll summarize here what I’ve learned.

STDS

This is to be taken as my best understanding. I could be mistaken in many ways.

STDS’s internal data model was what I would call “packed arrays of fixed-length arrays of characters”. The sets were all of a size that would fit in memory. Larger sets were represented by a list of these smaller sets. The documents I’ve found are hard to interpret and—I believe—inconsistent, but apparently these data sets were kept sorted character-wise and were therefore readily merged and differenced and such, as needed, with simple character-focused comparison operations. The text also refers to “binary search” being used when it was necessary to determine whether a given record was or was not in a given set.
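
If I render that understanding as Python, I get something like the sketch below. Everything in it is my invention for illustration; the record length and the sorted-list representation are my assumptions about STDS, not documented facts.

    import bisect

    REC_LEN = 44  # an assumed record length, for illustration only

    def records(buffer):
        # split a packed character buffer into fixed-length records
        return [buffer[i:i + REC_LEN] for i in range(0, len(buffer), REC_LEN)]

    def intersect(a, b):
        # one merge pass over two sorted record lists, keeping records in both
        result, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return result

    def contains(sorted_records, record):
        # the "binary search" membership test the documents mention
        i = bisect.bisect_left(sorted_records, record)
        return i < len(sorted_records) and sorted_records[i] == record

With both operands sorted, intersection, union, and difference are all single merge passes, which would be reason enough to keep the sets sorted.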

Possible inconsistencies included the claim that a deletion of a record was accomplished by moving the last record to the slot of the deleted record, and the claim that an added record was placed at the end of the set. Either of these operations would break the sort. No mention was made as to how this was dealt with. If I were going to maintain the sort, I would personally not choose either of those places to start.

Relatedly, the paper also claims to have implemented a sort that was linear in the number of records being sorted. Interesting, to say the least. Of course, inserting a single record into a sorted set can be done quite quickly.
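
For what it’s worth, here is how one might keep the sort intact on insert and delete in Python. This is my sketch, not a claim about what STDS did.

    import bisect

    def insert_record(sorted_records, record):
        # bisect finds the slot in O(log n); the list shift makes the whole
        # insert O(n), perhaps one sense in which a "sort" could be called linear
        bisect.insort(sorted_records, record)

    def delete_record(sorted_records, record):
        # remove a record in place without breaking the sort
        i = bisect.bisect_left(sorted_records, record)
        if i < len(sorted_records) and sorted_records[i] == record:
            del sorted_records[i]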

There is some arcane text and an eldritch diagram describing the data structure, but as far as I can glean it’s just a diagram showing a collection of the individual component sets (called “generator” sets), and another structure showing that a given set is the union of a list of the generators. If the system ever recorded any other construction scheme than union, I haven’t discovered that claim, but it does seem to me that if one is going to allow for a set to consist of the union of two other sets, one might as well allow for intersection or difference or other operations.
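
My reading of that structure, rendered as a speculative Python sketch with names entirely my own: a large set is just a list of smaller sorted “generator” sets, and iterating the union is a merge across all of them.

    import heapq

    class GeneratedSet:
        # a large set represented as the union of smaller sorted "generator" sets
        def __init__(self, generators):
            self.generators = generators  # each generator: a sorted list of records

        def __iter__(self):
            # merge the sorted generators, skipping duplicates across them
            previous = None
            for record in heapq.merge(*self.generators):
                if record != previous:
                    yield record
                    previous = record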

It is “easy” to imagine that when a program called for the intersection of two of these sets, the fact might just be recorded in the system tables, and only when that set was used would the intersection actually be done and the set materialized. The reference to the intersection could then be replaced with a pointer to the now-actual set, or the expression could be saved as well, so that if another request was later made for that intersection, its saved value could be discovered and reused.
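
A speculative sketch of that idea, again with invented names: record the expression, materialize on first use, and hand back the saved result on any later request for the same expression.

    class LazySet:
        def __init__(self, op, left, right):
            self.op, self.left, self.right = op, left, right
            self.value = None  # the materialized set, filled in on first use

        def materialize(self):
            if self.value is None:
                left = self.left.materialize() if isinstance(self.left, LazySet) else self.left
                right = self.right.materialize() if isinstance(self.right, LazySet) else self.right
                self.value = self.op(left, right)
            return self.value

    expressions = {}  # (op, left, right) -> LazySet: the "system table"

    def request(op, left, right):
        # assumes the operands are hashable, e.g. frozensets or other LazySets
        key = (op, left, right)
        if key not in expressions:
            expressions[key] = LazySet(op, left, right)
        return expressions[key]

Here request(frozenset.intersection, a, b) costs almost nothing until someone materializes the result, and a second identical request finds the first one’s saved value.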

Aside
I hadn’t had the above realization quite so clearly in mind until I wrote the paragraph above, but now I want to reread the relevant articles to see whether there is evidence that STDS actually did that. And I am tempted to do an experiment to show how it might be done.

The STDS “programming language” was very low level, essentially amounting to single function calls, each calling for some operation to be performed on some input sets and producing one or more output sets. The job of the STDS user included all the hard work of modeling actual files in terms of fixed-length records of characters, and if there was any built-in use of field names or the like, I have not been able to discern it. It might have been in there.

How Did Primitive Man Create Documents?

That’s about all that I “know” or surmise about the history, as reflected in the ancient materials, often inscribed in an obsolete form called “Courier”, where, apparently, if you can believe this, letter forms were applied to a material called “paper” using carved or cast metallic presses that, applied with vigor to a pigment-impregnated ribbon of cloth, would impress a vague image of the letter in question on this “paper”, resulting in a thin sheet of material upon which words could often be made out. Separate research suggests that this so-called “paper” substance was often made by boiling wood pulp and pressing the resulting substance into flat sheets that were then dried in the sun and trimmed for use. Quite amazing, all of it.

Back to Today

I had thought of learning enough about how the old system worked to apply the ideas in the program we’re working on here. My current understanding makes me believe that directly implementing the “array of characters” operations in Python is not feasible. It might be possible to do things inside the Python “buffer” protocol, but even there one would almost certainly want operations for thrashing the bytes around without creating and destroying Python objects. We might have to write a C library for byte-thrashing.

Yeah, no. Not gonna do that.

But could we do something similar but different? Let us speculate.

Rank Speculation

Our XFlatFile object contains a very large Python string of characters. Our usual operations slice out a substring amounting to one record. When the program wants the fields (scopes) of that record, the fields are extracted at that moment and presented as standard 2-tuples (element, scope).
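
Roughly like this, though the real XFlatFile code surely differs in detail; the field boundaries are the ones from our job database, and the function name is mine.

    # illustrative only: slicing one 44-character record into (element, scope) pairs
    FIELDS = (('last', 0, 12), ('first', 12, 24), ('job', 24, 36), ('pay', 36, 44))

    def record_tuples(record):
        return [(record[start:end].strip(), name) for name, start, end in FIELDS]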

Suppose we wanted to do a selection from a flat file, perhaps all the serf records. We might do that with restrict against a set containing the element ‘serf’ at scope ‘job’. In a flat file, the same result could be obtained by finding all the records where characters 24 through 35 contain ‘serf’ padded with blanks. (The job field is 12 characters long in our data set.)

I have to let the code participate in this speculation.

    def test_find_serfs(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        buffer = ff.buffer
        result = []
        for rec in range(1000):
            rec_start = 44*rec  # each record is 44 characters: 12 + 12 + 12 + 8
            job_field = buffer[rec_start+24:rec_start+36]
            if 'serf'.ljust(12) == job_field:
                record = buffer[rec_start:rec_start+44]  # the whole record, not just the job field
                result.append(record)
        assert len(result) == 200
        packed = ''.join(result)
        assert len(packed) == 8800  # 200 records of 44 characters each

That test slices the ‘job’ field from each record in the buffer and compares it with ‘serf’ (expanded to 12 characters). If it matches, the entire record is sliced out of the buffer and put in the list. The list is checked for the right number of records, and then packed down to a single string using join. That packed data would make a suitable buffer for a flat file set (with no file name).

It would “just” be a SMOP to convert a restrict on an XFlatFile to do something like that.[4][5] I wonder if it is slower or faster than an actual restrict on the same data. Notably faster, I’d bet. I’ve made a note to check that, someday. Maybe.

So, we can speculate that if our set processing involved a lot of large sets that could be represented as regular arrays of characters, we might be able to get some interesting performance, even in Python, by judiciously slicing and dicing the array. There might even be a way to use the buffer protocol to avoid the join by moving the record directly from one buffer to another. I must look into that: it’s likely that the join is the most expensive part of the code above.
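
For instance, if the buffer were bytes rather than str, something along these lines might serve; memoryview slices are views, not copies, and writing them into a preallocated bytearray avoids both the per-record string objects and the final join. This is untested speculation, and all the names are mine.

    # a sketch of the selection loop using the buffer protocol; assumes bytes, not str
    REC_LEN, JOB_START, JOB_LEN = 44, 24, 12
    TARGET = b'serf'.ljust(JOB_LEN)

    def select_serfs(buffer, record_count):
        src = memoryview(buffer)
        out = bytearray(len(buffer))  # worst case: every record matches
        write = 0
        for rec in range(record_count):
            base = rec * REC_LEN
            job = src[base + JOB_START : base + JOB_START + JOB_LEN]
            if job == TARGET:
                out[write : write + REC_LEN] = src[base : base + REC_LEN]
                write += REC_LEN
        return bytes(out[:write])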

Summary

I think I’ll close this out here, having gone from musing about the past to an idea for the future. As much as I wish I knew exactly how the old STDS program really worked, I’ve had more fun from only knowing a little and speculating about how it might work and how our 2024 code might work. As is often the case, working out an answer, perhaps given hints and ideas, is far more fun than being handed the answer.

Thanks, Dave, your reticence has given me some pleasure that an explicit answer never would have!



  1. Dave Childs’ article for Lawrence Livermore Labs, including some details of STDS storage and operation.

  2. NASA Langley report on the prospects of set theory, including some description of its operation.

  3. Technical report including a description of STDS storage.

  4. “Just”. We know what that means. Roughly, “Here there may be dragons”. Still, with that code in hand, how hard could it be? 

  5. Small Matter Of Programming. cf. “just”. But really, how hard could it be?