FAFO on GitHub

I want to begin by thinking about storage: how do we best produce specific memory / file formats? The article turns out to be pure speculation, but perhaps useful speculation.

Hello, friends! Happy Fat Tuesday!

At any point, feel free to skip to the summary, but there are, I think, some slightly important bits there.

I was reading some of the old XST papers, and some of my ancient words on Ward’s wiki. I may have been more fanatical then that I am now. Or maybe just a couple of decades younger. Still, I stand by most of what I said there, and it didn’t seem that I was being too rude or abrupt, just trying to explain why anyone in their right mind would consider XST as an approach.

There is a key phrase up there about one’s mind, and we’ll not be exploring it further at this time. Suffice to say that I find the topic fascinating, because I can almost see the path from theory to code.

There is absolutely no doubt in my mind that XST can be used to describe any data structure that we can imagine. That part is almost easy. The part that is much harder, for me at least, and for everyone I know who has ever tried to implement XST, is the transition from the theory to the code. To that end, today I want to explore storage, and in particular, the mapping from a flat file structure to set theory, and back.

Let’s consider a single record of a flat-format relation containing last and first name. I think our current preferred set structure for Bill Jones’ record would be:

{ Joneslast/, Billfirst },

where all of “Jones”, “last”, “Bill”, and “first” are strings.

We’ve already created records in that form and processed them with set operations like restrict, which can select all the records matching a given set. But while that is interesting and powerful, we have a serious question to consider, which is how we could get from a flat format to our preferred form, and back.

There is a deeper question as well: do we even want to get to our preferred form at all? It’s sort of basic and canonical … but is it really preferred?

Recall that by the time it comes down to processing set operations, we simply require some 2-tuples containing element and scope. In fact, since we write all our processing in terms of e and s (element and scope), we aren’t really even tied to tuples.

Note to Self
I think we should try to move this thinking to code very soon. Arguably the X_tuple, which is started, is an example. Or we might really want to go further than that. XFlat? One way or another, we should try to make these ideas more concrete.

In our imaginary relation where Bill Jones resides, the file is flat, and has two five character fields packed into it, like this:

JonesBill 
Hsu  Xi
SmithJason

Hm. Implementing this would almost be easier than talking about it. There may be a lesson there. Anyway …

To iterate that set in a general way, we would need an associated symbol table, associating field name with origin in the record and length, something like

first 5 5
last  0 5

A key fundamental operation on any set, like the Bill Jones record, is to iterate it, producing its elements and scopes, pair by pair. Here, given the JonesBill record, we’d iterate the symbol table, producing “Bill” and “first”, and then “Jones” and “last”.

To check includes(element, scope, we’d work similarly, looking up the provided scope in the symbol table, then unpack those characters and compare them.

Easy-peasy. -ish.

That’s so easy (probably) that I would probably not even try to figure out the set theory for it. And, let me be clear here: I am not at all sure how to do it.

Our code so far treats strings, integers, floats, and XSets of all kinds as legitimate elements and scopes.

One formal set-theoretic representation of the Bill Jones record, in storage, might be

{ J0, o1, n2, e3, s4, B5, i6, l7, l8, \b9 }

(XST chooses to start its n-tuples at 1. Having tried that above, I decided to start this thing at 0. It’s not an n-tuple, but it is a legitimate set, and I think it might be easier to think about.)

So we have that record, described as a set of characters. There is a set operator that can re-scope a set, which can be used to “renumber” things, so that we could tell it to re-scope the record above to this one:

{ B0, i1, l2, l3, \b4 }

That’s pretty close to a string, I guess. Since XST does not address the types of the element and scope, as far as I know, we’d need some ad-hoc operator to convert that to a string. From the viewpoint of set operations, they wouldn’t care how the string came to be, they just compare it to other strings. And, if we wanted to, we could represent all our internal strings as vectors of characters. The set operations would still work. They’d probably be 100x slower, but they would still work.

I think we would more likely observe the formal description, and posit a function that takes n characters and returns an n-character string, and then forget about it. Essentially, as we have already done, we draw a line between the formal behavior and the code. That line, today, is the interface definition for XImplementation:

class XImplementation(ABC):
    @abstractmethod
    def __contains__(self, item):
        return False

    @abstractmethod
    def __iter__(self):
        raise NotImplemented

    @abstractmethod
    def __hash__(self):
        raise NotImplemented

    @abstractmethod
    def __repr__(self):
        raise NotImplemented

Now, at present, we really only have the one set structure, the one in XFrozen, which is a frozen set of tuples, all the way down. We are on the verge of being able to create alternate structures, like X_tuple, which allows storage of an n-tuple without explicitly including its integer scopes. Our partial code is this:

class X_tuple:
    def __init__(self, data):
        self.data = data

    def __contains__(self, t):
        e, s = t
        return isinstance(s, int) and s <= len(self.data) and self.data[s-1] == e

The data member here is intended to be a tuple of values, and, for __contains__, the scopes can only be integers from 1 to len(data). (Hmm, should check for > 0 there.)

We’re just trying this structure to get a feeling for alternate structures. In memory, there won’t be much difference between explicit and implicit scopes, for reasonably-sized sets. Well … maybe it would be worth it. I don’t know.

But the point is, by the time we have done any processing at all, we have our main XFrozen implementation in hand. If we had a huge relation with thousands of records in a flat format, we would be unwise to expand that whole thing into the XFrozen format all at once. Memory would explode.

So we might have a set format XFlat, and we might want results from XFlat sets to remain XFlat, instead of turning into XFrozen, as they will today.

A light begins to shine …

I’m starting to see what might work. I pity anyone who has read this far: you may be terribly confused. Bear with me, or bail out: your choice. Skip to the summary

We have a very simple set operation, project. It takes a set of records, and a list of scopes, and returns a set of records containing only those scopes from the original set:

    def project(self, field_selector: Self) -> Self:
        projected = [self.project_one_record(record_element, field_selector)
                     for record_element, record_scope in self]
        return XSet.classical_set(projected)

    def project_one_record(self, record_element, field_selector):
        new_atoms = [(field, field_name)
                     for field, field_name in record_element
                     for desired_field_name, _ in field_selector
                     if desired_field_name == field_name]
        return XSet(new_atoms)

I’m not sure what the official set operation is for project_one_record and didn’t want to scrounge around in the Childs writings to figure it out. Possibly it’s just rescope: I’d have to think about it.

But we’re working toward set operations offering the implementation the opportunity to perform the operation. The basic idea is that the XSet level of the operation would try:except: a call to the implementation, which would either do the operation or raise the NotImplemented exception, in which case we’ll handle it at the top.

So we could, in XFlat, handle project_one_record and instead of returning an XSet with an XFrozen implementation, return one with an XFlat implementation.

That gets us a long way down the road toward handling flat-record relations on files: we at least have the record in the right format.

There would be more work to do, including getting the stuff read in (fairly easy I think) and written out (rather more difficult, I think), but it might be a step on the road.

Let’s sum up. I do think there’s a point or two here.

Summary

Forgive me if you read this and found it boring. I hope you skipped here if it got too fluffy. I’m going to wax meta-fluffy for a moment, then create a tentative plan.

What happened here is design thinking, at least the way that I do it. I hold random bits of idea: problem, theory, code in my mental hands. I move them around, try to fit them together this way or that. After a while a couple of things may seem to fit together, and I begin to see a path that is interesting. Or, not uncommonly, nothing seems to fit and I put down those pieces and pick up others.

To me, it’s like assembling a sort of 3D or nD puzzle made up of odd bits and shapes. I don’t try to force them. I let my thoughts go where they’ll go. Sometimes they loop. Sometimes they signal “dead end”. Sometimes I just get tired and stop.

But almost always, if not after one session, then after a few, I have an idea for something to try.

What I propose to try, soon, is to implement a new kind of XSet XImplementation, XFlat, which will comprise a symbol table mapping field names to offsets, and a character string containing the fields in question. For the first cut, I’ll have all the fields be strings. The extension to converting some of them to numbers will probably follow.

I think it’ll be “easy” to create another kind of set, XFlatFile(?), that covers a flat file and produces XFlat records as it iterates. I think it will be “less easy” to figure out how to use our set operations to create such a file. By “less easy”, in this case, I mean “I don’t see at all how to do this”. But that’s for another day.

For this day, musing has illuminated a bit of a path forward. And that’s all we need: a path forward.

Tentative plan …

We’ll first take a look at the XImplementation stuff, to make sure that it’s close enough to solid to move forward. Then we’ll start writing some tests of converting flat records to something that can run through set operations. Probably at first, we’ll just produce regular XFrozen sets from them. Then we’ll work on keeping them flat.

I do not see how that will go. I think I do see some steps that will let the dim light of my mind illuminate a few more steps. That’s how I do it.

See you next time!