FAFO on GitHub

I had thought that I’d work on the flat data. My thoughts led me astray, to a glimmering of a possibly good idea. So I’ll just code something cute to finish the morning’s work.

Hello, friends! What shall we do this morning? One task that lies before us is to deal with the XFlatFile and its supporting class XFlat, which are a bit more under test than much of what we’ve done.

In an important sense, all of this is experimental, as much a learning experience as anything practical. Not that anything I program and write about here is actually practical. But I would like to push the XST stuff to the point of practicality. One way of framing that notion would be that I’d like to reach a point where I could write a definitive paper or book on implementing Extended Set Theory For Practical Purposes. Not that I would write either of those, necessarily, but I would like to know enough that I could.

Part of the strength of XST is supposedly that it can readily model and deal with any data structure. As it is fundamentally just math and is well-defined, I think that’s pretty much a given. But it’s also, supposedly, a reasonable basis for implementing such a system, not just modeling it. But even here, there are some key options.

Symmetric Difference

Let’s consider the sym_diff operator from yesterday. Its formal definition goes something like this:

A Δ B = { xy | (x ∈y A & x ∉y B) or (x ∈y B & x ∉y A) }

With a bit of math, it is “easy to show” that, switching notation:

A.sym_diff(B) = (A.union(B)).diff(A.intersect(B)), or
A.sym_diff(B) = (A.diff(B)).union(B.diff(A))

And so on and so on and scooby dooby do. But suppose we were “modeling” the behavior of some code using set theory, and we suddenly realized “hey, this renaming thing is just a symmetric difference”. (It could happen.) What does that tell us about how to implement the renaming thing? Des it require us to do a difference of unions? Or a union of differences? Or any particular code at all?

It does not. The thing we’re doing is a symmetric difference, even if we open code it in assembler. One way of looking at XST is that we use it to discover the essence of our data and our problem, and to guide our implementation.

However.

It’s perhaps possible to see that if we have a suitable library of useful set operations, expressing parts of our problem solution will be both easier, and more reliable. Easier because powerful operations are packaged into essential bits that we can understand, and more reliable because a lot can go wrong when we open code renaming in assembler and much less can go wrong when we use our carefully tested symmetric difference operator.

And so … some optimistic folx have tried to use XST to build practical software, to greater or lesser success. There is no commercial XST-based product out there, to the best of my knowledge, so I think it’s fair to say that the successes could have been greater.

Here, in these pages, we’re trying to build some Extended Set Theory code that could be practical. We may or may not succeed. Today, I am (again) one of those optimistic folx who want to give it a go.

Efficiency

Our basic operations so far aren’t horribly inefficient, but they aren’t magically fast either. And when we did our symmetric difference, it looked like this:

class XSet:
    def sym_diff(self, other):
        d1 = self.diff(other)
        d2 = other.diff(self)
        result = d1.union(d2)
        return result

Three operations. Two diffs and one union. Both diff and union create Python set instances and diff or union those, relying on whatever Python does to do the actual work. I think we can be sure that a hand-crafted symmetric difference might be faster than what we have here. However, getting it right might be another matter: it’s a bit complicated. Still, if XST is too slow, it’s not practical to use it.

Now for large amounts of data, more than we can fit in memory, the speed of the secondary storage is always the bottleneck. Well, always unless our in-memory operations are truly horrific. Most of the operations we have take time that increases linearly with the size of the set being processed. Sometimes the increment grows linearly with the size of a secondary set. Given a relation of 1000 records, restricted by two fields, we’ll do approximately 2000 fetches of fields.

Assuming with some credibility that the number of fields in a record is small, restrict will run O(n). I think that the assumption that we’ll be limited by our storage speed is fairly legitimate, but it’s just a guess.

XST offers the prospect of great speed if we take advantage of what it can do, but experience suggests that it’s not easy.

Recall that select preserves the original scopes of the elements it finds. If from

{ a1, b2, c3 }

we restrict out a and c, the result set will be:

{ a1, c3 }

If we were to take the scope_set of that, we would get

{ 11, 33 }

What is that set? In essence it is the “record numbers” of the records meeting the selection criterion. If we were to cache a bit of information, and someone ever reruns that query, we could simply fetch the records via random access, saving all that reading and comparing.

So part of the “promise” of XST is that, because our operations are well defined and made up of intelligible atomic operations, we might be able to do some pretty fancy indexing operations.

Suppose we saved the scopes for last_name=jeffries, and, separately, the scopes for job=serf. If we ever get asked for all the serfs named jeffries, we can simply intersect those two scope sets and what’s left is the scopes of all the jeffries serfs.

Idea

That gives me an idea. There’s some operation that returns elements of a set whose scopes are in a given set. I used to call it up_arrow. I don’t know what Childs calls it. It’s really a lot like subscripting, if the set in question were an array.

Added in Post
Wait! Doesn’t re-scope do that? I think it does. Wow! Here’s a test showing it:
    def test_re_scope_is_up_arrow(self):
        ntup = XSet.n_tuple(["a", "b", "c", "d"])
        sels = XSet.from_tuples(((1, 1), (3, 3)))
        selected = ntup.re_scope(sels)
        assert selected == XSet.from_tuples((("a", 1), ("c", 3)))

I wonder if we could do a sort of “virtual record”. Suppose we have a list of set elements, such as we would have with a custom implementation of n-tuple. Suppose we select from those records, on some criterion, but instead of returning a set of recordindex, we return a set of “indirect records”. When this set is enumerated for any reason, it would fetch the real records at that time. It would never hold on to them, only to the numbers.

I think we’ll have to explore that idea. I can think of two ways we might do it:

  1. A new XImplementation subclass that does record-reading behind the scenes. This would be pretty straightforward, I think.
  2. A new kind of XSet (XImplementation subclass?) that contains an expression that can be evaluated to produce the set.

I’ve been thinking that if we were going to save expressions, they would have to go “above” XSet. Is it possible that they could go inside, as an XImplementation? Or … is there another pattern we should follow.

Very small p for this p-baked idea. But p > 0. We’ll think about it.

A little code

I had thought that we might work on finishing up the XFlatFile and XFlat code and getting them moved to the right files and such. Thinking has taken me in a different direction. I do want to do a little code and I think this will be just right.

class XSet:
    def __add__(self, other):
        assert isinstance(other, self.__class__)
        return self.union(other)

    def __sub__(self, other):
        assert isinstance(other, self.__class__)
        return self.diff(other)

    def sym_diff(self, other):
        return (self - other) + (other - self)

Yes. We override + as union and - as difference. Good idea or abomination? You tell me. I’ll find out from my peers this evening.

Commit: implement plus and minus as union and diff.

With that, we’ll close.

Summary

This is an exploratory effort. We do hope to have some tangible result, showing that XST in Python can be useful. But it’s still exploration, something I do in the mornings to keep the demons away and to keep my mind as well-oiled as it can be. So we’re not under pressure to deliver any particular thing … not yet.

As such, a morning like this one, where we start out thinking we might work on some particular thing and then wind up thinking something entirely different, and then maybe we do a little thing just to have a commit by our name, in case the boss is checking the dashboard, well, a morning like this is just fine.

And if it leads to a working scheme like the expression set or even just the record-grabbing set … that will be well worth it!

See you next time!