PyXst: A small step

I think that this morning, I’ll take a small step toward having more than one implementation of a set’s data.

Hello, friends!

Today we’re going to start on giving XSets more than one internal representation of a set. I think I’ll try to refer to this level as a “data” implementation: how do we implement the data structures that support XSets.

The notion is this: An Extended Set is essentially a collection of ordered pairs, element+scope, where the element part is much like the element we find in an ordinary set, and the scope is a sort of label for the element. Neither the element nor the scope have any particular formal meaning, but informally to me, the element is the chunk of information and the scope is where it is in the set.

In an ordered n-tuple (think list, one-dimensional array) <a, b, c, d>, ‘a’ is in the first position, ‘c’ in the third. And in fact the extended set definition of <a, b, c, d> is

{ a¹, b², c³, d⁴ }

The 1, 2, 3 and 4 there are the scopes, and the elements are the a, b, c, d. If you’ve been reading along, you’re probably aware of this.

Now in our present implementation, the set above, the ordered 4-tuple, would be expressed like this:

tup = XSet.tuple_set(("a", "b", "c", "d"))

And its internal representation is this:

XSet(frozenset({('d', 4), ('b', 2), ('a', 1), ('c', 3)}))

It is a frozen set containing four two-tuples, one for each element of the n tuple. Notice that because it’s a frozen set, it’s not even in the numerical order. Set theory doesn’t care.

This is not, however, a really marvelous way to represent an ordered n-tuple in a computer. In Python, we would use either a list or a tuple. If we are going to edit the thing, we’d use a list, otherwise a tuple.

A Story!

We have a story to implement. It’s an internal story but since we are building a data manipulation system, it’s still a story. In fact we’ll write it as a story with a business purpose:

For storage efficiency and possible execution efficiency, the system should support storing ordered n-tuples in a compact form, without the unnecessary scope values being stored repeatedly.

Now it happens that the input to the tuple_set class method is exactly what we need: a Python n-tuple. We might, in principle, want to check that, but for now, we’ll accept that that’s what we’ll be receiving.

So, what we’d like to do is to store our n-tuple in the XSet contents and have it just work. But it probably won’t.

We can hack together a test to try it, like this:

    def test_hacked_n_tuple(self):
        n_tuple = ("a", "b", "c")
        test_set = XSet.tuple_set(n_tuple)
        test_set.contents = n_tuple  # Jam it in there
        assert test_set.includes("a", 1)

This fails. I am not surprised, but I’m not sure just how it will fail. Yes, I know I’m supposed to predict it. I think it will have returned False, because (“a”, 1) is not in (“a”, “b”, “c”, “d”). Now are you happy?

>       assert test_set.includes("a", 1)
E       AssertionError: assert False
E        +  where False = <bound method XSet.includes of XSet(('a', 'b', 'c'))>('a', 1)
E        +    where <bound method XSet.includes of XSet(('a', 'b', 'c'))> = XSet(('a', 'b', 'c')).includes

I do not know why Pytest gives these weird results with the bound method stuff. I’ll put it on the list to figure out. But what we see is that we got the False.

    def includes(self, element, scope) -> bool:
        if not scope:
            scope = self.null
        return (element, scope) in self

The pair isn’t in there. We cannot just dump a random tuple in there and have it work.

But we can probably write a little class that can do the job.

We’ll try. Here’s a little class:

class X_tuple:
    def __init__(self, data):
        self.data = data

    def includes(self, e, s):
        return s <= len(self.data) and self.data[s-1] == e

Now if we were to call that includes, I think it would work. Let’s patch up the test … and then the XSet includes:

    def test_hacked_n_tuple(self):
        n_tuple = ("a", "b", "c")
        test_set = XSet.tuple_set(n_tuple)
        impl = X_tuple(n_tuple)
        test_set.contents = impl  # Jam it in there
        assert test_set.includes("a", 1)

This does not pass, yet. Now we patch XSet:

    def includes(self, element, scope) -> bool:
        if not scope:
            scope = self.null
        try:
            return self.contents.includes(element, scope)
        except AttributeError:
            return (element, scope) in self

Before we do includes our XSet way, we try the contents object for includes. If it returns, we use the result. If it raises the exception, we do it out way.

Let’s enhance the test a bit and then reflect. I am not sure at all that we’ll commit this: it’s pretty spiky.

    def test_hacked_n_tuple(self):
        n_tuple = ("a", "b", "c")
        test_set = XSet.tuple_set(n_tuple)
        impl = X_tuple(n_tuple)
        test_set.contents = impl  # Jam it in there
        assert test_set.includes("a", 1)
        assert test_set.includes("c", 3)
        assert test_set.excludes("a", None)

The last assertion fails. Why? Because the scope is not a number, and my hacked-in code assumes that it is.

    def includes(self, e, s):
        return isinstance(s, int) and s <= len(self.data) and self.data[s-1] == e

Test runs again. Extend a bit further:

    def test_hacked_n_tuple(self):
        n_tuple = ("a", "b", "c")
        test_set = XSet.tuple_set(n_tuple)
        impl = X_tuple(n_tuple)
        test_set.contents = impl  # Jam it in there
        assert test_set.includes("a", 1)
        assert test_set.includes("c", 3)
        assert test_set.excludes("a", None)
        assert test_set.excludes("a", 3)
        assert test_set.excludes("d", 4)

OK. Since we are green and the only change to XSet is the try of includes on its contents, I feel OK committing this, and I’d rather have this stepping stone to work from than not. Commit: Initial hacked implementation of X_tuple with includes method.

Let’s reflect and, probably, sum up.

Reflection

We have a small, test-driven, proof of concept here. The tiny class X_tuple can contain a sequence that represents an ordered n_tuple. I am confident, but did not test, that it would work even if we passed in a series of personnel records. Why? Because the equals operator works on sets, so when our new includes says self.data[s-1]==e, it’s going to do the right thing.

If we follow this pattern, each set operation can, if it chooses to, try to delegate its operation to the XSet contents, and upon getting the AttributeError, do the job itself. There is one issue that I am aware of, the method is_subset. It is currently implemented by calling issubset on the contents, because, in their wisdom, Python named the method like that. I choose not to do that. As things stand, frozenset will not recognize is_subset and it will all work, but it would be nicer if the delegation worked. Maybe. Anyway, it’s an issue.

The pattern isn’t bad. I could imagine that we could pass down a symbolic name to the contents and that it would then have only a single method with a dispatch but I don’t see that as better. (If that makes no sense to you, ignore this paragraph.)

We can see a pattern emerging, something about the structure of one of these “override” delegated implementations: they’ll want to check to see if the inputs are something they can handle, and they may sometimes need to raise AttributeError directly, to let the XSet default method deal with things.

We’ll learn more about that as we work with our new X_tuple a bit more. Right now, it can’t even iterate itself, so I am sure we can write a test that will fail until we work out that detail.

I can imagine another specialized structure, X_classical, a set all of whose scopes are known to be null. Again, it would save storage, if we care about that kind of thing.

Of course the “big” issue is probably sets in file storage. We may get there …

Finally … I’m not sure how valuable these overrides will be. I do think we’ll find other structures that are useful. For example, our “personnel” records each include both the field name and the field value:

hill = XSet([("hill", "last"), ("geepaw", "first"), ("serf", "job")])

That isn’t as efficient as it might be. A better structure for a set of records all with the same field names would surely be to store the field names once, perhaps with an associated field number, and then store the fields in an array in field-number order. Or, of course, if on files, we might store them in fixed-length records of bytes, again with some kind of “header” structure that we could use to fetch the values as needed.

We could let the system return records to the form above whenever it wanted to, but with some operations working at the lower level if it made more sense.

As it stands, I think these would be “programmer” decisions, but it seems possible to write code that would analyze a set and make a programmatic decision about its storage structure. I am hopeful that this little project will bear enough weight, and that my interest will be sustained, long enough to try some of those things.

Why? Because I find it interesting, and because Extended Set Theory has long been a bête noire for me, or at least a long-term sometimes friend sometimes foe. The Magneto to my Charles Xavier. The Moriarty to my Holmes. A challenge often met and never quite conquered. A sometime passion. An occasional fixation. Something to do.

Summary

This tiny experiment tells me that plugging unique specialized set structures underneath XSet is surely feasible and quite possibly useful. We’ll proceed down that path to see where it leads. Meanwhile, I’ll need to be creating more set operations, and trying to find ways to make this thing at least a little bit useful.

I am afraid that along the way, I’m going to have to learn how to create user interfaces in Python. I expect that to be tedious but simple.

I hope you’ll follow along, and I welcome questions and comments.

See you next time!