Bad Idea

FAFO on GitHub

Putting a specialized rename into our flat set showed us that the symbol table there is quite ad hoc. Let’s make it more like a set. (Turns out: no.)

I roll back all the code in this large article, because it becomes more and more clear that at least the particular implementations I choose are wrong, and quite possibly the whole idea is wrong.

I strongly advise you to skip to the reflection. Failing that, scan, knowing that we’re going to roll it all back.

The XFlatFile set implementation uses a simple ad-hoc table as the symbol table: a list of name, starting index, and stopping index for each field. Its creation and usages are these:

class XFlat(XImplementation):
    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_definitions = [(symbol, start, start := start + length) for symbol, length in by_two]
        return field_definitions


    def __contains__(self, item):
        element, scope = item
        for symbol, start, end in self.fields:
            if symbol == scope:
                field = self.record[start:end].strip()
                if field == element:
                    return True
        return False

    def __iter__(self):
        for symbol, start, end in self.fields:
            yield self.record[start:end].strip(), symbol

class XFlatFile(XImplementation):
    def rename_contents(self, re_scoping_set):
        new_names = []
        for name, start, len in self.fields:
            changed_name = name
            for old, new in re_scoping_set:
                if name == old:
                    changed_name = new
            new_names.append((changed_name, start, len))
        new_impl = self.__class__(self.file_path, new_names, self.scope_set)
        return XSet(new_impl)

Of course, having set ourselves the task of eating our own dog food here, we need to ask ourselves, selves, what should the symbol set be?

From the viewpoint of renaming, which is on our mind, a set like this would be ideal:

{ info1^name1, info2^name2 }

With that structure, we can just use our regular rename operation.

The infos could be anything. Python has a slice object that can be used, well, to slice things. I happen to have written a test to show me how the method works, at some time in the past. Here it is:

    def test_slice(self):
        a = 'abcdefghi'
        d = a[3:6]
        assert d == "def"
        assert isinstance(d, str)
        s = slice(3, 6)
        d2 = a[s]
        assert d2 == 'def'

Note that the object ‘s’ is a “slice” and, as advertised, it slices. So let’s built our symbol table like this:

{ slice_1^name_1, slice_2^>name_2 }

My cunning plan is just to build it that way and when things break, fix up the other places. I think it possible, but not too likely, that we may come up with ideas for new set operations along the way. We are clean and green: let’s get started.

We currently create our symbol table like this:

    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_definitions = [(symbol, start, start := start + length) for symbol, length in by_two]
        return field_definitions

And we’ll do it, instead, like this:

    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_elements = [(slice(start, start := start + length), symbol) for symbol, length in by_two]
        return XSet.from_tuples(field_elements)

I feel pretty good about that. 17 tests fail. I am hoping for some simple tests of the fields method among them.

Ow! The error is in from_tuples: unhashable type: ‘slice’.

Well. That bites. I can’t think why slice would be unhashable, but there we are.

OK, plan b, we’ll just make a tuple of start, stop:

    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_elements = [((start, start := start + length), symbol) for symbol, length in by_two]
        return XSet.from_tuples(field_elements)

Still 16 tests fail, but which ones? This could be just fine.

The most basic test is this:

    def test_flat_set_makes_fields(self):
        f1, f2, f3 = XFlat.fields(("last", 12, "first", 10, "job", 20))
        assert f1 == ("last", 0, 12)
        assert f2 == ("first", 12, 22)
        assert f3 == ("job", 22, 42)

Since XFlat.fields returns an XSet, I am more than a bit curious what that multiple assignment does, and am quite surprised to discover that it actually returns three element-scope tuples from the set. Upon reflection, that must be because an XSet can be iterated and Python went ahead and did it. Unfortunately for us, the elements are not in the input order, so the assertions fail, not just because they are the wrong form, but because they’re in the wrong order. We can probably fix that:

    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_elements = [((start, start := start + length), symbol) for symbol, length in by_two]
        return XSet.n_tuple(field_elements)

That will keep the elements in the original order. The test still fails, but fails better:

>       assert f1 == ("last", 0, 12)
E       AssertionError: assert (((0, 12), 'last'), 1) == ('last', 0, 12)

Now we have the right one, but we need to unwind it a bit. The element there, if we consider its residence in a set, is

{ ((0, 12)^last)¹ }

We’ll unwind in the test:

Wait. I’ve made a mistake. Our n_tuple set does give each element of the set an index 1, 2, 3, but it does not keep them in that order. We could make an XImplementation for that case, but we have not. So we can no longer expect that the elements will come out in numeric order.

Our test must accommodate that.

    def test_flat_set_makes_fields(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        for entry, number in fields:
            slice, name = entry
            if name == 'last':
                assert slice == (0, 12)
            elif name == 'first':
                assert slice == (12, 22)
            else:
                assert slice == (22, 42)

OK, we have verified that we built what we think we built. I think the rest of the tests are likely failing because we don’t use our symbols correctly.

    def test_contains(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        record = 'Jeffries    Ronald    Wizard              '
        flat = XFlat(fields, record)
        assert ('Jeffries', 'last') in flat
        assert ('Ron', 'first') not in flat
        assert ('Wizard', 'job') in flat

This will be using __contains__ in XFlat:

    def __contains__(self, item):
        element, scope = item
        for symbol, start, end in self.fields:
            if symbol == scope:
                field = self.record[start:end].strip()
                if field == element:
                    return True
        return False

That needs to be just a bit different, to accommodate that our symbol table is a …

Why is it what it is? Why is it a set of pairs?

I believe that I have made a mistake and made the set too deep. I may have to roll back but let’s look:

    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_elements = [((start, start := start + length), symbol) for symbol, length in by_two]
        return XSet.n_tuple(field_elements)

I think from_tuples was correct. My attempt to get things in the right order made a nested set. Remind me to reflect on that. But first let’s make it work.

    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_elements = [((start, start := start + length), symbol) for symbol, length in by_two]
        return XSet.from_tuples(field_elements)

And now back to the same old test:

    def test_flat_set_makes_fields(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        for entry, number in fields:
            slice, name = entry
            if name == 'last':
                assert slice == (0, 12)
            elif name == 'first':
                assert slice == (12, 22)
            else:
                assert slice == (22, 42)

It should now say this:

    def test_flat_set_makes_fields(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        for slice, name in fields:
            if name == 'last':
                assert slice == (0, 12)
            elif name == 'first':
                assert slice == (12, 22)
            else:
                assert slice == (22, 42)

And passes. Now back to the contains one:

    def test_contains(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        record = 'Jeffries    Ronald    Wizard              '
        flat = XFlat(fields, record)
        assert ('Jeffries', 'last') in flat
        assert ('Ron', 'first') not in flat
        assert ('Wizard', 'job') in flat

And:

    def __contains__(self, item):
        element, scope = item
        for symbol, start, end in self.fields:
            if symbol == scope:
                field = self.record[start:end].strip()
                if field == element:
                    return True
        return False

That should now be:

    def __contains__(self, item):
        element, scope = item
        for slice, symbol in self.fields:
            if symbol == scope:
                field = self.record[slice[0]:slice[1]].strip()
                if field == element:
                    return True
        return False

The test_contains now passes. Here’s another failing test:

    def test_iteration(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        record = 'Jeffries    Ronald    Wizard              '
        flat = XFlat(fields, record)
        elements = []
        scopes = []
        for e,s in flat:
            elements.append(e)
            scopes.append(s)
        assert elements == ['Jeffries', 'Ronald', 'Wizard']
        assert scopes == ['last', 'first', 'job']

We need to fix the XFlat iterator this time. Old:

    def __iter__(self):
        for symbol, start, end in self.fields:
            yield self.record[start:end].strip(), symbol

Instead:

    def __iter__(self):
        for start_end, symbol in self.fields:
            yield self.record[start_end[0]:start_end[1]].strip(), symbol

The test still fails, because it has been assuming that the symbols will come out in input order. We could enforce that, but I am not sure whether we should.

Let’s fix this test:

    def test_iteration(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        record = 'Jeffries    Ronald    Wizard              '
        flat = XFlat(fields, record)
        elements = set()
        scopes = set()
        for e,s in flat:
            elements.add(e)
            scopes.add(s)
        assert elements == {'Jeffries', 'Ronald', 'Wizard'}
        assert scopes == {'last', 'first', 'job'}

That passes. Let’s reflect.

Reflection

Added in Post: I hope you skipped to get here. On my time-line, I did all the work above and got here. In the next few paragraphs you can see me realizing that I’m not making things better.

The discovery that this statement returns tuples from an XSet was a surprise:

f1, f2, f3 = some_x_set

Must think about that, it might be good to formalize it, or at least to understand it.

There are probably more tests that are assuming a particular order to the results that come back when we iterate a flat set. We have broken that expectation by choosing to use the symbol as the scope of the start-stop tuple, because back when the symbol table was a table, it was in the order the names were defined, which was from left to right across the record.

To model the order, we do need another level in the set, if we want the symbols to have the set form.

I’ve just looked ahead at the next failing test, and it’s basically going to be the cause for a lot of them. To get the record length of the flat file, we look at the last entry in the symbol table. The code to do that no longer works since the table is now a set, and even then we’d have to search through looking for the maximum, since we have lost the order.

Added in Post: It becomes clear that this is not a good path.

All of this tsuris has come from my thinking that it would be better if the symbol table was an XSet, because our processing of it would be less ad hoc, in particular in rename, which we have not even reached yet.

I think this series of steps has been a mistake.

I am paused, thinking, at this point. I am sure we should roll back. I am stuck on the sunk cost of this article, me writing it and you reading it.

And yet, the article is a part of the historical record of building this Extended Set Theory experimental program. So I’ll edit it so that you can skip right down here, and only scan the details if you want to.

Roll back. Done.

Summary Reflection

Not every idea is a good one.

One way of looking at what happened here is simple, and true: I had an idea for how to make things better. I tried the idea, tried hard with it, and determined that it was not good enough. I bravely but reluctantly rolled it back.

Slightly less simple: the implementation of the symbol table on flat files had a property that is rather desirable, but that I had not really thought deeply about: the fields of the set appear in left to right order when you iterate such a record.

Generally speaking, a set has no defined order. And there is no guarantee, if we iterate the set, that elements will appear in any particular order. Set iteration isn’t even part of XST. We need it, but it’s really just our way of implementing the more abstract math that says for each of these there exists one of those such that blah blah mumble.

That can lead to some, um, issues when writing applications. You probably want the fields of the table you print out to come out in the same defined order every morning when you get the report. If you don’t specify the order, you want something sensible, and in the case of a set that is built on top of a file, you almost certainly want the order to be the same as the field order of the file. Probably.

The central point is that if we use sets to represent things in our application, we need to use sets that truly model what we want. Since sets have no inherent order (not even n-tuples), we will have to model that behavior explicitly.

In the case in hand, our symbol table, I modeled something that would be easy to rename, but in so doing, I lost some of the implicit but desirable other behavior, specifically the field ordering.

It’s possible that the symbol table “should” be an XSet, but the set I chose, given the facilities we now have, won’t quite do.

I must reflect further about this, because I would really really like symbol tables to be useful sets, as they will surely come up again and again. There is sort of a workaround available:

It is essential, experts would tell us, that we understand the set theoretic definition of all our data and operations. But the set in question is independent of the way we create it. So, in principle, it’s OK to have our symbols always come out in the order they were defined, as long as we can express that result in set theory. And it is OK if we don’t implement that order with set operations, as long as what we implement produces the desired theory-defined result.

It’s kind of a cop-out, in my opinion. But to get what we want in this situation, we’ll need to think a little harder, and possibly build some more clever set operations.

Bottom line, I’ve used a couple of hours trying a path into the future that turns out not to be a good one. Will some of the steps be useful? I don’t think so. Is there some deeper learning waiting to be discovered upon further reflection? I suspect so, and I suspect it will have something to do with order in sets.

See you next time, hopefully with code we can commit!

Very Late Discovery

There is, in process of being written, a new XImplementation, XTuple. It has the property that its scopes are 1..n, and that it iterates in numeric order. That was probably what I was thinking about when I tried the n_tuple creation method.

I think it is a bit questionable to have a set that iterates in a known order, although an n-tuple really seems like it should do that. It’s just that we can’t always rely on that happening unless we actually force the set into the XTuple implementation. For example: this is a perfectly good n-tuple:

{ c³, a¹, b² }

If we were to create that set, it would not necessarily iterate in numeric order, but if we used the new XTuple in a new_n_tuple constructor, this set would iterate in numeric order:

s = XSet.new_n_tuple( (a, b, c) )

I do not know if that’s good or bad. Must think. See you soon!