FAFO on GitHub

Thinking about rename in the Flat implementation leads to discovery of an interesting defect, and some thinking about use of the powerful generality of set theory.

Hello, again, friends!

We have a function that will rename all the elements of a set. It’s nifty, in that it uses a clever series of set operations to do the job. But there are issues, topics, notions:

  1. What if we have a set of sets and want to rename things in the inner set?
  2. What if the set of sets is large?
  3. Isn’t there some way to do this better with custom code?

What we have before us, friends, is some questions about design, the best way, or better way, to do things.

Recall that our re-scoping set is a set of old-name, new-name pairs, with old as the element and new as the scope. There are some interesting things that could occur but that are unlikely, involving duplicates one way or the other. Consider this set:

{ ax, ay }

It would take all the things named a, and put two copies into the output, one named x and one named y.

    def test_rename_duplicates(self):
        new_names = XSet.from_tuples((('a','x'), ('a', 'y')))
        to_rename = XSet.from_tuples((('hello', 'a'), ('hi', 'b')))
        as_renamed = to_rename.rename(new_names)
        assert as_renamed == XSet.from_tuples((('hello', 'x'), ('hello', 'y'), ('hi', 'b')))

THis test shows the effect. It runs green.

We can go the other way as well. Extend the test:

    def test_rename_duplicates(self):
        new_names = XSet.from_tuples((('a','x'), ('a', 'y')))
        to_rename = XSet.from_tuples((('hello', 'a'), ('hi', 'b')))
        as_renamed = to_rename.rename(new_names)
        assert as_renamed == XSet.from_tuples((('hello', 'x'), ('hello', 'y'), ('hi', 'b')))
        back_to_old = XSet.from_tuples((('x', 'a'), ('y', 'a')))
        original = as_renamed.rename(back_to_old)
        assert original == to_rename

We could produce that undoing set by other means, and I think I mentioned that there are two re-scoping operations possible, one that names from element to scope and the other from scope to element. We’re trying to make do with just one.

But my point, and I do have one, is that given the current definition of rename and re-scope, the operation can produce more or fewer elements than in the original set. This makes name-mapping both powerful, on the one hand, and tricky to optimize, on the other.

Design Issue - Generality

Our set operations can build some very complicated set structures even with the few operations we presently have. While there can be some advantages to a clever structure, it’s all too easy to build something we can’t understand. It’s a bit like code in that regard.

Does that mean that we should write more specialized operations that cannot, say, add a copy of a field, or remove a field, as re-scope or rename can? I think not. But I do think that we should use the generality sparingly in our programs, keeping the structures that we actually build as understandable as we can … probably hewing pretty close to fixed-structure records and the like.

How flat are they?

But what does that mean, for example, with something like our XFlat and XFlatFile implementations?

As the symbol table is presently defined, each field follows directly after the preceding one, so that the record is a fixed-length string. But, that aside, there’s nothing that would prevent there being two overlapping definitions in the symbol table, maybe name = last_name first_name. And there’s certainly nothing preventing the same name from turning up multiple times: we don’t check for that, so you could have six fields called pay if you wanted to.

And, of course, if they all came up with the same value, at least some set operations would collapse them.

I think we need to try that. And I get a result that surprises me greatly:

    def test_multiple_same_name(self):
        fields = XFlat.fields(('pay', 4, 'pay', 4))
        record = 'abcdefgh'
        flat = XFlat(fields, record)
        flat_set = XSet(flat)
        for e,s, in flat_set:
            print(e, s)
        assert flat_set.includes('abcd', 'pay')
        assert flat_set.includes('efgh', 'pay')

That test fails, but it prints both values from the set.

abcd pay
efgh pay

Is there a bug in includes?? This test passes:

    def test_duplicate_scope(self):
        dups = XSet.from_tuples((('abcd', 'pay'), ('efgh', 'pay')))
        assert dups.includes('abcd', 'pay')
        assert dups.includes('efgh', 'pay')

So the basic includes is good. I feel sure that if it were not, something bad would have happened. How is this happening? XSet includes is this:

class XSet:
    def includes(self, element, scope) -> bool:
        if scope is None:
            scope = self.null
        try:
            return (element, scope) in self.implementation
        except NotImplemented:
            return any(e == element and s == scope for e, s in self)

So does XFlat implement __contains__? It does and it explains the issue:

    def __contains__(self, item):
        element, scope = item
        for symbol, start, end in self.fields:
            if symbol == scope:
                field = self.record[start:end].strip()
                return field == element
        return False

As soon as we find the desired scope, we check the element for equality and return. If there is more than one element of that name, we’ll return false. We can fix it:

    def __contains__(self, item):
        element, scope = item
        for symbol, start, end in self.fields:
            if symbol == scope:
                field = self.record[start:end].strip()
                if field == element:
                    return True
        return False

Our test runs correctly. Commit: fix issue prohibiting duplicate field names in XFlat.

Reflection

So this thinking was useful. When we wrote XFlat, I was thinking that there’d be all these fields, each with a unique name. That’s what you’d do with a flat file, typically, but there’s no real reason why you would have to do that. You could define that a flat file has unique field names, and if so, it would be good to enforce it. Arguably a likely bug is to accidentally duplicate a field definition, so the rule could be a good one.

If we constrain the symbol table to have no duplicate names, and no overlapping fields, we have your basic flat file format. But it’s “easy to see” that we could have a different set of rules, allowing symbol tables with duplicate names and/or overlapping fields.

I think we’re better off with the __contains__ dealing with the duplicates as it does, but, depending on what we are trying to model with our flat sets, we may want to revisit the creation of the symbol table.

Summary

I think there is a lot of design that comes into play whenever we build a very general set of capabilities. Here, we have encountered a few interesting issues:

  1. Our XFlat implementation allows something I didn’t think of, duplicate field names.
  2. Our implementation of __contains__ did not properly handle said duplicate names.
  3. We modified __contains__ to handle the duplicates.
  4. But we may want to disallow them, providing a more limited case for the flat implementations.

We could, in principle, use XST for a very wide range of applications. In practice, I think it has mostly been used to build relational systems, but I know of exceptions. As a guideline, we might want to limit the external behavior of an XST-based system to a less general model, while using more of its power inside, to make good things happen. I suspect we do not want to expose regular users to extended set theory.

Here, we are experimenting and learning though we may turn our attention to some kind of application as we go forward. So we should expect to learn little things like this, and, quite probably, I’ll surprise myself again with some result that isn’t quite what I thought I had done.

All part of learning, in all the work we do. See you next time!