More Flattery

Our experimental tests one flat records look good. Let me report on my off-line work, and then let’s see what’s next.

Hello, friends! Happy Valentine’s day to those who celebrate.

Since last I typed, I’ve learned a thing, done a thing, been given an idea, and wondered about a thing.

Learned

I learned that while there are lots of free CSV files out on the internet, there are very few free fixed-field flat files out there. I guess this means we’ll have to build a test file ourselves, if we want one.

Did

I did a thing as well. You may recall that at the end of yesterday’s work, we had a test that built a “symbol table” and used it to extract fixed-length fields from a record. That left us testing for values like 'Jeffries ', which wasn’t as much fun as it might have been. I tweaked the test and code to trim the fields as we extract them:

    def test_unpack(self):
        def field_set(record, symbols):
            result = ((record[start:finish].strip(), name) for name, start, finish in symbols)
            return XSet(result)
        record = 'Jeffries    Ronald      Boss        '
        symbols = (("last", 0, 12), ("first", 12, 24), ("job", 24, 36))
        fields = field_set(record, symbols)
        assert fields.includes("Jeffries", "last")
        assert fields.includes("Ronald", "first")
        assert fields.includes("Boss", "job")

Note the strip() call in the result comprehension. That removes leading and trailing blanks from all the fields. We’ll probably move forward with that pattern, and in general try to work without leading and trailing spaces on our strings. It would be possible to create a string-bearing class of our own that enforces trimming, but I think it would be a pain, at least in testing, as we’d have to wrap all the elements and scopes that are strings.

I think element-wrapping would be possible, might be a good idea, and I think we’ll hold off on that for now. We have bigger learnings to learn.

Given idea

In last night’s Tuesday meeting of the Friday Geeks Night Out Zoom Ensemble, Chet mentioned that an advantage of a flat file is that you can, with just a bit of arithmetic, pick values right out of the middle of it. Chet has observed me working on XST before, so he is aware of some of the tricks, and that may have triggered the thought, or perhaps it came to mind for some other reason. At this writing, we do not have the need to do that, but here’s an imaginary scenario.

Suppose we wanted SELECT LAST, FIRST FROM FLATPEOPLE WHERE JOB EQ "Serf". In our present parlance, we could get that result with something like this:

temp = flatpeople.restrict({serf^job})
result = temp.project({last, first})

But if we knew how to rewrite set expressions (yes, Bill, you’re right of course) we could rewrite that expression into a bunch of scatter-gather file reads, fetching last and first with random access to the records found under ‘serf’ in the index. And that would be cool.

Perhaps, if we’re very lucky, and my interest lasts long enough, and my life lasts long enough, we’ll get to a point with this thing where that is possible. Today, it isn’t.

Wondered

Which brings me to what I have been wondering subsequent to Chet’s observation: what is the set operation that could represent a record in a flat file? I have a theory about the answer. Consider the file as an ordered n-tuple of records:

all = { rec1¹, rec2², …, recnⁿ }

What operation could we do that would give us, say, this set:

some = { rec41¹, rec42², rec43³ }?

In one of his papers, Dave Childs introduced the operations “Re-scope by scope” and “Re-scope by element”. The operation amounts to rename with selection. The Childs notation for re-scope by scope is X^/s/. If I have it right …

some = all^/s/, where s = { 41¹, 42², 43³ }

The two operations “by scope” and “by element” differ only in which of element, scope selects the element from the set and which is the new scope. The slash happens to show which way it goes so that / selects by element and renames by scope. I wonder if Dave did that on purpose.

So the set s above says something like “select item 41, call it 1; item 42, call it 2; item 43, call it 3”

Anyway, point is, I was wondering what set operation might select records from a flat structure, and it looks like re-scope may be one good answer. I don’t know that we’ll need it, but we’ll keep it in our hip pocket.

Plan

It comes down now to what we do next. We have two specialized set forms sort of in flight. We have the partially-completed X_tuple, which I abandoned for reasons I cannot recall, and the more recent XFlat idea, for which we have some tests that show how the set might be created. I think we’ll proceed with XFlat, but I think we may defer the file access part of it in favor of more interesting or more useful notions, perhaps even CSV files, which could conceivably be useful.

The XFlat set is intended to represent a single flat record, from which it can answer the usual questions. We …

Oops: If the XFlat set were to strip its fields to return them … it will have to extend them if we want the result to be another XFlat instance. I didn’t think of that when I put in the strip call.

As I was saying, we … may want to have the project method convert from one long XFlat to another shorter XFlat. The flat format is actually better than our canonical form, in that it only stores the elements. Yes, it has the symbol table, but that will be a single table passed to each XFlat, so it won’t be duplicated the way the scopes are in a canonical XSet.

We’ll cross that boat when we come to it.

Let’s write some tests to drive out XFlat as an XImplementation.

    def test_flat_set_makes_fields(self):
        f1, f2, f3 = XFlat.fields(("last", 12, "first", 10, "job", 20))
        assert f1 == ("last", 0, 12)
        assert f2 == ("first", 12, 22)
        assert f3 == ("job", 22, 42)

The idea is that XFlat makes its symbol table using the class method fields. That’s enough to drive out the class and a class method. I’ll let PyCharm create the class locally, and move it in due time. Very soon, actually.

class XFlat:
    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_definitions = []
        for symbol, length in by_two:
            field_definitions.append((symbol, start, start+length))
            start += length
        return field_definitions

Test runs. I just copy-pasted the method from the function we wrote yesterday. Is there some way to do that loop in a comprehension? This works:

    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_definitions = []
        for symbol, length in by_two:
            field_definitions.append((symbol, start, start := start+length))
        return field_definitions

Note that I embedded an assignment back to start in the append. So I think …

    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_definitions = [(symbol, start, start := start + length) for symbol, length in by_two]
        return field_definitions

That passes. What I really want, I think, is the method I call inject, from my Smalltalk days. I think the cool kids now call it reduce. We’ll look into that: I’ve made a note of it. This is working nicely. Let’s move the XFlat class to /src/ and commit: initial XFlat fields class method.

At this point a question comes up. There is a test in test_x_flat that includes the symbol table creation:

    def test_make_symbol_table(self):
        def make_symbols(names_and_lengths):
            it = iter(names_and_lengths)
            by_two = zip(it, it)
            start = 0
            field_definitions = []
            for symbol, length in by_two:
                field_definitions.append((symbol, start, start+length))
                start += length
            return field_definitions
        info = ("last", 12, "first", 10, "job", 8)
        symbols = make_symbols(info)
        s1, s2, s3 = symbols
        assert s1 == ("last", 0, 12)
        assert s2 == ("first", 12, 22)
        assert s3 == ("job", 22, 30)

That test was there to teach us how to build the table. There’s now a method, indeed a more refined one, in the XFlat set. Should we remove this historical test?

In favor, the test file will be less cluttered. Against, history is interesting, and in removing some test that we think it covered elsewhere, we might actually be removing just a bit of safety. Also against, it’s make-work. We’ll remove the test if and when it actually irritates us.

Thanks for asking.

We can proceed in a few different ways:

Make XFlat inherit XImplememtation and follow our nose to implement tests and methods;
Implement tests and methods incrementally until we can make XFlat inherit without error;
Implement tests against XSet that will require us to solve the problem of creating a set that is not using XFrozen as its implementation.

Approach #2 is more pure than #1. It demands that we think hard about what XFlat needs in order to satisfy the demands of being an XImplementation child. Bah. As Chet Hendrickson puts it, a mind is a terrible thing to waste, so use yours sparingly. We’ll let PyCharm tell us what to do, and help us do it.

As soon as I say:

class XFlat(XImplementation):

PyCharm complains and offers to implement the methods for me. I’ll allow it.

class XFlat(XImplementation):
    def __contains__(self, item):
        pass

    def __iter__(self):
        pass

    def __hash__(self):
        pass

    def __repr__(self):
        pass

Let’s see if we can sneak up on this. We’ll need an __init__, which we’ll drive out like this:

    def test_drive_out_init(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        record = 'Jeffries    Ronald    Wizard              '
        flat = XFlat(fields, record)

Creating that record was a pain and I am not even sure it’s accurate. Counting spaces is error-prone for us humans. No matter. We need:

class XFlat(XImplementation):
    def __init__(self, fields, record):
        self.fields = fields
        self.record = record

In for a penny, let’s commit: init added.

Now for contains:

    def test_contains(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        record = 'Jeffries    Ronald    Wizard              '
        flat = XFlat(fields, record)
        assert ('Jeffries', 'last') in flat

That fails, which I expected, but surprisingly fails with an exception in pytest, because we do not implement __repr__. Amusing.

I think this XFlat is going to be interesting. For now:

    def __repr__(self):
        return f"XFlat('{self.record}')"

Test now fails as expected, not finding the thing.

How are we to do this? We do have this test:

    def test_unpack(self):
        def field_set(record, symbols):
            result = ((record[start:finish].strip(), name) for name, start, finish in symbols)
            return XSet(result)
        record = 'Jeffries    Ronald      Boss        '
        symbols = (("last", 0, 12), ("first", 12, 24), ("job", 24, 36))
        fields = field_set(record, symbols)
        assert fields.includes("Jeffries", "last")
        assert fields.includes("Ronald", "first")
        assert fields.includes("Boss", "job")

Interesting, but not quite what I have in mind¹. Let’s just code up __contains__. We’ll look for the scope and if we find it, extract the field value and trim it.

    def __contains__(self, item):
        element, scope = item
        for symbol, start, end in self.fields:
            if symbol == scope:
                field = self.record[start:end].strip()
                return field == element
        return False

We loop through the symbols, and if we find the scope we’re looking for, we fetch the field and see if it equals the element we’re looking for. If we exhaust the symbol table, we return False.

Test passes. Extend it a bit, for fun.

    def test_contains(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        record = 'Jeffries    Ronald    Wizard              '
        flat = XFlat(fields, record)
        assert ('Jeffries', 'last') in flat
        assert ('Ron', 'first') not in flat
        assert ('Wizard', 'job') in flat

Passes. Commit: contains.

We’re left with __hash__ and __iter__.

I think __iter__ will be easy enough. But what about __hash__?

This is an area where my understanding is weak. I mean, yeah, we all know roughly what happens. Items that are the same are supposed to generate equal hashes and if two items generate different hashes, they are not supposed to be equal (and will, in at least some cases, be treated as unequal).

I need to study a bit and I think we’ll want to work up some tests that verify that our hashing is OK. Note made.

Let’s move on to drive out the __iter__.

    def test_iteration(self):
        fields = XFlat.fields(("last", 12, "first", 10, "job", 20))
        record = 'Jeffries    Ronald    Wizard              '
        flat = XFlat(fields, record)
        elements = []
        scopes = []
        for e,s in flat:
            elements.append(e)
            scopes.append(s)
        assert elements == ['Jeffries', 'Ronald', 'Wizard']
        assert scopes == ['last', 'first', 'job']

Kind of grungy, but should do the job. Fails saying no iterator. We knew that.

    def __iter__(self):
        return ((self.record[start:end].strip(), symbol) for symbol, start, end in self.fields)

Right, that’s all there is to it. Test passes. Commit: __iter__.

As for hash, I implement this for now:

    def __hash__(self):
        return -1

Commit: hash implemented as -1.

Let’s take a quick look at XFlat, and then sum up. I can tell that I’m in need of a break.

class XFlat(XImplementation):
    def __init__(self, fields, record):
        self.fields = fields
        self.record = record

    def __contains__(self, item):
        element, scope = item
        for symbol, start, end in self.fields:
            if symbol == scope:
                field = self.record[start:end].strip()
                return field == element
        return False

    def __iter__(self):
        return ((self.record[start:end].strip(), symbol) for symbol, start, end in self.fields)

    def __hash__(self):
        return -1

    def __repr__(self):
        return f"XFlat('{self.record}')"

    @classmethod
    def fields(cls, names_and_lengths):
        it = iter(names_and_lengths)
        by_two = zip(it, it)
        start = 0
        field_definitions = [(symbol, start, start := start + length) for symbol, length in by_two]
        return field_definitions

Summary

I think this XFlat record “works”, in that it can check for containment and can iterate its elements and scopes. There are issues:

I am concerned overall about our use of hashing. This could be a major issue if we ever create two things that we consider equal and they are found not to be.
The __contains__ method can probably be improved, perhaps just by using __iter__. It’s OK as is.
The symbol table probably would like to be better structured. In particular, a dictionary by field name seems a likely candidate.
Here and elsewhere, we create a lot of 2-tuples on the fly, knowing that they are the essential bottom-level structure that give a set its element-scope pairs. It would be easy to make a mistake in coding these. In fact, before I completed __iter__, I had the tuple in the order scope, element rather than element, scope. It is possible that we should go back to a purpose-built class of our own to represent this object. Our objections to it were two: it was more of a pain to create things, and the printouts were hard to read with all the Atom Atom Atom going on.
I have a general sense of uneasiness about the XFlat implementation, as the first alternative set data structure. Will it interact properly with other sets? I have no basis, beyond the hashing concern. We have a very narrow interface in XImplementation and our tests seem solid. Yet I am uneasy. I’ll try to isolate the concerns and create tests for them.

All that said, I am confident at the p > 0.75 level that the XFlat works properly.

What we need next, I guess, would be a set that holds a compact array of flat records. I have been supposing that such a set would iterate by producing one flat record after another. I wonder, though, whether that’s the way to do it.

I am speculating about taking a slightly different angle from the XFlat, a sort of XFlatSet that covers a buffer of records from an object that buffers records. The XFlatSet can iterate through its buffer. I guess it might still produce an XFlat, but sooner or later, someone is going to produce the tuples that we actually use in our set operations …

Too many loose ends. I think we’ll build something and then see what it tells us. Meanwhile I will surely reflect on the next steps and imagine some optional approaches, from which we’ll choose one.

We move forward. From this improved vantage point, I can see a little way further, but I do not see as far as I might, as clearly as I might. There’s mist out there. Things will be more clear as we approach. Waiting is.

See you next time!

I am reminded of something my brother Dick, PBUH, said in regard to something or other: “It is like green shoes, interesting but not sought after”. ↩