FAFO on GitHub

I want to work on some indexing operations. Along the way I rediscover XFlatFile’s existing scope set feature. Just what we need. Or is it? Includes a judicious rollback.

Hello, friends!

Remember the method re_scope? Given a set of pairs

r = { …, old_namenew_name, …},

it will rename any elements under scope old_name to be under scope new_name instead. It only returns elements specifically given an old_name-new_name pair. Think about what that method would do to a very large n-tuple. If we were to re-scope with { 12341 }, the resulting set would have one member, under scope 1, and it would be the member that had been at scope 1234.

So, among the things we can do with re_scope is random access to a large set … given that we implement re-scope carefully on that set. (If we do not, it will search the set, since the base implementation is a search.)

We can create a rudimentary but useful sort of index using this method. We’ll work up to it slowly.

First, let’s arrange for XFlatFile to support a better version of re_scope. Starting in XSet, we have:

class XSet:
    def re_scope(self, other) -> Self:
        new_tuples = []
        for e, s in self:
            for old, new in other:
                if old == s:
                    new_tuples.append((e,  new))
        return XSet.from_tuples(new_tuples)

That’s our basic search. Let’s set up our standard construction to allow an implementation to handle a set operation. First let’s extract a method from re_scope:

    def re_scope(self, other) -> Self:
        return self.generic_re_scope(other)

    def generic_re_scope(self, other):
        new_tuples = []
        for e, s in self:
            for old, new in other:
                if old == s:
                    new_tuples.append((e, new))
        return XSet.from_tuples(new_tuples)

We are green. We can commit. Sure, why not, takes no time: refactoring toward re scope in flat files.

Now put in our standard try / except:

    def re_scope(self, other) -> Self:
        try:
            return self.implementation.re_scope(other)
        except AttributeError:
            return self.generic_re_scope(other)

Green. Commit: add try/except for re_scope, to allow implementation handling.

Now we have everything committed, so we have a stable platform to address the re-scoping operation in XFlatFile.

Let’s review that class. Here are the interesting bits:

class XFlatFile(XImplementation):
    def __init__(self, file_path, fields, scope_set=None):
        self.file_path = file_path
        self.full_file_path = expanduser(file_path)
        self.fields = fields
        field_def = self.fields[-1]
        self.record_length = field_def[-1]
        self.scope_set = scope_set

    def __contains__(self, item):
        de, ds = item
        return de == self.element_at(ds)

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1

        it = iter(self.scope_set) if self.scope_set else lots()
        for _e, scope in it:
            rec = self.element_at(scope)
            if rec is None:
                return
            yield rec, scope

    def __len__(self):
        file_length = stat(self.full_file_path).st_size
        return int(file_length / self.record_length)

    def __repr__(self):
        return f'XFlatFile({self.file_path})'

    def get_record(self, index):
        seek_address = index*self.record_length
        with open(self.full_file_path, "r") as f:
            f.seek(seek_address)
            rec = f.read(self.record_length)
        return rec

    def element_at(self, scope):
        if not isinstance(scope, int) or scope < 1:
            return None
        rec = self.get_record(scope - 1)
        if rec == '':
            return None
        return XSet(XFlat(self.fields, rec))

I had forgotten about the optional scope_set in the __init__. Where do we use that? Anywhere in prod? Certainly not: we don’t use it in prod at all. There is a test for the set:

    def test_uses_scope_set(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        scopes = XSet.from_tuples(((1, 1), (2,2)))
        ff = XFlatFile(path, fields, scopes)
        ff_set = XSet(ff)
        count = 0
        for _e, _s in ff_set:
            count += 1
        assert count == 2

The test is counting the records, not using len or cardinality, and that helps me notice an issue. Let’s make that test fail:

    def test_uses_scope_set(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        scopes = XSet.from_tuples(((1, 1), (2,2)))
        ff = XFlatFile(path, fields, scopes)
        ff_set = XSet(ff)
        count = 0
        for _e, _s in ff_set:
            count += 1
        assert count == 2
        assert len(ff_set) == 2

That will fail, saying 1000, because we compute the XFlatFile len by dividing down the file size. Sure enough:

Expected :2
Actual   :1000

Fix __len__:

    def __len__(self):
        if self.scope_set is not None:
            return len(self.scope_set)
        file_length = stat(self.full_file_path).st_size
        return int(file_length / self.record_length)

Test passes. I think this is a bit iffy, in that the scope set could be totally weird, in principle. Make a note of that. Commit: XFlatFile with scope set returns length of scope set as cardinality.

Now let’s do the re_scope method. Let’s have a test for it:

    def test_re_scope(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        r100 = ff.element_at(100)
        r900 = ff.element_at(900)
        ff_set = XSet(ff)
        assert ff_set.includes(r100, 100)
        assert ff_set.includes(r900, 900)
        scopes = XSet.from_tuples(((100, 1), (900, 2)))
        re_scoped = ff_set.re_scope(scopes)
        assert len(re_scoped) == 2
        assert re_scoped.includes(r100, 1)
        assert re_scoped.includes(r900, 2)

Here we just fetch record 100 and 900 from the flat file and then rescope, assuring ourselves that we get the two records, with new scopes, is the result. At this point, the re_scope is still iterating the set, because we haven’t implemented a specialized one. Let’s commit the test: new test for re_scope in flat file.

Now let’s build re_scope in XFlatFile.

Let’s ensure that the re_scoping set refers only to records we have. Do we need to require the new scopes to be integers? We haven’t tested anything else. No, let’s hold off on the verification and just assume the re_scoping_set is valid. Shortest path to working.

    def re_scope(self, re_scoping_set):
        new_impl = self.__class__(self.full_file_path, self.fields, re_scoping_set)
        return  XSet(new_impl)

Hm, I really expected that to work. Oh my! The __iter__ is wrong:

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1

        it = iter(self.scope_set) if self.scope_set else lots()
        for _e, scope in it:
            rec = self.element_at(scope)
            if rec is None:
                return
            yield rec, scope

We are assuming here that we should fetch the element at scope, not the element component. There must be a mistaken test.

    def test_uses_scope_set(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        scopes = XSet.from_tuples(((1, 1), (2, 2)))
        ff = XFlatFile(path, fields, scopes)
        ff_set = XSet(ff)
        count = 0
        for _e, _s in ff_set:
            count += 1
        assert count == 2
        assert len(ff_set) == 2

Sure enough. I used a set that has the same scopes and elements. Also the test isn’t terribly robust anyway. Fix the __iter__ and we should be green.

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1

        it = iter(self.scope_set) if self.scope_set else lots()
        for old_scope, new_scope in it:
            rec = self.element_at(old_scope)
            if rec is None:
                return
            yield rec, new_scope

We are still not green. Tempted to roll back but I want more info. Add some prints.

XSet(XFlat('jeffries    sam         architect      12000')) 100
XSet(XFlat('iam         sam         architect      12000')) 900
XSet(XFlat('iam         sam         architect      12000')) 2
XSet(XFlat('jeffries    sam         architect      12000')) 1

The first two there are the records found at 100 and 900. The second two are the two records from the output set. The test failure is:

>       assert re_scoped.includes(r100, 1)
E       AssertionError: assert False
E        +  where False = <bound method XSet.includes of XSet(XFlatFile(/Users/ron/Desktop/job_db))>(XSet(XFlat('jeffries    sam         architect      12000')), 1)
E        +    where <bound method XSet.includes of XSet(XFlatFile(/Users/ron/Desktop/job_db))> = XSet(XFlatFile(/Users/ron/Desktop/job_db)).includes

I should use more random numbers than 100 and 900 but the two desired records are clearly there. Is there an issue with includes? I’ll wager there is: it does not honor scope set. Let’s see:

class XFlatFile:
    def __contains__(self, item):
        de, ds = item
        return de == self.element_at(ds)

Yeah, that’s not going to work if we have a scope_set.

What happens if I comment it out? Things will slow down but should work.

When I do that, I discover that there is a pervasive mistake in my code. Or misunderstanding. I’m going to roll back and address this again. Too many balls in the air. All fall down.

Rollback!!

One issue is in my abstract base class, XImplementation. It defaults all the required methods to `raise NotImplemented. That is not how you do it. NotImplemented is not an exception. They should raise NotImplementedError.

Let’s fix that. Green. Commit: abstract methods fixed to raise NotImplementedError.

I noticed when I removed the __contains__ that the main includes method looks like this:

    def includes(self, element, scope) -> bool:
        if scope is None:
            scope = self.null
        try:
            return (element, scope) in self.implementation
        except NotImplemented:
            return any(e == element and s == scope for e, s in self)

That, too, should be NotImplementedError. Fix. Green. Commit: check NotImplementedError exception in includes.

We also found that bug in the use of the scope set. Was the fix committed? No, do again:

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1

        it = iter(self.scope_set) if self.scope_set else lots()
        for old_scope, new_scope in it:
            rec = self.element_at(old_scope)
            if rec is None:
                return
            yield rec, new_scope

I’m going to commit this, because I know it’s right, but we need a stronger test than the one we have. Commit: fix iter to read scope_set element, output scope-set scope.

Here’s the test for the use of scope-set in XFlatFile.

    def test_uses_scope_set(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        scopes = XSet.from_tuples(((1, 1), (2,2)))
        ff = XFlatFile(path, fields, scopes)
        ff_set = XSet(ff)
        count = 0
        for _e, _s in ff_set:
            count += 1
        assert count == 2
        assert len(ff_set) == 2

Let’s read something other than records 1 and 2:

    def test_uses_scope_set(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        scopes = XSet.from_tuples(((107, 1), (932, 2)))
        ff = XFlatFile(path, fields, scopes)
        ff_set = XSet(ff)
        e, s = ff_set.select(lambda e, s: s == 1).pop()
        assert e.includes('amy', 'first')
        e, s = ff_set.select(lambda e, s: s == 2).pop()
        assert e.includes('janet', 'first')
        assert len(ff_set) == 2

I worked out the values by peeking.

Reflection

Tests like this are difficult to write. I seem to have to run them, print results, then fill in the blanks. Kind of an approval test but I didn’t really go look at those two records. The file doesn’t have newlines in the records, which isn’t helping.

Another issue is that we don’t have a good way of specifying that a set should contain a record matching x criteria at scope y. I’ll try to think about that.

Commit that test: improve XFlatFile scope-set test.

Finally

I think we’re ready for our custom-made re_scope:

class XSet:
    def re_scope(self, other) -> Self:
        try:
            return self.implementation.re_scope(other)
        except AttributeError:
            return self.generic_re_scope(other)

And, again:

    def re_scope(self, re_scoping_set):
        new_impl = self.__class__(self.full_file_path, self.fields, re_scoping_set)
        return XSet(new_impl)

And my test fails! I am quite surprised. Oh, right, the __contains__ issue. Remove __contains__ and we should run green. Yes. So the scoping works. But contains does not honor scope_set if present:

class XFlatFile(XImplementation):
    def __contains__(self, item):
        if self.scope_set is not None:
            raise NotImplementedError
        de, ds = item
        return de == self.element_at(ds)

That passes, because our basic contains fields that exception and does it all the hard way. We ca commit: XFlatFile implements re_scope by returning a scoped XFlatFile, not actual records.

Let’s reflect, this has been intense.

Reflection

Discovering that mistake with NotImplemented, plus the one in __contains__, plus whatever else was going on, has kind of left me a bit uncertain. I’m not sure whether we have dotted all the t’s and crossed all the i’s that we need to. Certainly we will want to improve __contains__ in XFlatFile: I’ve made a note of that.

It’s odd that I had forgotten that XFlatFile was already (almost) supporting the scope set. Somewhere I must have started that and then gotten derailed. Anyway, it was certainly one of the things we needed, or at least I think it was. We did use it. But I wonder whether a different approach might have been better.

Sub-Reflection on Alternatives

Before I stumbled onto the existing scope set in XFlatFile, I was thinking of a somewhat different alternative, a ViewSet, something along the lines of a general type of XImplementation or XSet, not clear which, that contains a source set and a re_scoping set. The source set would typically be an n-tuple of some kind, so that the ViewSet would, during its iteration, return random records from the set it views.

We have sort of done that, but only for XFlatFile instead of a general ViewSet.

With a general ViewSet notion, I think we would be opening a can of worms, but I’m not sure whether they are tasty, like shrimp, or yucchy, like whatever kind of worm you were thinking of. As things stand, there is just one kind of XSet and hidden inside it there are various kinds of XImplementations, all of which respond to some fundamental methods, and which may if they choose respond to other set methods.

The current scheme doesn’t do anything recursive or otherwise fancy. Except that XFlatFile, an XImplementation, has an XSet in it, which it can and does iterate. This makes me wonder whether that is a legitimate thing to do—by which I mean whether its a good idea—or whether we should disallow it, or whether we should throw caution to the winds and somehow combine the XSet and XImlementation hierarchies.

That way lies either great power, or great confusion. Why not both? We must think about this.

What else? We have not validated the re-scoping set. Note made.

I suspect the behavior of XFlatFile will be weird if it is asked to return something outside its range. It should probably return a null set. Note made.

All this work is in aid of providing some indexing operations. What we have here is part of that: a re_scoped XFlatFile implementation provides a set based on the same flat file as the original set, but it only sees the records indicated by the scope set. So it is a “view” of the original set, containing only the requested records. So … if we were to do, say, a restrict operations finding all the serfs, and save, not the records but the scopes, making a new re-scoped set, the next time we need that selection, we have the work done and can just return the same view. And there is little doubt in my mind that a view of a view would work just fine.

We will explore all that as we go forward. For now, we have a bare-bones implementation of a rather powerful capability. We can re-scope a flat file so produce a view of that file that returns only the records selected by the re-scoping operation. We can return them with their old record numbers, or with any other numbering we might prefer. In fact, we could return them with string names. Or sets, but let’s not do that.

This is a good step forward. We have a view of a large set. Nice.

See you next time!