Any Idea?

FAFO on GitHub

I was whining earlier about the difficulty of writing tests that assert about elements inside a result set. A thought has come to me.

Take this test, for example:

    def test_uses_scope_set(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        scopes = XSet.from_tuples(((107, 1), (932, 2)))
        ff = XFlatFile(path, fields, scopes)
        ff_set = XSet(ff)
        e, s = ff_set.select(lambda e, s: s == 1).pop()
        assert e.includes('amy', 'first')
        e, s = ff_set.select(lambda e, s: s == 2).pop()
        assert e.includes('janet', 'first')
        assert len(ff_set) == 2

I don’t know much about the two elements that it selects, but I do know that those elements have first names ‘amy’ and ‘janet’ respectively. It occurs to me that I could say this:

    def test_uses_scope_set(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        scopes = XSet.from_tuples(((107, 1), (932, 2)))
        ff = XFlatFile(path, fields, scopes)
        ff_set = XSet(ff)
        assert len(ff_set) == 2
        assert any(s == 1 and e.includes('amy', 'first')
            for e, s in ff_set)
        assert any(s == 2 and e.includes('janet', 'first')
            for e, s in ff_set)

That’s certainly a bit more clear and a bit more convenient. We could also say this:

        assert all(e.includes('amy', 'first')
            for e, s in ff_set if s == 1)
        assert all(e.includes('janet', 'first')
            for e, s in ff_set if s == 2)

These are still a bit verbose but either one is a bit better than what we had. Neither formulation is ideal:

In the any formulation, there could be any number of other elements with scope of one or name of amy. In the all formulation, there could be any number of ‘amy’ records, just so long as their scopes are not 1. So the tests still aren’t as expressive and robust as we might like.
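
To make those weaknesses concrete, here is a small sketch using plain (element, scope) tuples in place of an XSet. The dict elements and the names `records`, `any_amy_at_1`, and so on are made up for illustration; nothing here touches the real XSet or XFlatFile code. It also shows one further wrinkle: `all` over an empty selection is true, so the `all` form passes when the scope is missing entirely (the `len` assertion in the test guards against some of that, but not all of it).

```python
# Plain (element, scope) pairs stand in for the members of an XSet here;
# the dict elements and their values are made up for illustration.
records = [
    ({'first': 'amy'}, 1),
    ({'first': 'bob'}, 1),   # a second element at scope 1
    ({'first': 'amy'}, 3),   # a second 'amy' at another scope
]

# `any` passes even though scope 1 also holds 'bob'.
any_amy_at_1 = any(s == 1 and e['first'] == 'amy' for e, s in records)

# `all` over an empty selection is True, so this passes with no scope 2 at all.
all_janet_at_2 = all(e['first'] == 'janet' for e, s in records if s == 2)

# `all` does catch the extra element at scope 1.
all_amy_at_1 = all(e['first'] == 'amy' for e, s in records if s == 1)
```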

I’m not entirely happy with those, but they are perhaps better than what I had. What I think I’d like to say is something like “this set includes exactly one record at scope s and it has first == amy”, and I’d like to say it without a lot of rigmarole.
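
One possible shape for that assertion, as a rough sketch: a helper that collects the elements at a given scope and insists there is exactly one. The name `has_exactly_one` is hypothetical, and the example uses plain tuples with dict elements rather than real records. Since an XSet iterates as (element, scope) pairs in the tests above, something like this might work on one directly, but that is an assumption.

```python
def has_exactly_one(pairs, scope, predicate):
    """True if exactly one (element, scope) pair has the given scope
    and that one element satisfies the predicate."""
    at_scope = [e for e, s in pairs if s == scope]
    return len(at_scope) == 1 and predicate(at_scope[0])


# Made-up stand-in data: (element, scope) tuples with dict elements.
good = [({'first': 'amy'}, 1), ({'first': 'janet'}, 2)]
crowded = good + [({'first': 'bob'}, 1)]


def is_amy(element):
    return element['first'] == 'amy'


good_result = has_exactly_one(good, 1, is_amy)        # one record at scope 1
crowded_result = has_exactly_one(crowded, 1, is_amy)  # two records at scope 1
missing_result = has_exactly_one(good, 7, is_amy)     # no record at scope 7
```

In a test this would read something like `assert has_exactly_one(ff_set, 1, lambda e: e.includes('amy', 'first'))`, which is closer to what I want to say, though still not free of rigmarole.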

We’ll think about that. Let’s move on to issues from this morning.

What about __len__ if XFlatFile scope_set is weird?
Improve __contains__ in XFlatFile
Validate scope_set in XFlatFile re_scope
XFlatFile out of range should return … null_set?

Let’s move to validate the scope set. Why? Because we know that for an XFlatFile, the only valid elements of a re-scoping set are integers. No other elements can possibly return a record.

No, wait. We know that’s true for an XFlatFile with its original scope_set of None. But could we rescope the records to strings? Why not? Let’s try a test:

    def test_scope_set_to_string(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        r100 = ff.element_at(100)
        r900 = ff.element_at(900)
        ff_set = XSet(ff)
        scopes = XSet.from_tuples(((100, "fred"), (900, "ethel")))
        re_scoped = ff_set.re_scope(scopes)
        assert len(re_scoped) == 2
        assert re_scoped.includes(r100, "fred")
        assert re_scoped.includes(r900, "ethel")

The test passes. But what happens if we re-scope it again? Well, it doesn’t work, and worse than that, it won’t even work with integer scopes. Why? Because when we create our new XFlatFile, we just replace the existing scopes with our new scope set. So with the string ones, we’ll be looking for record number ‘fred’ and never find it.

Can it be made to work? What if we transitively apply our incoming scope set to the scope set of the subject set?

What do I even mean by that?

In our test above, when we re-scope 100 to fred and 900 to ethel, the resulting set will iterate thru the scope set, fetch record 100, and return that with scope fred, and so on. That works the first time, because we have 100->fred, which is correct. So in the second step, fred->frank, we want to transform that old-new element into 100->frank, and ethel->premium into 900->premium.

We want some operator X so that

{100^fred, 900^ethel} X {fred^frank, ethel^premium}

yields

{100^frank, 900^premium}

Nothing to it, he said optimistically. We want a new scope set with no more elements than the incoming one (the fred->frank one). Each item in it takes its scope from an incoming item, and its element from the item in the original scope set whose scope equals that incoming item’s element. I think we’ll find that that is exactly re-scope.
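
Read as plain Python over sets of (element, scope) tuples, the operator X might be sketched like this. The function name `compose_scopes` is made up, and this is not XSet's actual `re_scope` implementation, just the same idea on ordinary sets.

```python
def compose_scopes(original, incoming):
    """Return (e, new) for every (e, old) in original and (old, new) in incoming.

    Incoming items whose element matches no scope in the original simply
    drop out, so the result is never larger than a one-to-one incoming set.
    """
    return {(e, new)
            for e, old in original
            for key, new in incoming
            if key == old}


original = {(100, 'fred'), (900, 'ethel')}
incoming = {('fred', 'frank'), ('ethel', 'premium'), ('lucy', 'unused')}
net = compose_scopes(original, incoming)
# ('lucy', 'unused') matches nothing in the original, so it disappears.
```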

Let’s write a test:

    def test_double_re_scope(self):
        scopes = XSet.from_tuples(((100, "fred"), (900, "ethel")))
        new_scopes = XSet.from_tuples((('fred', 'frank'), ('ethel', 'premium')))
        net_scopes = scopes.re_scope(new_scopes)
        expected = XSet.from_tuples(((100, 'frank'), (900, 'premium')))
        assert net_scopes == expected

That passes. Therefore:

    def re_scope(self, re_scoping_set):
        if self.scope_set is not None:
            re_scoping_set = self.scope_set.re_scope(re_scoping_set)
        new_impl = self.__class__(self.full_file_path, self.fields, re_scoping_set)
        return XSet(new_impl)

And this test runs:

    def test_scope_set_to_string(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        r100 = ff.element_at(100)
        r900 = ff.element_at(900)
        ff_set = XSet(ff)
        scopes = XSet.from_tuples(((100, "fred"), (900, "ethel")))
        re_scoped = ff_set.re_scope(scopes)
        assert len(re_scoped) == 2
        assert re_scoped.includes(r100, "fred")
        assert re_scoped.includes(r900, "ethel")
        new_scopes = XSet.from_tuples((('fred', 'frank'), ('ethel', 'premium')))
        re_re_scoped = re_scoped.re_scope(new_scopes)
        assert re_re_scoped.includes(r100, "frank")
        assert re_re_scoped.includes(r900, "premium")
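
For anyone without the job_db file handy, the same stacking behavior can be seen in a toy in-memory stand-in. Everything in this sketch is hypothetical: `ScopedRecords`, `compose`, and `pairs` are invented names, the records are just strings, and record numbers are assumed to start at 1. It is not XFlatFile, only the shape of the idea.

```python
def compose(original, incoming):
    """Pairs (e, new) where (e, old) is in original and (old, new) is in incoming."""
    return {(e, new)
            for e, old in original
            for key, new in incoming
            if key == old}


class ScopedRecords:
    """Toy in-memory stand-in for a scoped flat file (not the real XFlatFile)."""

    def __init__(self, records, scope_set=None):
        self.records = records      # record number n is records[n - 1] (assumed 1-based)
        self.scope_set = scope_set  # set of (record_number, scope) pairs, or None

    def re_scope(self, re_scoping_set):
        # Fold the incoming scopes through any existing ones, so the stored
        # scope set always maps real record numbers to the latest scopes.
        if self.scope_set is not None:
            re_scoping_set = compose(self.scope_set, re_scoping_set)
        return ScopedRecords(self.records, re_scoping_set)

    def pairs(self):
        """Return the set of (record, scope) pairs this view contains."""
        if self.scope_set is None:
            return {(rec, n) for n, rec in enumerate(self.records, start=1)}
        return {(self.records[n - 1], scope) for n, scope in self.scope_set}


base = ScopedRecords(['amy', 'janet', 'sam'])
once = base.re_scope({(1, 'fred'), (3, 'ethel')})
twice = once.re_scope({('fred', 'frank'), ('ethel', 'premium')})
```

The point is that `twice` still holds real record numbers in its scope set, so it never goes looking for record number ‘fred’.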

One more issue. I think this test will crash:

    def test_non_integer_re_scope(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ff_set = XSet(ff)
        scopes = XSet.from_tuples((("hello", "fred"), (13.5, "ethel")))
        re_scoped = ff_set.re_scope(scopes)
        assert re_scoped == XSet.null
        assert len(re_scoped) == 0

It doesn’t crash. But it does think the length of the result is 2, even though it is null. Why? Because it has two useless elements in its scope set and we compute len using the length of the scope set. So we must validate the incoming scope set.

It takes me a bit of thrashing but I get this:

    class XFlatFile:
        def re_scope(self, re_scoping_set):
            if self.scope_set is not None:
                re_scoping_set = self.scope_set.re_scope(re_scoping_set)
            re_scoping_set = self.validate_scope_set(re_scoping_set)
            if len(re_scoping_set) == 0:
                return XSet.null
            new_impl = self.__class__(self.full_file_path, self.fields, re_scoping_set)
            return XSet(new_impl)

        def validate_scope_set(self, re_scoping_set):
            return re_scoping_set.select(lambda e, s: type(e) is int)

This is still not ideal, because we would really like to ensure that all the incoming integers are greater than zero and no greater than the number of records in the file. Here’s __len__:

    def __len__(self):
        if self.scope_set is not None:
            return len(self.scope_set)
        file_length = stat(self.full_file_path).st_size
        return int(file_length / self.record_length)

Extract method:

    def __len__(self):
        if self.scope_set is not None:
            return len(self.scope_set)
        return self.file_length_in_records()
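
The extracted method itself isn't shown above. As a free-standing sketch of what it presumably computes, assuming fixed-length records and using floor division in place of `int(a / b)`, it might look like this. The demo file and the 44-byte record length (12 + 12 + 12 + 8, from the field widths in the tests) are assumptions for illustration.

```python
import os
import tempfile


def file_length_in_records(full_file_path, record_length):
    """Count the whole fixed-length records in a file, ignoring any partial tail."""
    return os.stat(full_file_path).st_size // record_length


# Demo on a throwaway file: three 44-byte records (12 + 12 + 12 + 8).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b'x' * (44 * 3))
    demo_path = handle.name

record_count = file_length_in_records(demo_path, 44)
os.remove(demo_path)
```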

And in validate:

    def validate_scope_set(self, re_scoping_set):
        maximum = self.file_length_in_records()
        return re_scoping_set.select(lambda e, s: type(e) is int and 0 < e <= maximum)
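
The same filter on plain tuples, as a sketch, shows what gets through. The function name `valid_record_scopes` and the maximum of 1000 records are made up for the example. One detail worth noticing: `type(e) is int` also rejects `True` and `False`, which `isinstance(e, int)` would let in, since `bool` is a subclass of `int`.

```python
def valid_record_scopes(pairs, maximum):
    """Keep only pairs whose element is a true int in the range 1..maximum."""
    return {(e, s) for e, s in pairs if type(e) is int and 0 < e <= maximum}


candidates = {
    ('hello', 'fred'),   # not an integer
    (13.5, 'ethel'),     # not an integer
    (True, 'flag'),      # a bool, rejected by the exact type check
    (-1, 'neg'),         # below the range
    (10000, 'big'),      # above the assumed maximum
    (100, 'ok'),         # a plausible record number
}
kept = valid_record_scopes(candidates, 1000)
```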

We are green, let’s commit this. It’s getting hairy here. Commit: XFlatFile now computes transitive closure over scope set, allowing stacking of re-scoping and non-numeric output scopes. I do not advise using non-numeric scopes.

We should have a more robust test for the range.

    def test_non_integer_re_scope(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ff_set = XSet(ff)
        scopes = XSet.from_tuples((("hello", "fred"), (13.5, "ethel"), (-1, "neg"), (10000, "big")))
        re_scoped = ff_set.re_scope(scopes)
        assert len(re_scoped) == 0
        assert re_scoped == XSet.null

Green. Commit: beef up test.

I wonder whether there is a way to generate a re-scoped XFlatFile that is empty but does not equal XSet.null. I’ll add that to the notes.

I think we’ve dealt with most of these:

~~What about `__len__` if XFlatFile scope_set is weird?~~
Improve __contains__ in XFlatFile
~~Validate scope_set in XFlatFile re_scope~~
~~XFlatFile out of range should return ... null_set?~~
possible to make scoped XFF that is empty but not null?

Let’s sum up.

Summary

I’ve shown some “better” ways to assert about the contents of sets containing sets, but none are really leaping out at me as being wonderful. Waiting is, I guess.

Then the improvements to the re-scoping capability of XFlatFile were interesting, including support for result scopes that are not numeric. That should work, but should probably be used quite sparingly.

We’ve implemented our XSets so that the elements and scopes can be most anything that can be hashed. We have in mind limiting them to strings and numbers but I would like the underlying code to support arbitrary elements and scopes throughout. However, I think it quite unwise to use very much of that capability, because nested sets very quickly become quite hard to think about.

A thing to wonder about is why, if we’re not going to make use of complicated sets, we should allow them to exist at all. Might our program be somehow easier to write or to understand without that generality? I’m not sure of an answer to that, but I am inclined to think that, by and large, what we have here is a simplifying kind of generality that makes solutions easier to create, not harder.

One example came up today, with the observation that we can compute the scope set for a derived set by using the re-scoping operator itself. As soon as I looked at the set expression, re-scope came to mind, and it worked as soon as it was put in.

[Image: card showing set expression]

I think things went well this afternoon, and I am pleased with scoped flat files on top of scoped flat files. I am still not sure that this is the right place for the capability: a more generalized view set might be more valuable. We’ll see. We are finding our way here, moving toward better as best we can.

See you next time!