Some Progress, Some Thoughts

FAFO on GitHub

In your absence, I made a simple but significant change. I had an interesting idea. And an important observation. And more.

Hello, friends!

This morning, we’ll talk about membership conditions, relations, as in relational database not cousins, and an idea about limitations. Then I’ll ask about usage.

Membership

Yesterday, after getting the XFlatFile to iterate properly, we stalled implementing __contains__, the method that determines whether a given element-scope pair is in the set. We had a failing test. Later in the day, after finishing the article, I came back and did a little work, with some success, including what I think is a small but important change.

Here’s the test that we didn’t get running yesterday, and a new one:

    def test_ff_contains(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        flat_set = XSet(ff)
        flat_iter = iter(flat_set)
        flat_rec, index = next(flat_iter)
        assert flat_rec.includes('jeffries', 'last')
        assert flat_set.includes(flat_rec, index)

    def test_ff_select(self):
        def sel(person, s):
            return (person.includes('iam', 'last') and
                    person.includes('taylor', 'first') and
                    person.includes('serf', 'job'))

        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        flat_set = XSet(ff)
        result = flat_set.select(sel)
        count = 0
        for person, s in result:
            count += 1
        assert count == 4

The first test was never correctly written yesterday. The last assertion always had flat_set instead of flat_rec so it could never have passed. I don’t think it would have passed had it been correct, at that time, because of the handling of __contains__.

We also have that second test, which happens to read the entire 1000 record set and return four records from near the end. That test makes me confident that the XFlatFile is in good shape.

The key change was this. The XSet.includes method used to say this:

    def includes(self, element, scope) -> bool:
        if scope is None:
            scope = self.null
        return (element, scope) in self.implementation

This deferred containment to the implementation, and required every implementation to implement __contains__. That requirement is currently reflected in the abstract class XImplementation. We’re about to change that requirement.

The new implementation of XSet.includes:

    def includes(self, element, scope) -> bool:
        if scope is None:
            scope = self.null
        return any(e == element and s == scope for e, s in self)

The good news is that this will work for any set that can iterate its elements and scopes, and, of course, they all can. The bad news is that it is an exhaustive search of the set, up to the point where it finds the element and scope.

As soon as this change was in place, the corrected test worked, and all our other tests continued to work. There is an important formal reason why that has to be the case: the formulation above is one phrasing of the mathematical definition of includes, so, given that our sets are valid implementations, it must work, and if it doesn’t work, our sets are not yet correctly implementing set theory.

In the particular case of XFlatFile, our _contains__ method, if we required it, would essentially replicate the one we now have in includes, and that one is able to handle any set, not just one type. My deferral to the implementation was a premature optimization, and not a wise one, because most sets we create will not likely have a fast way of checking membership. Some may, and if and when we ever implement such a set, we’ll see about making this includes more clever.

My tentative plan for that would be to have XSet.includes send a message to the implementation and via an exception or specialized return, the set would signify whether it was answering the question or not. If it does, we use the answer and if not, we do it the hard way. For now, we always do it the hard but reliable way.

In set theory and XST, the meaning of a set comes down to its “membership condition”, the formal criteria for what elements it will and will not have. We try to write each of our operations in those terms, for at least two key reasons:

The notation is generally very compact, and as such, relatively easy to write;
The underlying definition is formal, and as such, generates well-formed results.

When we do this right, we will fairly readily create a system of sets and operations that are fully interoperable and capable of expressing essentially any information query and result we might ever want. This is a very good thing. It is also a very dangerous thing, which we’ll talk about further in our third section on limitations.

For now, we have our XFlatFile set working. And we have some adjustment to do, based on the decision not to require __contains__ as one of the required methods of an XImplementation. And there’s plenty more to do beyond that, and we’ll come to it.

I do have one small concern, and that is that our definition of includes relies on all elements and scopes implementing ==. We may wish to think about that a bit. I’ve made a note.

Relations

Most database thinking today is in terms of “relations”, collections of records each of which has the same fields as every other. Typical columnar kind of thing you find in SQL or Excel or Numbers. Our current base form for sets has each record composed of a collection of pairs, element and scope, for each field. So if there is a field “name” in the relation PERS, and there are 10,000 records in PERS, there are 10,000 occurrences of the string “name” in the set. And 10,000 “phone”, and 10,000 “address” and everything from “age” to “zip”.

Arguably, that is a bit wasteful. It occurred to me this morning that we could have something similar to what XFlatFile has: a single set of symbols for the whole relation. It would be different, but similar.

Guessing at an implementation, we might just have an ordered tuple of the names, and each record would be an ordered tuple of the values in the same order, and we would unwind them into (e, s) during iteration.

We can identify operations on a relation that will return another relation. Essentially any sub-setting operation will do so, as will any projection. So it seems likely that we would save a lot of tuples and tiny objects if we were to implement an XRelation kind of XImplementation.

So that’s a nice idea that we’ll keep in mind.

Limitations

I had an insight, while thinking about all this, one that I don’t believe I’ve had before, at least not quite in this form:

Our XSet implementation, in order to be formally complete, must be able to support any Extended Set, no matter how nested and intricate it may be.

That said, any information system implemented using XSets can, and should be limited to the creation of sets which are useful and easy to understand.

In short:

Yes, this thing can create and process any set we want, no matter how complicated. We would be fools to use that capability to create arbitrarily complicated sets.

In my prior working with these ideas, alone or with dear friends and colleagues who I dragged along with me, I’ve felt adamant that what we built had to offer the complete power of set theory clear out at the end user side of things. We never really got there, and quite possibly would have done better to have tried something less grand. We’ll never know.

In the most successful XST effort I’ve ever been a part of, I was a sort of irritant, a driving force, but was two levels of management above the work, and I honestly do not know how that one was implemented. I don’t even remember what language was used, it was so long ago. The actual work was done by the brilliant people in the development team, and the product was finished after I left, not as willingly as I might have. The product worked, but the company was sold and the product was never released.

A real pity, and I regret the mistakes that I made that contributed to the bad parts. Knowing what I do now, I am sure we’d have done better, because we would have been continually releasing a growing product rather than holding it back until it was done.

That is all water under the dam and over the bridge and those boats have been crossed and bridges burned long ago. I try to live in the now, and not to hold on to regrets, but yeah that didn’t go as well as it might have and much of the pain people felt was surely my fault. I didn’t intend that, I forgive myself, and I hope others do as well.

But I digress.

The point is that yes, we do want this thing we’re building to be able to support arbitrarily complex well-formed sets, and yes yes yes, we vow to use that capability to build powerful not very complex sets that we can understand and make use of.

Usage

Speaking of usage, I am not really ready for applications, but I would like to push this effort in the direction of applications. So if you have some data-focused kind of problem in mind, that might yield a bit to processing in a sort of database kind of way, perhaps bring it to my attention.

One example that comes to mind is music. Music collections, the on-line in your computer kind of collections, tend to have complicated and redundant data structures, with missing data and various formats from various sources. I do not have enough experience with what’s out there to know what would be useful and what would be possible, but I can imagine an import process that would bring the information in, as sets, and allow a user to search and find things based on quite wide-ranging criteria.

Anyway, I’m thinking about applications. Large enough to prove a point, small enough to do in the comfort of one’s home. Ideas welcome.

Summary

So, there we are. A small change to a more formal notion makes difficult tests easy, and avoids some work that would have been tricky and mostly redundant. We have an idea that won’t add much complexity and will reduce the memory impact of many of our sets. We reflect on wise utilization of this powerful tool. We wonder about what to do with it.

And we sum up.

See you next time!