FAFO on GitHub

We have a long-form cardinality method. Let’s use the len function and require it as part of the implementations.

Hello, friends!

I implemented this, just because I needed it, without thinking about it much at all:

class XSet:
    def cardinality(self):
        count = 0
        for _ignored in self:
            count += 1
        return count

Nothing good about that. Let’s require our sets to implement the len function.

class XImplementation(ABC):
    # @abstractmethod can be implemented if desired
    def __contains__(self, item):
        raise NotImplemented

    @abstractmethod
    def __iter__(self):
        raise NotImplemented

    @abstractmethod
    def __hash__(self):
        raise NotImplemented

    @abstractmethod
    def __len__(self):
        raise NotImplemented

    @abstractmethod
    def __repr__(self):
        raise NotImplemented

Will PyCharm tell me about my violations of this rule? It will. Also all my tests fail until I get most of the new methods implemented. I just used len(whatever_data) for most of them. To get things going, I did XFlatFile longhand.

I kept cardinality:

class XSet:
    def cardinality(self):
        return len(self)

    def __len__(self):
        return len(self.implementation)

Commit: all XImplementations can return len(), used optionally with cardinality.

Now let’s see about XFlatFile, which I just did this way:

class XFlatFile(XImplementation):
    def __len__(self):
        count = 0
        for _i in self:
            count += 1
        return count

I think we should have a test for this. Ah, we have:

    def test_waste_memory(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ee = XSet(ff)
        assert ee.cardinality() == 1000
        jeffries = ee.select(lambda e, s: e.includes('jeffries', 'last'))
        assert jeffries.cardinality() == 200
        ...

So we can see if we can get the file length and use that.

class XFlatFile(XImplementation):
    def __len__(self):
        file_length = stat(expanduser(self.file_path)).st_size
        return int(file_length / self.record_length)

This works. The code does not deal with anything quaint like other than plain 8-bit text files. But it should work well for those and they are all we have at the moment.

Commit: Get cardinality from file length.

I noticed that I am expanding user in more than one place. Let’s do that once and for all.

Here’s the current XFlatFile:

class XFlatFile(XImplementation):
    def __init__(self, file_path, fields, scope_set=None):
        self.file_path = file_path
        self.full_file_path = expanduser(file_path)
        self.fields = fields
        field_def = self.fields[-1]
        self.record_length = field_def[-1]
        self.scope_set = scope_set

    def __contains__(self, item):
        de, ds = item
        return de == self.element_at(ds)

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1

        it = iter(self.scope_set) if self.scope_set else lots()
        for _e, scope in it:
            rec = self.element_at(scope)
            if rec is None:
                return
            yield rec, scope

    def __hash__(self):
        return hash((self.full_file_path, self.fields))

    def __len__(self):
        file_length = stat(self.full_file_path).st_size
        return int(file_length / self.record_length)

    def __repr__(self):
        return f'XFlatFile({self.file_path})'

    def get_record(self, index):
        seek_address = index*self.record_length
        with open(self.full_file_path, "r") as f:
            f.seek(seek_address)
            rec = f.read(self.record_length)
        return rec

    def element_at(self, scope):
        if not isinstance(scope, int) or scope < 1:
            return None
        rec = self.get_record(scope - 1)
        if rec == '':
            return None
        return XSet(XFlat(self.fields, rec))

    def rename_contents(self, re_scoping_set):
        new_names = []
        for name, start, len in self.fields:
            changed_name = name
            for old, new in re_scoping_set:
                if name == old:
                    changed_name = new
            new_names.append((changed_name, start, len))
        new_impl = self.__class__(self.full_file_path, new_names, self.scope_set)
        return XSet(new_impl)

I kept the file_path member so that __repr__ doesn’t display my true path, so that no one will ever find out that my Mac user name is fuzzy-caterpillar or whatever it is.

Commit: Expand file path just once.

Summary

Just a small improvement, but surely the cardinality / len operation on the XFlatFile is thousands of times faster. Most of the others, as well. 99 tests run in 143 ms currently. Not bad.

I think that a thing that might be interesting would be another XImplementation, XRelation, a set where all the records have the same shape, with named field names like the examples we’ve been using. But the names would be kept just once, in the top-level set, and the field values in a simple list (or XTuple?) and spun out with the proper field names upon iteration.

That would save mass quantities of storage and probably some processing time as well.

The XRelation type would be closed under a number of operations, if one were careful, such as project, restrict, and select.

Another thing I am tempted to try is some explicitly nested sets, such as a person record with a name element containing first and last, or a contact with address, email, phone. One might even consider a name element with varied contents, since some people’s names do not really parse into first / last.

Another area of importance would be general grouping, such as collecting all the employees into job groups. Grouping is particularly valuable when calculating statistics such as average pay by job title.

Oh, and here’s what may be a good one. We presently do select by producing the entire selected set. What if, instead, we were to put a filter on top of the source set, with an iterator that just returns one record at a time? Would that be better in any important way?

Lots to do. I still rather wish I had an application in mind. Having fun tho.

Perhaps it’s time to draw some serious lessons from all this? I think there are some lurking in here somewhere.

See you next time!