FAFO on GitHub

What if a function IS a set instead of returning a set? This might be significant.

Hello again, friends!

This afternoon I’m thinking about set operators, which are presently functions that return a completed set. What if, instead, f(s1,s2) is a set, rather than a function returning a set? Today, if we write something like this:

    def test_select(self):
        def sel(e, s):
            print("checking", e, s, e > 3)
            return e > 3
        s1 = XSet.n_tuple((0, 1, 2, 3, 4, 5, 6))
        selected = s1.select(sel)
        assert (4, 5) in selected
        assert selected.includes(4, 5)

By the time we assign selected, the values from the select have already been computed. When we assert against them, the includes operator iterates through them:

class XSet:
    def select(self, cond) -> Self:
        tuples = list((e, s) for e, s in self if cond(e, s))
        return XSet.from_tuples(tuples)

    def includes(self, element, scope) -> bool:
        if scope is None:
            scope = self.null
        try:
            return (element, scope) in self.implementation
        except NotImplemented:
            return any(e == element and s == scope for e, s in self)

In the case of select the implementation is XFrozen, which works like this:

class XSet:
    @classmethod
    def from_tuples(cls, tuples):
        def is_2_tuple(a):
            return isinstance(a, tuple) and len(a) == 2

        frozen = frozenset(tuples)  # ensure we didn't get passed a generator
        if not all(is_2_tuple(a) for a in frozen):
            raise AttributeError
        return cls(XFrozen(frozen))

class XFrozen(XImplementation):
    def __init__(self, fs):
        self.data = fs

    def __contains__(self, item):
        return item in self.data

    def __iter__(self):
        for e, s in self.data:
            yield e, s

I should mention the iterator. I just went thru and changed some of the XImplementation iterators to be generators, which you can just do by replacing the creation of an iterator object with a function containing one or more yield statements. So that does mean that we could just read the first element from the set and not iterate all of them, but of course they are already tucked away in data.

The question I’m dealing with now is whether we can make the select function return a set that returns its elements and scopes one pair at a time. And the equally important question of whether we should do that. Or, perhaps better phrased, when, if ever, should we do that?

It’s clear that we shouldn’t always do it. Suppose that select was given a very large set from which to select elements. If the resulting set were to be iterated twice, the large set would be scanned again, even if the resulting output was just a few elements. It would be quite inefficient if we were to do that.

So, at least sometimes, we probably want to create the real set.

On the other hand, imagine a series of selection operations, staring with some large set, winnowing it down at each subsequent select. Suppose at the top of the tree, we just request that the few records at the top get printed or tucked away somewhere. In that case, doing the selections without saving the intermediate results would be ideal, or so it seems to me.

I am not yet sure just how to think about this. I think I’ll begin by making all the implementations work as generators, which seems not to hurt. They’ll produce their data in a more lazy fashion, but they’ll already have the actual data.

Most of them are straightforward. What about XFlatFile though:

class XFlatFile(XImplementation):
    def __init__(self, file_path, fields, scope_set=None):
        self.file_path = file_path
        self.fields = fields
        field_def = self.fields[-1]
        self.record_length = field_def[-1]
        self.scope_set = scope_set

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1
        if self.scope_set:
            return XFlatFileIterator(self, iter(self.scope_set))
        else:
            return XFlatFileIterator(self, lots())

I will try this:

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1
        if self.scope_set:
            it = iter(self.scope_set)
            for _e, scope in it:
                rec = self.element_at(scope)
                if rec is None:
                    return
                yield rec, scope
        else:
            it = lots()
            for _e, scope in it:
                rec = self.element_at(scope)
                if rec is None:
                    return
                yield rec, scope

Tests are green. There appears to be some duplication. Let’s fix that.

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1
                
        it = iter(self.scope_set) if self.scope_set else lots()
        for _e, scope in it:
            rec = self.element_at(scope)
            if rec is None:
                return
            yield rec, scope

Green. Commit: convert iterator objects to generator functions.

Reflection

So. I think that all the XImplementation objects now use generators to produce their records. The general outlook is that for small collections, generators are slower, but they are faster for larger collections. And for small collections, they’re fast enough that for our purposes, it surely doesn’t matter.

If someone were to ask me “so what?”, I’m not sure how I’d answer. I think we can see some places where we might shortcut processing some information, but all the sets still materialize all their data.

The only big set we have, the XFlatFile, already used an incremental iterator:

class XFlatFileIterator:
    def __init__(self, flat_file, generator):
        self.file = flat_file
        self.scope_gen = generator

    def __iter__(self):
        return self

    def __next__(self):
        _element, scope = next(self.scope_gen)
        element_tuple = (rec := self.file.element_at(scope), scope)
        if rec is None:
            raise StopIteration
        else:
            return element_tuple

So flat files, the only big and slow set, just read as many records as were actually used anyway. And the others, XFrozen, XFlat, and XTuple, contain all their data upon creation. I’m not sure what the next step should be. I’m leaning toward working with select to return something more dynamic than a thing that creates all the elements as now happens:

class XSet:
    def select(self, cond) -> Self:
        tuples = list((e, s) for e, s in self if cond(e, s))
        return XSet.from_tuples(tuples)

I am hopeful that we can get from where we are to a situation where sets are streamed rather than completely materialized, without major revisions. I am open to the possibility of revision, however, if that’s what it takes.

Still … this feels significant. I hope my readers, if any, are enjoying the ride.

See you next time!