Streaming Sets

FAFO on GitHub

There’s no doubt that we can create sets or set operations that stream their results, avoiding creating large temporary sets in memory. There are issues to think about. First experiment works!

I see two “issues” with regard to streaming sets, beyond the initial simple question of whether it’s possible at all.

Can We Do It?

#Often clearly possible

Almost all our current set operations create a full result set in memory. Given that a Python iterator can be written in coroutine form, yielding one item at a time, it’s easy to see¹ that many operations can be implemented that way, so that they’ll stream their results without the need to create a physical result. The benefits would be reduced memory impact, and, I believe, faster execution.

Sometimes it’s not clear

Some operations, such as those that select a subset of an existing set, like our select or restrict operations, can rather clearly be streamed. We’ll just call next on the input until we get to a tuple that passes the criterion and we’ll yield it, rinse, repeat. Other operations are more problematical. Set difference is an example. We can stream the left side, but for each record on the left, we need to ask whether it exists in the set on the right. I guess we can still stream, but if we do, might we not trigger an arbitrarily deep restart of a pipeline? Almost certainly, caching the right-side set will be desirable.

Similar issues arise for operations like symmetric difference. An operation like union can be managed in at least two ways:

The most obvious thing to do with a union is to remove duplicates, which requires either a search or some kind of sort/merge operation;
Less obvious is the fact that { a, b, a } equals { a, b }. We could just carry the duplicates around for a while and winnow them out only as needed by some end user view or the like.

Sometimes nearly impossible

It seems likely that there are operations that simply cannot realistically be streamed without first creating at least some kind of auxiliary structure. So be it, one does what one must.

Impact on Design

Rather more concerning are the questions of what these streaming sets are, what our existing XSets and XImplementations are, and how they fit together.

Review of present design

Our XSet class is the official representation of an Extended Set, a collection of tuples representing element-scope combinations. The XSet class implements all the set operations that we support, as methods. To get c to be the union of sets a and b, we would write something like:

c = a.union(b)

The XSet set operation methods, such as union, are all rather small, and are rather close to the formal set-theoretic definition of the operation. And, as presently written, they all assume that they can run in memory. For example, union, mentioned above, is implemented in terms of Python sets:

    def union(self, other):
        mine = set(self)
        others = set(other)
        both = mine | others
        return XSet.from_tuples(both)

We just load the sets into Python set instances and union them, returning the result back wrapped in an XSet instance.

The XImplementation subclasses, on the other hand, represent physical storage for XSets. The underlying construct in our XSets is a 2-tuple, representing element^scope, where element and scope can be anything hashable, including XSets. We conventionally use strings and numbers as base elements and scopes.

The XImplementations include:

XFrozen: This is the base implementation of XSets (at present). It is a Python frozenset of tuples. The tuples can be elementary element-scope 2-tuples, or any more complex combination of XSet elements and XSet tuples. Since all XSets must be hashable, the XFrozen implementation can hold any conceivable XSet, so long as memory does not run out.
XFlatFile: This object holds a file name and a symbol table of scope names and lengths, in order, mapping the records of the input file. Records of an XFlatFile are inevitably simple sets of named scopes. XFlatFile either assumes that its records are numbered 1 through N, or contains a scope-set, which maps physical record numbers to new scopes. In the latter form, an XFlatFile instance can represent a subset of the file, accessing records randomly.
XFlat: This object represents a single record from XFlatFile. It has the same symbol table as the XFlatFile, one of whose records it represents, and when iterated, produces the fields of that record as tuples.
XTuple: This specialized object represents a set with scopes 1 through N, with exactly one element at each scope value. It’s an ordered n-tuple. XTuple is currently used only in tests. Unless I am mistaken, which is always possible, there are no set operations that create XTuple instances, and no special operations taking advantage of them, with the exception that the XTuple can answer quickly whether a given element-scope combination is in the set.
XCalculation: This is the newest of the XImplementation family. An XCalculation’s input are a set of “records”, and a collection of expressions, which are provided in human-readable form when the XCalculation is created, and stored in a Reverse Polish Notation form. When the set is iterated, the XCalculation produces new records, consisting of each input record, unioned with the calculated values. XCalculation can be seen as an example of a “streaming set”, in that it fetches a record and scope from its input, creates one new record and yields it, never creating in memory, the new set that would consist of the base record plus calculated values. XCalculation has not yet been moved to the “production” side of things, as it is still under development.

All the XImplementation subclasses use yield in their implementation of __iter__. As such, it seems to YT² that they will not get in the way of streaming.

XSet operations create concrete sets

The existing XSet operations generally return concrete sets that will be placed in memory. The operation will thrash tuples around as needed, creating a new set of tuples, and then wrap them in an instance of XSet. Generally speaking, they’ll be created in the most generic XImplementation form, XFrozen.

Our mission with streaming sets will be to change this scheme. Let’s look at an example: restrict.

    def restrict(self, restrictor) -> Self:
        def other_has_match(e, s):
            return any((e_r.is_subset(e) for e_r, s_r in restrictor))

        if not isinstance(restrictor, self.__class__):
            return NotImplemented
        return self.select(other_has_match)

    def select(self, cond) -> Self:
        tuples = ((e, s) for e, s in self if cond(e, s))
        return XSet.from_tuples(tuples)

We see that we set up a condition to check for, and use select to find all the records meeting the condition. And select returns a concrete XSet, whose implementation will be XFrozen. It’s worth noting that the condition iterates the restrictor set for each record in the main set, to see whether it matches any of the restricting elements. On the surface, there’s no way around that. Beneath the surface, we could imagine some hashed or b-treed or binary search kind of thing, but for now, it’s searched. It’s generally a very small concrete set, so probably we don’t care.

How might we allow this operation to stream? If select returned an XSet whose XImplementation was an XSelect (new kind of implementation just being invented right now), could that object stream?

On the surface, yes. XSelect would have a base set, and a cond condition. (The condition is interesting. It’s a function or bound method or something. Is it properly holding on to the restrictor such that we could be sure that each XSelect accesses the right restrictor? Hard for me to think about.) I think we’d want to test that a bit. More than a bit.

If streaming is in the XImplementations

Could this work in all the cases? (Assuming that it would work at all.) Might we put all the streaming options in as XImplementation types, hiding under the covers, created by increasingly intelligent set operations in XSet?

It might be worth a try. It would be worth a try. There is another issue.

Higher-level optimization

We still have, on our fantasy list, the ability for the program to consider a series of operations and rearrange it into a more optimal form, perhaps based on knowledge of existing structures or the like. One would expect that kind of thing to require some kind of syntax tree, showing the sequence of operations, and subject to rearrangement or substitutions.

I am worried, nay only concerned, nay only wondering about how this would fit into the current scheme. Should there still be the XSet-XImplementation layer? Would optimization be a third layer? Would it somehow weave into the existing layers?

Way too soon to know. I do think about it, however. Thinking is good.

Action Time

The thinking above has gelled to the point where I am inclined to try something, namely XSelect, a new XImplementation, as hinted at above. And hinted is exactly right, because pretty much everything I know about it is contained in those few words above.

We will build it with tests, of course. We recognize that a key test will be two restrict operations that can be used at the same time, showing that the restricting sets don’t compete. We’ll have to work up to that over a few tests. Let’s get right to it, in a new test file.

    def test_plain_select(self):
        def cond(e, s):
            return 'x' in e

        source = XSet.n_tuple(('abc', 'dxef', 'ghi', 'xyz'))
        selected = source.select(cond)
        assert selected[2] == 'dxef'
        assert selected[4] == 'xyz'

That’s a simple test of the existing select, just to get the little grey cells working. Now I guess I’ll try building an XSelect here in the test file.

    def test_x_select(self):
        def cond(e, s):
            return 'x' in e

        source = XSet.n_tuple(('abc', 'dxef', 'ghi', 'xyz'))
        x_select = XSelect(source, cond)
        it = iter(x_select)
        assert next(it) == ('dxef', 2)
        assert next(it) == ('xyz', 4)

Let’s see about creating the class with some help from PyCharm:

class XSelect(XImplementation):
    def __init__(self, base, cond):
        self._base = base
        self._cond = cond

    def __iter__(self):
        for e, s in self._base:
            if self._cond(e, s):
                yield e, s

    def __hash__(self):
        return hash(self._base)

    def __len__(self):
        return 0

    def __repr__(self):
        return f'XSelect({self._base})'

The test runs! We need something about __len__. I guess we have to just do it:

    def __len__(self):
        len = 0
        for _t in self:
            len += 1
        return len

And yes, I did test it:

    def test_x_select(self):
        def cond(e, s):
            return 'x' in e

        source = XSet.n_tuple(('abc', 'dxef', 'ghi', 'xyz'))
        x_select = XSelect(source, cond)
        assert len(x_select) == 2
        it = iter(x_select)
        assert next(it) == ('dxef', 2)
        assert next(it) == ('xyz', 4)

I think I’ll commit: initial test and class for XSelect.

I’d like to get to a test holding on to two different conds containing sets this morning. Let’s add a new method, x_select, to XSet. We’ll test drive it:

    def test_through_XSet(self):
        source = XSet.n_tuple(('abc', 'dxef', 'ghi', 'xyz'))
        sel1 = XSet.from_tuples((('dxef', 2), ))
        sel2 = XSet.from_tuples((('xyz', 4), ))
        sub1 = source.x_select(lambda e, s: (e, s) in sel1)
        sub2 = source.x_select(lambda e, s: (e, s) in sel2)
        result = sub1 | sub2
        assert ('dxef', 2) in result
        assert ('xyz', 4) in result

Now I just need x_select in XSet:

    def x_select(self, cond) -> Self:
        from test_x_select import XSelect
        x_sel = XSelect(self, cond)
        return XSet(x_sel)

My test passes! Commit: experimental method x_select in XSet correctly builds and operates XSelect set.

I want to write another test just for fun:

    def test_cascade(self):
        def even(e, s):
            return e%2 == 0

        def fiveish(e, s):
            return e%5 == 0

        source = XSet.classical_set(range(21))
        fives = source.x_select(fiveish)
        even_fives = fives.x_select(even)
        expected = XSet.classical_set((0, 10, 20))
        assert even_fives == expected

Whee! Commit: another test just for fun.

Let’s back away slowly and sum up.

Summary

That last test passes two different conditions, with two different sets, creating two different XSelect sets, which, unioned, have selected the right elements from the common base set. I’m not sure why I ever wondered if it would work.

If I am not mistaken, when we replace the existing select with our x_select experiment, all our uses of select and restrict will be streaming their results without creating concrete intermediate sets.

What we have here is an interesting proof of concept, a new kind of XImplementation that does not contain its elements, it produces them on demand, yielding them tuple by tuple. Almost anything we do with an XSelect set will stream the whole thing, because in general, we are creating complete sets, but even so, with the old select we would build up the entire set and then start producing elements, while with x_select, we begin producing elements right away and never guild up the whole set.

The memory impact could be quite significant: imagine a large database of people, and we want a selection of all the serfs over a certain age³. It takes a lot of serfs to support a feudal kingdom and our existing select would copy all those records into memory and then copy the old serf records again. Using the new select, we would spin through all the records, passing each serf to the age selector, and accumulating only those … or accumulating nothing at all, if, say, we were just printing their names.

I think this is an interesting step forward. And I will freely grant that I do find it a bit difficult to think about what is happening and when it is happening, as we go through a series of set operations. What is important about that, I think, is that the set operations don’t seem to care whether I can envision their behavior clearly or not: they do what is expected.

I think this speaks to the power of this mathematical approach: each operation takes XSets in and creates XSets out. The individual methods are all very simple and straightforward, with simple set-theoretic definitions and implementations that follow that math precisely⁴.

This is a pleasing result. I only wish I had someone to pair with who could appreciate it as much as I can.

See you next time!

Is it really “easy to see” that we can create streaming sets? I hope so. In any case I am sure that we can, and plan to do it soon. If it’s not easy, I’ll abase myself for being too optimistic. But I’m not. ↩
Yours Truly. h/t Neal Stephenson, Snow Crash ↩
Doubtless the king wishes to set these aging serfs up with a nice retirement situation. It’s not like he’s a member of the Party Which Will Not Be Named. ↩
As far as I know. There is always the possibility that I’ve made a math mistake, or a programming mistake, or let something slip. There are currently 156 tests, so I am rather confident. The tests run in 130 milliseconds, by the way. ↩