Generators All the Way Down?
What if a function IS a set instead of returning a set? This might be significant.
Hello again, friends!
This afternoon I’m thinking about set operators, which are presently functions that return a completed set. What if, instead, f(s1,s2) is a set, rather than a function returning a set? Today, if we write something like this:
def test_select(self):
    def sel(e, s):
        print("checking", e, s, e > 3)
        return e > 3

    s1 = XSet.n_tuple((0, 1, 2, 3, 4, 5, 6))
    selected = s1.select(sel)
    assert (4, 5) in selected
    assert selected.includes(4, 5)
By the time we assign selected, the values from the select have already been computed. When we assert against them, the includes operator iterates through them:
class XSet:
    def select(self, cond) -> Self:
        tuples = list((e, s) for e, s in self if cond(e, s))
        return XSet.from_tuples(tuples)

    def includes(self, element, scope) -> bool:
        if scope is None:
            scope = self.null
        try:
            return (element, scope) in self.implementation
        except NotImplementedError:
            return any(e == element and s == scope for e, s in self)
In the case of select, the implementation is XFrozen, which works like this:
class XSet:
    @classmethod
    def from_tuples(cls, tuples):
        def is_2_tuple(a):
            return isinstance(a, tuple) and len(a) == 2

        frozen = frozenset(tuples)  # ensure we didn't get passed a generator
        if not all(is_2_tuple(a) for a in frozen):
            raise AttributeError
        return cls(XFrozen(frozen))

class XFrozen(XImplementation):
    def __init__(self, fs):
        self.data = fs

    def __contains__(self, item):
        return item in self.data

    def __iter__(self):
        for e, s in self.data:
            yield e, s
I should mention the iterator. I just went thru and changed some of the XImplementation iterators to be generators, which you can do by replacing the creation of an iterator object with a function containing one or more yield statements. So that does mean that we could read just the first element from the set without iterating all of them, but of course they are already tucked away in data.
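The conversion is mechanical. Here is a minimal before-and-after sketch, with hypothetical class names standing in for the XImplementation classes:

```python
# Before: a dedicated iterator object with explicit __next__ bookkeeping.
class PairIterator:
    def __init__(self, data):
        self.it = iter(data)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.it)

class OldContainer:
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return PairIterator(self.data)

# After: __iter__ is itself a generator function; the yield replaces the
# whole iterator class, and values are produced only as they are requested.
class NewContainer:
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        for e, s in self.data:
            yield e, s

pairs = NewContainer([(1, 1), (2, 2)])
first = next(iter(pairs))  # reads just the first pair: (1, 1)
```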
The question I’m dealing with now is whether we can make the select function return a set that returns its elements and scopes one pair at a time. And the equally important question of whether we should do that. Or, perhaps better phrased, when, if ever, should we do that?
It’s clear that we shouldn’t always do it. Suppose that select was given a very large set from which to select elements. If the resulting set were iterated twice, the large set would be scanned again, even if the resulting output was just a few elements. It would be quite inefficient if we were to do that.
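To make the cost concrete, here is a hypothetical lazy select whose __iter__ re-runs the selection on every pass, with a counter showing how many times the predicate fires:

```python
class LazySelect:
    """Hypothetical: a 'set' whose __iter__ re-runs the selection each time
    it is iterated, rather than materializing the result once."""
    def __init__(self, source, cond):
        self.source = source
        self.cond = cond

    def __iter__(self):
        for e, s in self.source:
            if self.cond(e, s):
                yield e, s

checks = 0

def cond(e, s):
    global checks
    checks += 1          # count every predicate evaluation
    return e >= 998

big = [(n, n) for n in range(1000)]
selected = LazySelect(big, cond)
first_pass = list(selected)   # scans all 1000 pairs
second_pass = list(selected)  # scans all 1000 pairs again, for the same 2 results
# checks is now 2000
```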
So, at least sometimes, we probably want to create the real set.
On the other hand, imagine a series of selection operations, starting with some large set and winnowing it down at each subsequent select. Suppose at the top of the tree, we just request that the few records at the top get printed or tucked away somewhere. In that case, doing the selections without saving the intermediate results would be ideal, or so it seems to me.
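That winnowing case can be sketched with plain generator expressions (lazy_select and counting_source are hypothetical stand-ins, not XSet code). Because nothing is materialized, asking for the first few results only reads as much of the source as those results require:

```python
from itertools import islice

def lazy_select(pairs, cond):
    # hypothetical streaming select: yields matches as the source is read
    return ((e, s) for e, s in pairs if cond(e, s))

examined = 0

def counting_source(n):
    global examined
    for i in range(n):
        examined += 1    # count how many records the pipeline actually pulls
        yield i, i

# A chain of selects over a nominally million-record source.
narrowed = lazy_select(
    lazy_select(counting_source(1_000_000), lambda e, s: e % 2 == 0),
    lambda e, s: e % 3 == 0,
)
first_three = list(islice(narrowed, 3))  # stop after three matches
# first_three == [(0, 0), (6, 6), (12, 12)], and only 13 records were examined
```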
I am not yet sure just how to think about this. I think I’ll begin by making all the implementations work as generators, which seems not to hurt. They’ll produce their data in a more lazy fashion, but they’ll already have the actual data.
Most of them are straightforward. What about XFlatFile though:
class XFlatFile(XImplementation):
    def __init__(self, file_path, fields, scope_set=None):
        self.file_path = file_path
        self.fields = fields
        field_def = self.fields[-1]
        self.record_length = field_def[-1]
        self.scope_set = scope_set

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1

        if self.scope_set:
            return XFlatFileIterator(self, iter(self.scope_set))
        else:
            return XFlatFileIterator(self, lots())
I will try this:
def __iter__(self):
    def lots():
        n = 1
        while True:
            yield n, n
            n += 1

    if self.scope_set:
        it = iter(self.scope_set)
        for _e, scope in it:
            rec = self.element_at(scope)
            if rec is None:
                return
            yield rec, scope
    else:
        it = lots()
        for _e, scope in it:
            rec = self.element_at(scope)
            if rec is None:
                return
            yield rec, scope
Tests are green. There appears to be some duplication. Let’s fix that.
def __iter__(self):
    def lots():
        n = 1
        while True:
            yield n, n
            n += 1

    it = iter(self.scope_set) if self.scope_set else lots()
    for _e, scope in it:
        rec = self.element_at(scope)
        if rec is None:
            return
        yield rec, scope
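To see the lots() pattern in isolation: the generator proposes unbounded candidate scopes, and iteration stops when the lookup runs past the end of the data. A toy stand-in, with element_at faked over a small list rather than a real flat file:

```python
records = ["alice", "bob", "carol"]

def element_at(scope):
    # 1-based record lookup; None past the end, like reading past EOF
    return records[scope - 1] if scope <= len(records) else None

def flat_iter():
    n = 1
    while True:          # the lots() pattern: candidate scopes 1, 2, 3, ...
        rec = element_at(n)
        if rec is None:  # the data ran out, so the iteration ends
            return
        yield rec, n
        n += 1

results = list(flat_iter())
# results == [("alice", 1), ("bob", 2), ("carol", 3)]
```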
Green. Commit: convert iterator objects to generator functions.
Reflection
So. I think that all the XImplementation objects now use generators to produce their records. The general outlook is that for small collections, generators are slower, but they are faster for larger collections. And for small collections, they’re fast enough that for our purposes, it surely doesn’t matter.
If someone were to ask me “so what?”, I’m not sure how I’d answer. I think we can see some places where we might shortcut processing some information, but all the sets still materialize all their data.
The only big set we have, the XFlatFile, already used an incremental iterator:
class XFlatFileIterator:
    def __init__(self, flat_file, generator):
        self.file = flat_file
        self.scope_gen = generator

    def __iter__(self):
        return self

    def __next__(self):
        _element, scope = next(self.scope_gen)
        element_tuple = (rec := self.file.element_at(scope), scope)
        if rec is None:
            raise StopIteration
        else:
            return element_tuple
So flat files, the only big and slow set, just read as many records as were actually used anyway. And the others, XFrozen, XFlat, and XTuple, contain all their data upon creation. I’m not sure what the next step should be. I’m leaning toward reworking select to return something more dynamic than a thing that creates all the elements, as happens now:
class XSet:
    def select(self, cond) -> Self:
        tuples = list((e, s) for e, s in self if cond(e, s))
        return XSet.from_tuples(tuples)
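One shape that dynamic version could take, sketched here as a hypothetical XSelect class (names are mine, not the XSet code, and the source is a plain list of pairs): it holds the base collection and the condition, and only evaluates the predicate while being iterated or probed.

```python
class XSelect:
    """Hypothetical lazy select: nothing is computed at construction time."""
    def __init__(self, base, cond):
        self.base = base
        self.cond = cond

    def __contains__(self, item):
        # membership needs only one predicate call plus one base lookup
        e, s = item
        return self.cond(e, s) and item in self.base

    def __iter__(self):
        # the selection runs as the caller iterates, one pair at a time
        for e, s in self.base:
            if self.cond(e, s):
                yield e, s

base = [(n, n) for n in range(7)]
sel = XSelect(base, lambda e, s: e > 3)
result = list(sel)        # [(4, 4), (5, 5), (6, 6)]
hit = (5, 5) in sel       # True, without scanning the whole base eagerly
miss = (2, 2) in sel      # False: the condition rejects it immediately
```

The trade-off from earlier still applies: each fresh iteration of an XSelect rescans its base, so somewhere a caller would need the option to materialize the result when it plans to use it more than once.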
I am hopeful that we can get from where we are to a situation where sets are streamed rather than completely materialized, without major revisions. I am open to the possibility of revision, however, if that’s what it takes.
Still … this feels significant. I hope my readers, if any, are enjoying the ride.
See you next time!