How Many?
We have a long-form cardinality method. Let’s use the len function and require it as part of the implementations.
Hello, friends!
I implemented this, just because I needed it, without thinking about it much at all:
class XSet:
    def cardinality(self):
        count = 0
        for _ignored in self:
            count += 1
        return count
Nothing good about that. Let’s require our sets to implement the len function.
from abc import ABC, abstractmethod
class XImplementation(ABC):
    # Not marked @abstractmethod: implementing this one is optional.
    def __contains__(self, item):
        raise NotImplementedError
    @abstractmethod
    def __iter__(self):
        raise NotImplementedError
    @abstractmethod
    def __hash__(self):
        raise NotImplementedError
    @abstractmethod
    def __len__(self):
        raise NotImplementedError
    @abstractmethod
    def __repr__(self):
        raise NotImplementedError
Will PyCharm tell me about my violations of this rule? It will. Also all my tests fail until I get most of the new methods implemented. I just used len(whatever_data) for most of them. To get things going, I did XFlatFile longhand.
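The enforcement does not depend on PyCharm alone: Python itself refuses to instantiate a subclass that leaves an abstract method unimplemented. Here is a minimal sketch of that behavior. The classes Implementation, Incomplete, and Complete and the helper try_to_build are hypothetical stand-ins, not the real XImplementation hierarchy.

```python
from abc import ABC, abstractmethod


class Implementation(ABC):  # hypothetical stand-in for XImplementation
    @abstractmethod
    def __iter__(self):
        raise NotImplementedError

    @abstractmethod
    def __len__(self):
        raise NotImplementedError


class Incomplete(Implementation):
    # Forgets __len__, so this class is still abstract.
    def __iter__(self):
        return iter(())


class Complete(Implementation):
    def __init__(self, items):
        self.items = list(items)

    def __iter__(self):
        return iter(self.items)

    def __len__(self):
        return len(self.items)


def try_to_build(cls, *args):
    """Return an instance, or the TypeError message if cls cannot be built."""
    try:
        return cls(*args)
    except TypeError as error:
        return str(error)


incomplete_result = try_to_build(Incomplete)          # an error message
complete_result = try_to_build(Complete, [1, 2, 3])   # a usable instance
```

The failure shows up when the object is created, not when len is first called, which is presumably why so many tests fail at once until the methods are filled in.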
I kept cardinality:
class XSet:
    def cardinality(self):
        return len(self)
    def __len__(self):
        return len(self.implementation)
Commit: all XImplementations can return len(), used optionally with cardinality.
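One side effect of adding __len__ is worth knowing about: when a class defines __len__ but no __bool__, Python uses the length for truth testing, so an empty set becomes falsy. That could matter to any check written as "if some_set", such as the scope_set check in the XFlatFile __iter__ shown later, which would then treat an empty set the same as None. The sketch below shows the delegation and the truthiness together. ListImplementation and SetSketch are hypothetical stand-ins, not the real classes.

```python
class ListImplementation:  # hypothetical stand-in for an XImplementation
    def __init__(self, items):
        self.items = list(items)

    def __iter__(self):
        return iter(self.items)

    def __len__(self):
        return len(self.items)


class SetSketch:  # hypothetical stand-in for XSet
    def __init__(self, implementation):
        self.implementation = implementation

    def __iter__(self):
        return iter(self.implementation)

    def __len__(self):
        # Delegate to whatever the implementation knows about its size.
        return len(self.implementation)

    def cardinality(self):
        return len(self)


three = SetSketch(ListImplementation(["a", "b", "c"]))
empty = SetSketch(ListImplementation([]))

three_is_truthy = bool(three)  # length 3, so truthy
empty_is_truthy = bool(empty)  # length 0 and no __bool__, so falsy
```

If the distinction between "no set" and "empty set" matters somewhere, comparing against None explicitly avoids the surprise.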
Now let’s see about XFlatFile, which I just did this way:
class XFlatFile(XImplementation):
    def __len__(self):
        count = 0
        for _i in self:
            count += 1
        return count
I think we should have a test for this. Ah, we have:
def test_waste_memory(self):
    path = '~/Desktop/job_db'
    fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
    ff = XFlatFile(path, fields)
    ee = XSet(ff)
    assert ee.cardinality() == 1000
    jeffries = ee.select(lambda e, s: e.includes('jeffries', 'last'))
    assert jeffries.cardinality() == 200
    ...
So we can see if we can get the file length and use that.
from os import stat
from os.path import expanduser
class XFlatFile(XImplementation):
    def __len__(self):
        file_length = stat(expanduser(self.file_path)).st_size
        return int(file_length / self.record_length)
This works. The code does not deal with anything more exotic than plain 8-bit text files, but it should work well for those, and they are all we have at the moment.
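To see the idea in isolation, here is a minimal sketch that counts fixed-length records from the file size and checks the answer against the slow read-everything approach. The functions record_count and count_by_reading, the scratch file, and the 44-byte record length (12 + 12 + 12 + 8, as the field widths in the test suggest) are all assumptions for illustration. The sketch uses floor division, which stays in integers and drops any trailing partial record.

```python
import os
import tempfile


def record_count(path, record_length):
    """Count whole fixed-length records from the file size alone."""
    size_in_bytes = os.stat(path).st_size
    # Floor division ignores a trailing partial record, if there is one.
    return size_in_bytes // record_length


def count_by_reading(path, record_length):
    """The slow way, for comparison: read every record."""
    count = 0
    with open(path, "rb") as f:
        while len(f.read(record_length)) == record_length:
            count += 1
    return count


RECORD_LENGTH = 44  # assumed: 12 + 12 + 12 + 8

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "job_db")  # hypothetical scratch file
    with open(path, "wb") as f:
        for i in range(1000):
            # Each record is the number i, left-aligned and padded to 44 bytes.
            f.write(f"{i:<44}".encode("ascii"))
    fast = record_count(path, RECORD_LENGTH)
    slow = count_by_reading(path, RECORD_LENGTH)
```

Note that st_size is measured in bytes, so the shortcut only matches the record count when every character occupies exactly one byte, which is the plain 8-bit restriction mentioned above.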
Commit: Get cardinality from file length.
I noticed that I am expanding user in more than one place. Let’s do that once and for all.
Here’s the current XFlatFile:
class XFlatFile(XImplementation):
    def __init__(self, file_path, fields, scope_set=None):
        self.file_path = file_path
        self.full_file_path = expanduser(file_path)
        self.fields = fields
        field_def = self.fields[-1]
        self.record_length = field_def[-1]
        self.scope_set = scope_set
    def __contains__(self, item):
        de, ds = item
        return de == self.element_at(ds)
    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1
        it = iter(self.scope_set) if self.scope_set else lots()
        for _e, scope in it:
            rec = self.element_at(scope)
            if rec is None:
                return
            yield rec, scope
    def __hash__(self):
        return hash((self.full_file_path, self.fields))
    def __len__(self):
        file_length = stat(self.full_file_path).st_size
        return int(file_length / self.record_length)
    def __repr__(self):
        return f'XFlatFile({self.file_path})'
    def get_record(self, index):
        seek_address = index*self.record_length
        with open(self.full_file_path, "r") as f:
            f.seek(seek_address)
            rec = f.read(self.record_length)
        return rec
    def element_at(self, scope):
        if not isinstance(scope, int) or scope < 1:
            return None
        rec = self.get_record(scope - 1)
        if rec == '':
            return None
        return XSet(XFlat(self.fields, rec))
    def rename_contents(self, re_scoping_set):
        new_names = []
        for name, start, len in self.fields:
            changed_name = name
            for old, new in re_scoping_set:
                if name == old:
                    changed_name = new
            new_names.append((changed_name, start, len))
        new_impl = self.__class__(self.full_file_path, new_names, self.scope_set)
        return XSet(new_impl)
I kept the file_path member so that __repr__ doesn’t display my true path, so that no one will ever find out that my Mac user name is fuzzy-caterpillar or whatever it is.
Commit: Expand file path just once.
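Stripped down to just the path handling, the pattern looks roughly like this. PathHolder is a hypothetical class used only to show the two members side by side. It is not part of the real code.

```python
from os.path import expanduser


class PathHolder:  # hypothetical: only the path-handling part of XFlatFile
    def __init__(self, file_path):
        self.file_path = file_path                   # as the caller wrote it
        self.full_file_path = expanduser(file_path)  # expanded exactly once

    def __repr__(self):
        # Show the unexpanded path so the home directory stays private.
        return f"PathHolder({self.file_path})"


holder = PathHolder("~/Desktop/job_db")
shown = repr(holder)
```

Everything that touches the file system uses full_file_path, and everything meant for human eyes uses file_path.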
Summary
Just a small improvement, but surely the cardinality / len operation on the XFlatFile is thousands of times faster, since it now asks the file system for one size instead of reading every record. Most of the others are faster as well. 99 tests run in 143 ms currently. Not bad.
I think that a thing that might be interesting would be another XImplementation, XRelation, a set where all the records have the same shape, with named field names like the examples we’ve been using. But the names would be kept just once, in the top-level set, and the field values in a simple list (or XTuple?) and spun out with the proper field names upon iteration.
That would save mass quantities of storage and probably some processing time as well.
The XRelation type would be closed under a number of operations, if one were careful, such as project, restrict, and select.
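To make the XRelation idea a little more concrete, here is a rough sketch under some loud assumptions: RelationSketch is a hypothetical class, records are spun out as plain dicts rather than XSets, scopes are just 1-based row numbers, and the sample data is invented. It keeps the field names once, stores each row as a bare tuple, and shows one operation, project, that returns another relation.

```python
class RelationSketch:  # hypothetical; not the XRelation described above
    def __init__(self, field_names, rows):
        self.field_names = tuple(field_names)     # stored once for all rows
        self.rows = [tuple(row) for row in rows]  # values only, no names

    def __len__(self):
        return len(self.rows)

    def __iter__(self):
        # Attach the shared names to each row's values only on iteration.
        for scope, row in enumerate(self.rows, start=1):
            yield dict(zip(self.field_names, row)), scope

    def project(self, wanted):
        # The result is another relation, so the type is closed under project.
        indexes = [self.field_names.index(name) for name in wanted]
        # A set removes duplicate rows, as set semantics require.
        rows = {tuple(row[i] for i in indexes) for row in self.rows}
        return RelationSketch(wanted, sorted(rows))


people = RelationSketch(
    ("last", "first", "job"),
    [  # invented sample data
        ("jeffries", "ron", "coder"),
        ("smith", "amy", "coder"),
        ("jones", "sam", "tester"),
    ],
)
jobs = people.project(("job",))
first_record, first_scope = next(iter(people))
```

The storage saving comes from the names appearing once per relation instead of once per record. Whether the processing time improves would need measuring.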
Another thing I am tempted to try is some explicitly nested sets, such as a person record with a name element containing first and last, or a contact with address, email, phone. One might even consider a name element with varied contents, since some people’s names do not really parse into first / last.
Another area of importance would be general grouping, such as collecting all the employees into job groups. Grouping is particularly valuable when calculating statistics such as average pay by job title.
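As a plain-Python illustration of the grouping idea, outside the XSet machinery entirely, here is a small sketch that collects pay values by job and averages each group. The function average_pay_by_job, the dict-shaped records, and the sample data are all invented for the example.

```python
from collections import defaultdict


def average_pay_by_job(records):
    """Group records by 'job' and average 'pay' within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["job"]].append(record["pay"])
    return {job: sum(pays) / len(pays) for job, pays in groups.items()}


employees = [  # invented sample data
    {"last": "jeffries", "job": "coder", "pay": 12000},
    {"last": "smith", "job": "coder", "pay": 10000},
    {"last": "jones", "job": "tester", "pay": 9000},
]
averages = average_pay_by_job(employees)
```

In set terms, the intermediate groups would presumably be a set of sets keyed by job, with the statistic computed over each inner set.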
Oh, and here’s what may be a good one. We presently do select by producing the entire selected set. What if, instead, we were to put a filter on top of the source set, with an iterator that just returns one record at a time? Would that be better in any important way?
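One way to picture the filter-on-top idea is a wrapper whose iterator pulls from the source only on demand. This is a sketch, not a proposal for the real select: LazySelect, the records generator, and the reads log are hypothetical names, and the source here is a simple one-pass generator standing in for something expensive like a file.

```python
class LazySelect:  # hypothetical filter layered over any iterable source
    def __init__(self, source, condition):
        self.source = source
        self.condition = condition

    def __iter__(self):
        # Nothing is read or stored until a consumer asks for the next record.
        for record in self.source:
            if self.condition(record):
                yield record


reads = []  # log of every record the source was asked to produce


def records():
    # Stand-in for an expensive source; records each read in the log.
    for n in range(1, 1001):
        reads.append(n)
        yield n


evens = LazySelect(records(), lambda n: n % 2 == 0)
first_three = []
for value in evens:
    first_three.append(value)
    if len(first_three) == 3:
        break
```

Only six of the thousand source records are touched to get the first three matches, and no selected set is ever held in memory. The trade-off is that such a filter cannot answer len without iterating the whole source, which sits awkwardly with the new requirement that every implementation provide __len__.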
Lots to do. I still rather wish I had an application in mind. Having fun tho.
Perhaps it’s time to draw some serious lessons from all this? I think there are some lurking in here somewhere.
See you next time!