FAFO on GitHub

OK, new rule, we’re going to read XFlatFile sets into memory. Should we refactor to this new result, or TDD a new class? Let’s refactor. Works quite well!

After the oh-dark-thirty article, I returned to bed and caught up on my sleep. I’m conscious again now, and I guess the thing to do is to implement this morning’s decision to just read in all the data for XFlatFile instances. Since we have some nifty tests, I think I’ll just refactor to make this work. In a spirit of cooperation, however, I’ll begin with a new test to verify that the data has been read in.

I reckon we’ll have to compute the number of records and save it, since it’s presently detected by reading the end of file.

    def test_full_read(self):
        new_names = XSet.from_tuples((('first', 'first_name'), ('last', 'last_name')))
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        assert len(ff.buffer) == 44000

I think this should be right. Test is failing. No buffer. We code:

    def __init__(self, file_path, fields, scope_set=None):
        self.file_path = file_path
        self.full_file_path = expanduser(file_path)
        self.fields = fields
        field_def = self.fields[-1]
        self.record_length = field_def[-1]
        self.scope_set = scope_set
        with open(self.full_file_path, "r") as f:
            self.buffer = f.read()

Green. Commit: XFlatFile now reads in entire file upon creation.

Extend the test to get the cardinality and cache it.

    def test_full_read(self):
        new_names = XSet.from_tuples((('first', 'first_name'), ('last', 'last_name')))
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        assert len(ff.buffer) == 44000
        assert len(ff) == 1000

This passes already, because of this:

    def __len__(self):
        if self.scope_set is not None:
            return len(self.scope_set)
        return self.file_length_in_records()

    def file_length_in_records(self):
        file_length = stat(self.full_file_path).st_size
        return int(file_length / self.record_length)

Should we cache this?
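If we did choose to cache it, a lazily computed attribute is one simple shape. A minimal sketch, using a hypothetical FixedRecordFile stand-in rather than the real XFlatFile:

```python
# Hypothetical stand-in for illustration; not the article's XFlatFile.
class FixedRecordFile:
    def __init__(self, buffer, record_length):
        self.buffer = buffer
        self.record_length = record_length
        self._length = None  # computed on first use, then remembered

    def __len__(self):
        if self._length is None:
            self._length = len(self.buffer) // self.record_length
        return self._length

ff = FixedRecordFile('x' * 44000, 44)
assert len(ff) == 1000
```

Given that the "computation" is one division, the cache probably buys nothing measurable; it's here only to show the shape.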

Interruption

My brain interrupts me. The code above reflects the fact that an XFlatFile can have a scope set, and if it has one, it is treated as if it contains only the records indicated by the scope set. That makes me think that re-scoping must be creating a new XFlatFile instance, and that instance will reread the file. Let’s check that.

    def re_scope(self, re_scoping_set):
        if self.scope_set is not None:
            re_scoping_set = self.scope_set.re_scope(re_scoping_set)
        re_scoping_set = self.validate_scope_set(re_scoping_set)
        if len(re_scoping_set) == 0:
            return XSet.null
        new_impl = self.__class__(self.full_file_path, self.fields, re_scoping_set)
        return XSet(new_impl)

Yes. We’ll want to change that. Possibly just provide the buffer as a parameter? We’ll table the issue for a moment.

Too many balls in the air. I write a card:

  • len asks for file stat
  • limit reads via len?
  • rescope reads file again

Let’s change the implementation of len first.

I’m waffling. Right now we are reading in the entire file, no limits. What about files that are too large? Might we read them partially? I decide no, not now. We can do this:

    def __len__(self):
        if self.scope_set is not None:
            return len(self.scope_set)
        return int(len(self.buffer) / self.record_length)

Now we can remove the file_length_in_records method as unused. Ah, but it isn’t. Instead, we’ll change it and use it.

    def __len__(self):
        if self.scope_set is not None:
            return len(self.scope_set)
        return self.file_length_in_records()

    def file_length_in_records(self):
        return int(len(self.buffer) / self.record_length)

Green. Commit: compute len using buffer size.

We’re still opening the file and reading a record on every step. Time to fix that.

    def __iter__(self):
        def lots():
            n = 1
            while True:
                yield n, n
                n += 1

        it = iter(self.scope_set) if self.scope_set else lots()
        for old_scope, new_scope in it:
            rec = self.element_at(old_scope)
            if rec is None:
                return
            yield rec, new_scope

    def element_at(self, scope):
        if not isinstance(scope, int) or scope < 1:
            return None
        rec = self.get_record(scope - 1)
        if rec == '':
            return None
        return XSet(XFlat(self.fields, rec))

    def get_record(self, index):
        seek_address = index*self.record_length
        with open(self.full_file_path, "r") as f:
            self.__class__.read_count += 1
            f.seek(seek_address)
            rec = f.read(self.record_length)
        return rec

We happen to know that a read off the end returns an empty string. Let’s emulate that here:

    def get_record(self, index):
        seek_address = index*self.record_length
        if seek_address > len(self.buffer):
            return ''
        else:
            return self.buffer[seek_address:seek_address + self.record_length]
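The boundary works out because a Python string slice that starts at or past the end of the string yields '', just as f.read() does at end of file. A tiny standalone check, with made-up data rather than the job_db file:

```python
buffer = 'abcdefgh'   # pretend: two 4-character records
record_length = 4

def get_record(index):
    # Slicing at or past the end of a str yields '', just as
    # f.read() returns '' at end of file.
    seek_address = index * record_length
    return buffer[seek_address:seek_address + record_length]

assert get_record(0) == 'abcd'
assert get_record(1) == 'efgh'
assert get_record(2) == ''    # off the end: empty string, like f.read()
```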

That breaks a test, this one:

    def test_do_not_waste_memory(self):
        XFlatFile.read_count = 0
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ee = XSet(ff)

        def sel(e, s):
            return (e.includes('jeffries', 'last') and
                    e.includes('ron', 'first') and
                    e.includes('coder', 'job') and
                    e.includes('12000', 'pay'))

        final = ee.select(sel)
        assert final.cardinality() == 1
        employee, scope = final.pop()
        has_jeffries = employee.includes('jeffries', 'last')
        assert has_jeffries
        assert employee.includes('ron', 'first')
        assert employee.includes('coder', 'job')
        assert employee.includes('12000', 'pay')
        assert XFlatFile.read_count == 1017

We no longer update read_count. I suppose we could set it to one but let’s remove it. We are green, although there are tests marked to skip. Commit: XFlatFile just slices out the record after reading entire file in.

Now I’m left with “rescope reads file again”. We can save that read by providing the file buffer to XFlatFile. If we do, we’ll skip the read. Change re_scope:

    def re_scope(self, re_scoping_set):
        if self.scope_set is not None:
            re_scoping_set = self.scope_set.re_scope(re_scoping_set)
        re_scoping_set = self.validate_scope_set(re_scoping_set)
        if len(re_scoping_set) == 0:
            return XSet.null
        new_impl = self.__class__(self.full_file_path, self.fields, re_scoping_set, self.buffer)
        return XSet(new_impl)

This breaks a couple of tests. We aren’t done yet, we knew that.

    def __init__(self, file_path, fields, scope_set=None, buffer=None):
        self.file_path = file_path
        self.full_file_path = expanduser(file_path)
        self.fields = fields
        field_def = self.fields[-1]
        self.record_length = field_def[-1]
        self.scope_set = scope_set
        if buffer is None:
            with open(self.full_file_path, "r") as f:
                self.buffer = f.read()
        else:
            self.buffer = buffer

Green. Commit: re-scope passes existing buffer to new XFlatFile instance.

I think I’d like to see what happens with that test that does all the intersects and such.

    Expected :261746
    Actual   :0

Another read count thing. Which reminds me, I forgot to remove it. References gone, remove it. Green. Commit: remove read_count.

Remember that horrible test with all the intersections? With it enabled, my 159 tests are requiring 420 milliseconds. With it disabled, 158 tests run in … wait for it … 10 milliseconds.

Reflection

Prior to these changes, the tests were requiring about 150 milliseconds. So about 140 of those were consumed by file reads, even just the “few” thousand we were doing. I recall thinking that they were not that significant, but they did add up to over a tenth of a second even without the horrible, not to say hideous, intersection test.

I do not like seeing tests skipped in the test results. Currently there are two, the one with all the intersections, and the one that creates a billion element set. With your kind permission, I will comment them out, so that the green bar is green and not a dull orange color.

Green. Commit: comment out tests to avoid skipped rap.

There’s probably some clever way to isolate that kind of test, ones that we might run on special occasions but that we do not run all the time. Especially the billion record one. I do not know that trick. Commenting will work.
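For the record, if the suite runs under pytest (which the bare asserts suggest), the usual trick is a custom marker plus a command-line flag, wired up in conftest.py. A sketch, adapted from the pytest documentation's "--runslow" pattern; the marker name and flag are my choices, not the project's:

```python
# conftest.py -- skip tests marked 'slow' unless --runslow is given.
import pytest

def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", default=False,
                     help="run tests marked as slow")

def pytest_configure(config):
    config.addinivalue_line("markers", "slow: marks tests as slow")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # run everything
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

Then the intersection test and the billion-record test get `@pytest.mark.slow`, and `pytest --runslow` runs them on those special occasions.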

Summary

With a few simple tests, we have refactored our XFlatFile object to read in the entire file in one go, reducing the number of I/O operations by a factor of 1000 in our simplest current tests, which happen to use a one-thousand-record set. And by a factor of 260,000 in our most complex file-based test.

As proud as we might be of our streaming select method, I think it deserves a rethinking. Its purpose, if I recall, was to avoid making physical copies of sets that were just processed once, instead just returning the individual elements and scopes and not accumulating them in a concrete set first.

The idea was to save memory. There is possibly also a savings of time in some cases, but we have no evidence of that. We have seen, however, that it is possible to build up pipelines of operations that will re-execute the selection, re-processing the larger set to create the subset. With the old scheme, the selection created a concrete collection and subsequent uses of that set just returned those records, not skipping over the unselected ones.
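The re-execution effect is the same one plain Python shows with generators: a virtual pipeline re-runs its predicate on every fresh traversal, where a concrete collection pays once and then just hands back its elements. A small illustration, not the XSet code:

```python
calls = 0

def expensive_keep(n):
    # Stands in for a selection predicate we'd rather not re-run.
    global calls
    calls += 1
    return n % 2 == 0

data = range(10)

# "Virtual" select: a generator runs the predicate as it is consumed,
# and a fresh traversal means filtering the source all over again.
first_pass = list(n for n in data if expensive_keep(n))
second_pass = list(n for n in data if expensive_keep(n))
assert calls == 20        # predicate ran twice over all ten elements

# "Concrete" select: filter once into a list, reuse it freely.
calls = 0
concrete = [n for n in data if expensive_keep(n)]
assert concrete == first_pass == second_pass
assert calls == 10        # predicate ran only once
```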

Options include:

  1. Leave it, it’s mostly harmless although sometimes inefficient;
  2. Revert to always creating a concrete set;
  3. Provide two select operations, one virtual and one concrete;
  4. Provide an operation to force a set to concrete.

Up until now, we have added no operations whose sole purpose is to allow a programmer decision to make a choice of operations based on efficiency. Well, except for all of them, since it is possible, and desirable, to devise a sequence of operations that produces the desired result efficiently. So maybe options 3 and 4 aren’t really that far out of line.

More important, I think, might be the speed of operations like intersection, union, and symmetric difference, which all involve ascertaining whether one set includes an element from another set. I suspect that these operations are best done on sorted elements. We might find a b-tree approach to be useful in some cases: I’ll have to think about that.
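A cheap version of the sorted-elements idea, well short of a b-tree, is binary search plus a merge walk, sketched here on plain lists of keys rather than real XSets:

```python
import bisect

# Membership probe against sorted keys: O(log n) instead of a scan.
names = sorted(['baker', 'able', 'jeffries', 'charlie', 'dog'])

def includes(sorted_keys, key):
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

assert includes(names, 'jeffries')
assert not includes(names, 'hill')

# Intersection of two sorted lists: the classic linear merge walk,
# O(n + m) rather than O(n * m) probing.
def intersect(a, b):
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

assert intersect(['able', 'baker', 'dog'],
                 ['baker', 'charlie', 'dog']) == ['baker', 'dog']
```

Union and symmetric difference fall out of the same merge walk with different keep rules, which is part of why sorted order looks attractive for those operations.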

Lots that we could do. But today we did a thing that’s clearly “good for now” by reading in whole sets from files and processing in memory. Fun!

See you next time!