FAFO on GitHub

It’s 0415 and I am thinking about buffering. I create an XSet with a billion records in it. Python does not sneeze.

It has become clear that I’m not going back to sleep, so let’s do some work.

The basic issue goes something like this:

We only have one kind of file-based XSet. It models a set of fixed-length text records as a set of sets each of which has the same scope set. The scope set will be the field names of the file, and each corresponding element will consist of the corresponding characters in a fixed-length string of text from the file.

The set model is a set of sets, each with the same scopes, with string scopes and string elements. The system does not care whether elements and scopes are strings or numbers: these just happen to be strings because we’re modeling a file.

This scheme is currently supported by the XFlatFile XImplementation, which holds a file name and can stream over fixed-length records in a file. The elements (records) are assumed to be textual data. The XFlatFile includes a symbol table, with “field” names and lengths.

When the set is iterated, the scopes run from 1 to however many “records” are in the file, and each element is produced as an XSet implemented by XFlat, which holds one string sliced from the XFlatFile’s file, producing its elements using the same symbol table.

In the system as it stands, set operations all essentially iterate over the elements of their input sets, considering one element at a time. In the case of XFlatFile-based XSets, we iterate the set by getting the net scope (record number) to be produced, then open the file, read that one record (seek-read), hen close the file.

To iterate a thousand-record file takes a thousand open-seek-read-close cycles. That would require a little more than a millisecond on my laptop.

However, I have devised a test that does three or four select operations on an XFlatFile set, then takes the resulting sets and intersects them three or four different ways, checking the cardinality of each result, coming down, finally, to one record. Works just perfectly.

Yesterday, the select operation was made to stream rather than create a concrete collection of records. If you’re only going to use the selection once, in further processing, this seems to make sense. However, the intersect operation is defined like this:

    def intersect(self, other):
        return XSet.from_tuples((e, s) for e, s in self if other.includes(e, s))

We stream the elements of self and ask whether said element is also in the other set, retaining it if it is. Yep, that’s intersect all right. So, not to put too fine a point on it, because all the selects stream, my test did 261,746 open-seek-read-close cycles. That required three seconds, which I finally noticed because my tests generally run in about 150 milliseconds. (For some reason, this morning, the test is requiring more like five seconds. I am not going to chase that.)

One quick experiment is to leave the file open. That reduces the time from five seconds to three, since all the individual reads are still done. Not particularly surprising.

The question is what to do.

An Experiment

Last night I coded a little experiment, by modifying XFlatFile to have a buffer of some number of records, and to read chunks of that size from the file. If the record requested was in the buffer, slice it out, if not, read another buffer, starting at the record requested. So it runs through the file in chunks of N records.

Just as you would expect, this reduced the I/O by roughly a factor of the buffer size in records. I threw away the code for that, since it was an experiment. The question now …

The Question Now …

The question now is, what should we really do about file-based sets, even if we only ever support, say, flat files and perhaps CSV files? Options include combinations of:

  • Leave files open. Might save 40 percent.

  • Read more than one record at a time. Saves time proportionate to the ratio between number of records per read and total in the set.

  • Create a separate buffering mechanism with s large a number of buffers as possible and do some kind of LRU thing. Could probably reduce any reasonably-sized operation to approximately the speed of a single read of the file.

  • Stop streaming selects, at least for XFlatFiles. We’d still open-seek-read-close, but just once per set.

  • Provide a set operation that makes a set concrete, and use it judiciously in weird situations like our current test.

None of these address what is arguably the biggest issue, the way the intersect works, essentially searching one of the sets in the operation for every element of the other. Options there include:

  • Check to see which side of the intersect is smaller and search that one while reading the other. (Asking a selection for its cardinality runs the selection again.)

  • Freeze the set that is searched, locking it into memory.

  • Sort the sets and do a merge to get the intersection. Note that this can be done for XFlatFile sets without regard to the fields: any sort-merge will do.

  • Do the intersection with some kind of b-tree type approach that processes each set only once.

We might want to revisit the streaming question, deciding when it is helpful and when it is not. At a guess, it is helpful if you’re only going to do it once, and less helpful each additional time you do it.

A Larger Question

What is the in-memory capacity of this program?

I just ran this test:

    def test_memory_capacity(self):
        XFlatFile.read_count = 0
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ee = XSet(ff)
        records = []
        for t in ee:
            records.append(t)
        many = []
        times = 1_000_000
        for copy in range(times):
            for t in records:
                many.append(t)
        assert len(many) == times*len(records)
        assert len(many) == 1_000_000_000
        many_set = XSet.n_tuple(many)
        assert many_set.cardinality() == 1_000_000_000

The record length of that file is 44 characters. The list many contains one billion tuples. Each tuple contains an XSet, with an XFlat implementation containing the symbol table and a unique 44 character record. So there are at least 44 billion characters being stored in the table many.

The final lines in that test create a billion-element n-tuple XSet. The test runs in 15 seconds.

What does this tell me about how to handle sets on files? It seems clear that almost anything we might want to do with this thing can be done in memory. However, we should surely do something to avoid crashing when, inevitably, we run out of memory. That will require some study and thinking.

One Tentative Conclusion

I think we should modify XFlatFile to read in the entire file, or at least, say, up to a million records. Above that we might refuse to load the file.

Fascinating.