Journeys of Happy Discovery

FAFO on GitHub

To process flat files, we want to avoid leaving a file open, and we don’t want to open it a zillion times. Do we have to invent buffering? Perhaps not.

When I sat down at the computer here, I planned to invent buffering. And by the time I wrote the first sentence of the blurb, I could hear a mysterious voice in my head, saying “A performance measurement having shown …”. So, yeah, maybe we don’t need to invent buffering today. We’ll find out. We do need a file, however, because we want to be able to process flat files.

Let’s drive one out. First, I want to make sure that I can create a suitably padded line:

    def test_padded_line(self):
        author = "ron"
        topic = "math"
        student = "dorothy"
        line = f'{author:12s}{topic:12s}{student:12s}'
        assert len(line) == 36
        assert line == 'ron         math        dorothy     '

That passes nicely. I do anticipate an amusing time creating these records on the fly, but I know Python has ways and means. For now, we can readily create a nice flat file with what we’ve got.

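    # Assumes at the top of the test file:
    #   from itertools import product
    #   from os.path import expanduser, isfile
    def test_make_flat_file(self):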
        path = expanduser('~/Desktop/job_db')
        if isfile(path):
            return
        lasts = ["jeffries", "wake", "hill", "hendrickson", "iam"]
        firsts = ["ron", "bill", "geepaw", "chet", "sam", "amy", "janet", "susan", "beyonce", "taylor"]
        jobs = ['serf', 'boss', 'clerk', 'coder', 'architect']
        pays = [9000, 10000, 11000, 12000]
        with open(path, "w") as db:
            for l, f, j, p in product(lasts, firsts, jobs, pays):
                db.write(f'{l:12s}{f:12s}{j:12s}{p:8d}')

This test creates job_db on my desktop with 1000 records in it, if 5*10*5*4 is 1000, which I am assured it is. The file contains only the data, no newlines, and the pay is right-justified just for fun. The test only writes the file if it isn’t there.
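If you'd rather check that arithmetic than take it on faith, here is a minimal sketch. It writes the same records to a throwaway temp file (my assumption, so the sketch leaves the Desktop alone) and divides the file size by the record length. The name RECORD_LENGTH is mine, not something from the tests above.

```python
import os
import tempfile
from itertools import product

RECORD_LENGTH = 12 + 12 + 12 + 8  # three 12-char fields plus an 8-char pay field

lasts = ["jeffries", "wake", "hill", "hendrickson", "iam"]
firsts = ["ron", "bill", "geepaw", "chet", "sam", "amy", "janet", "susan", "beyonce", "taylor"]
jobs = ['serf', 'boss', 'clerk', 'coder', 'architect']
pays = [9000, 10000, 11000, 12000]

# Temp directory is an assumption for the sketch; the article's test uses ~/Desktop/job_db.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'job_db')
    with open(path, 'w') as db:
        for last, first, job, pay in product(lasts, firsts, jobs, pays):
            db.write(f'{last:12s}{first:12s}{job:12s}{pay:8d}')
    size = os.path.getsize(path)
    with open(path, 'r') as db:
        first_record = db.read(RECORD_LENGTH)

# With no newlines and pure ASCII data, size should be an exact multiple of the record length.
record_count, leftover = divmod(size, RECORD_LENGTH)
print(record_count, leftover)
print(repr(first_record))
```

A nonzero leftover would be a hint that some value overflowed its field width, since the format widths pad but do not truncate.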

Breathe

With that in place, we can take a breath, peel our morning banana, which for some reason I have not started yet, and think about our next steps.

We’re working toward an implementation of an XSet that appears to contain records that look like this:

{ hill^last, bill^first, boss^job, 11000^pay }

We “know” how to produce a set like that, with XFlat as its implementation, although we could almost as readily create a standard XFrozen version. I think we want a new class, XFlatFile, that can process one of these files. Implementing its required behaviors will be … interesting.
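To make the idea concrete without leaning on the real classes, here is a rough sketch of turning one fixed-width record into element^scope pairs. It uses a plain frozenset of tuples as a stand-in. FIELDS and record_to_pairs are hypothetical names of mine, and the (name, start, end) layout is only my guess at what a field definition might look like.

```python
# Stand-in for illustration only: plain tuples and a frozenset, not the real XFlat or XSet.
FIELDS = [('last', 0, 12), ('first', 12, 24), ('job', 24, 36), ('pay', 36, 44)]


def record_to_pairs(record, fields=FIELDS):
    """Slice a fixed-width record into a frozenset of (element, scope) pairs."""
    return frozenset((record[start:end].strip(), name) for name, start, end in fields)


record = f"{'hill':12s}{'bill':12s}{'boss':12s}{11000:8d}"
pairs = record_to_pairs(record)
print(sorted(pairs, key=lambda pair: pair[1]))
```

Note that every element comes back as a string here, pay included. Whether the real thing should convert 11000 to a number is a question this sketch does not try to answer.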

Let’s start TDDing. I think the XFlatFile implementation object only needs the path to the file and the field definitions, so this seems a decent start:

    def test_x_flatfile(self):
        path = expanduser('~/Desktop/job_db')
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)

This will drive out the class. After allowing creation of the default abstract methods, it looks like this:

class XFlatFile(XImplementation):
    def __init__(self, file, fields):
        self.file = file
        self.fields = fields

    def __contains__(self, item):
        pass

    def __iter__(self):
        pass

    def __hash__(self):
        pass

    def __repr__(self):
        pass

The test is passing. No surprise, it doesn’t do anything yet.

Let’s see. We want to iterate the set and receive a series of XSets with XFlat implementations, one for each record.

I think I should have made a smaller file. We’ll not worry about that for now.

In fact, we’ll code our iterator to assume for a while that there is only one record in the file.

    def test_x_flatfile(self):
        path = expanduser('~/Desktop/job_db')
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        ff_set = XSet(())
        ff_set.implementation = ff
        for record in ff_set:
            assert record.includes('serf', 'job')

Once we make this work, it will fail quickly: only the first four records are ‘serf’, then it switches to ‘boss’.

Test fails. I assume it’s lack of an iterator. Right:

>       for record in ff_set:
E       TypeError: iter() returned non-iterator of type 'NoneType'

Write one. How do we do this? I think we actually have to build an iterator class. I pull the following out of empty space and/or the internet:

class XFlatFileIterator:
    def __init__(self, flat_file):
        self.file = flat_file
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.index < 1:
            rec = self.file.get_record(self.index)
            flat = XFlat(self.file.fields, rec)
            flat_set = XSet(())
            flat_set.implementation = flat
            return flat
        else:
            raise StopIteration

I think that if I had a record, this would almost work. Let’s try this:

    def __next__(self):
        if self.index < 1:
            # rec = self.file.get_record(self.index)
            rec = 'jeffries    ron         serf            9000'
            flat = XFlat(self.file.fields, rec)
            flat_set = XSet(())
            flat_set.implementation = flat
            return flat
        else:
            raise StopIteration

I just copied and pasted in one of the records from the job_db. Test still fails. Why?

        for record in ff_set:
>           assert record.includes('serf', 'job')
E           AttributeError: 'XFlat' object has no attribute 'includes'

I returned the flat not the flat_set. Duh. With that fixed, the test loops forever (not failing). I didn’t increment index.

    def __next__(self):
        if self.index < 1:
            # rec = self.file.get_record(self.index)
            rec = 'jeffries    ron         serf            9000'
            flat = XFlat(self.file.fields, rec)
            flat_set = XSet(())
            flat_set.implementation = flat
            self.index += 1
            return flat_set
        else:
            raise StopIteration

My test passes! Whew!

Reflection

I pulled a lot of code out of the air there, and when I got to needing the get_record, it was just too much. I needed a smaller step, so just returning one record was a nice trick to provide that smaller step.

We can commit this: our tests are green and our implementation is weak but passes all its tests. Commit: initial XFlatFile and supporting cast.

Now we can settle down and think about what we’ve got and what we need.

First of all, it is well past time to change XSet so that we can directly create sets with other than XFrozen implementations. Our tests are all patching in the implementation. That just won’t do. That change isn’t really on the current path, so to undertake it, I’d have to flush my buffers on the flat stuff and refocus. Then I’d have to come back to the flat stuff. I think we’ll defer this change until after we get further on the flats.

As for the flats, if we could read the n-th record of the file, we’d be golden. I think what we’ll do about that is to open the file on every get_record call, read one record, close the file. It’ll be slow, but how slow will it be? If it’s good enough for now … we’ll leave it that way until it isn’t.

We’ll need the record length, which is easy to get: I think it’s the last number in the symbol table. We’ll open the file for random access, seek to record number times length, read length characters, return what we got.
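Here is the arithmetic spelled out as a small sketch. I'm assuming the symbol table is a list of (name, start, end) entries, which is a guess on my part, and make_fields is a hypothetical helper, not the real XFlat.fields.

```python
def make_fields(name_width_pairs):
    """Turn (name, width) pairs into (name, start, end) entries, a guess at the symbol table."""
    fields = []
    start = 0
    for name, width in name_width_pairs:
        fields.append((name, start, start + width))
        start += width
    return fields


fields = make_fields([('last', 12), ('first', 12), ('job', 12), ('pay', 8)])
record_length = fields[-1][-1]    # the end of the last field is the length of the whole record
seek_address = 5 * record_length  # where record number 5 (counting from zero) begins
print(fields)
print(record_length, seek_address)
```

If the table really looks like that, the last entry's end offset is the record length, and finding record n is just a multiplication.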

Let’s try it.

get_record

class XFlatFile(XImplementation):
    def __init__(self, file, fields):
        self.file = file
        self.fields = fields
        field_def = self.fields[-1]
        self.record_length = field_def[-1]

    def get_record(self, index):
        seek_address = index*self.record_length
        with open(self.file, "r") as f:
            f.seek(seek_address)
            rec = f.read(self.record_length)
        return rec

I kind of expect that to work. Change the implementation of __next__:

    def __next__(self):
        if self.index < 1:
            rec = self.file.get_record(self.index)
            # rec = 'jeffries    ron         serf            9000'
            flat = XFlat(self.file.fields, rec)
            flat_set = XSet(())
            flat_set.implementation = flat
            self.index += 1
            return flat_set
        else:
            raise StopIteration

The test passes! We read the correct record. Let’s change the test so that it will fail and give us more info.

    def test_x_flatfile(self):
        path = expanduser('~/Desktop/job_db')
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        assert ff.record_length == 44
        ff_set = XSet(())
        ff_set.implementation = ff
        for record in ff_set:
            assert record.includes('serf', 'job')
            assert record.includes('wake', 'last')

Fails as advertised, saying:

>           assert record.includes('wake', 'last')
E           AssertionError: assert False
E            +  where False = <bound method XSet.includes of 
XSet(XFlat('jeffries    ron         serf            9000'))>
('wake', 'last')
E            +    where <bound method XSet.includes of 
XSet(XFlat('jeffries    ron         serf            9000'))> 
= XSet(XFlat('jeffries    ron         serf            9000')).includes

The message is horrid, but clearly it sees the right record.

We need to commit, and to reflect, and to deal with end of file. Commit: can read and check one record.

Reflection

Why is it OK to commit these fractional implementations of XFlatFile and its friends? Simple: the classes are not yet used in production code, so it is safe to ship production code that contains them.

As for end of file, we need to see what is returned from a read past the end of a file. Docs tell me it will return an empty string.
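A quick experiment is cheap, so here is a minimal sketch of one. It uses a tiny temp file with three 4-character records (all made up for the sketch) and reads at the end, past the end, and partway into the last record.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'tiny_db')
    with open(path, 'w') as db:
        db.write('abcd' * 3)  # three 4-character records, no newlines
    with open(path, 'r') as db:
        db.seek(3 * 4)        # position exactly at the end of the file
        at_end = db.read(4)
        db.seek(10 * 4)       # position well past the end
        past_end = db.read(4)
        db.seek(2 * 4 + 2)    # position in the middle of the last record
        partial = db.read(4)

print(repr(at_end), repr(past_end), repr(partial))
```

The third read is the one to watch: a read that starts inside the file but runs off the end returns a short string, not an empty one. A truncated final record would therefore slip past a check that only looks for the empty string.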

I think we’ll improve our test and see about reading the whole file.

    def test_x_flatfile(self):
        path = expanduser('~/Desktop/job_db')
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        assert ff.record_length == 44
        ff_set = XSet(())
        ff_set.implementation = ff
        count = 0
        for record in ff_set:
            count += 1
        assert count == 1000

And …

    def __next__(self):
        rec = self.file.get_record(self.index)
        if rec == '':
            raise StopIteration
        else:
            self.index += 1
            flat = XFlat(self.file.fields, rec)
            flat_set = XSet(())
            flat_set.implementation = flat
            return flat_set

The test passes. We have read 1000 records and converted them to XFlat XSets. The tests run for maybe 2 seconds, probably less.
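That two seconds is the whole test run, not a measurement of the file reads. If the mysterious voice wants an actual performance measurement, something like this sketch might do. It times 1000 open-seek-read-close cycles against a temp file whose records are just padded numbers (my invention, to keep the sketch self-contained). The get_record here is a free-standing copy of the same idea, not the method on XFlatFile.

```python
import os
import tempfile
import time

RECORD_LENGTH = 44
RECORD_COUNT = 1000


def get_record(path, index):
    """Open, seek, read one record, close: same shape as the article's get_record."""
    with open(path, 'r') as f:
        f.seek(index * RECORD_LENGTH)
        return f.read(RECORD_LENGTH)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'timing_db')
    with open(path, 'w') as db:
        for n in range(RECORD_COUNT):
            db.write(f'{n:<{RECORD_LENGTH}d}')  # each record is its own number, space-padded
    start = time.perf_counter()
    records = [get_record(path, n) for n in range(RECORD_COUNT)]
    elapsed = time.perf_counter() - start

print(f'{len(records)} reads, one open each: {elapsed:.4f} seconds')
```

I'm deliberately not quoting a number: it will vary by machine and file system, and the point is to measure before deciding whether buffering is worth inventing.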

Tests are green. Commit: XFlatFile iterates entire file.

Reflection

I am tiring. Since we’re green, we can stop any time. Let’s see where we stand. We’ll do a quick scan of the code to see what we can see. I am confident that there is plenty to see.

class XFlatFileIterator:
    def __init__(self, flat_file):
        self.file = flat_file
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        rec = self.file.get_record(self.index)
        if rec == '':
            raise StopIteration
        else:
            self.index += 1
            flat = XFlat(self.file.fields, rec)
            flat_set = XSet(())
            flat_set.implementation = flat
            return flat_set


class XFlatFile(XImplementation):
    def __init__(self, file, fields):
        self.file = file
        self.fields = fields
        field_def = self.fields[-1]
        self.record_length = field_def[-1]

    def __contains__(self, item):
        pass

    def __iter__(self):
        return XFlatFileIterator(self)

    def __hash__(self):
        pass

    def __repr__(self):
        pass

    def get_record(self, index):
        seek_address = index*self.record_length
        with open(self.file, "r") as f:
            f.seek(seek_address)
            rec = f.read(self.record_length)
        return rec

We have not yet implemented all the critical methods for XFlatFile. The most interesting will be __contains__, which certainly implies an iteration through the entire file, but arguably there could be a default higher-level version that creates all the records as XSets and deals with includes there. Maybe not.
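To see what the brute-force version might look like, here is a sketch with a stand-in class. FlatFileSketch is hypothetical, yields plain tuples of stripped field values instead of XSets, and ignores scope entirely, so treat it as the shape of the idea and nothing more.

```python
import os
import tempfile


class FlatFileSketch:
    """Hypothetical stand-in for XFlatFile: yields stripped field tuples, not XSets."""

    def __init__(self, path, widths):
        self.path = path
        self.widths = widths
        self.record_length = sum(widths)

    def get_record(self, index):
        with open(self.path, 'r') as f:
            f.seek(index * self.record_length)
            return f.read(self.record_length)

    def parse(self, record):
        values, start = [], 0
        for width in self.widths:
            values.append(record[start:start + width].strip())
            start += width
        return tuple(values)

    def __iter__(self):
        index = 0
        while (record := self.get_record(index)) != '':
            yield self.parse(record)
            index += 1

    def __contains__(self, item):
        # Linear scan: stops at the first match, reads the whole file on a miss.
        return any(candidate == item for candidate in self)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'people')
    with open(path, 'w') as db:
        for last, job in [('jeffries', 'serf'), ('wake', 'boss'), ('hill', 'clerk')]:
            db.write(f'{last:12s}{job:12s}')
    people = FlatFileSketch(path, [12, 12])
    found = ('wake', 'boss') in people
    missing = ('wake', 'serf') in people

print(found, missing)
```

The miss is the expensive case, since it opens and reads every record before giving up.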

What we have here, though we call it an XImplementation … maybe there’s another layer trying to appear, not a data structure but a storage one. I don’t think I can even explain this feeling clearly just now, but as written, XFlatFile returns flat records. They get converted to XFlat-style XSets by the code in the iterator.

I think that the XSet creation needs to be in XFlatFile, not in its iterator, and the iterator’s stopping condition should probably be some other kind of return, not the empty string. XFlatFile produces an XSet record and the iterator just returns it. I think that’s more in the proper spirit.

Then, perhaps, under all that, we would put our actual file accessing logic, including any buffering that we might one day do.
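Here is one way that layering might look, purely as a sketch. RecordStore and FlatFileLayer are hypothetical names, records come back as dicts, not XSets, and None stands in for "no such record". I've also assumed ASCII data and switched to binary mode, because Python's documentation says the only officially valid text-mode seek offsets are zero and values returned by tell().

```python
import os
import tempfile


class RecordStore:
    """Hypothetical bottom layer: knows only about bytes, offsets, and the file."""

    def __init__(self, path, record_length):
        self.path = path
        self.record_length = record_length

    def read(self, index):
        """Return record number index as text, or None if there is no such full record."""
        with open(self.path, 'rb') as f:  # binary mode: seek offsets are plain byte counts
            f.seek(index * self.record_length)
            raw = f.read(self.record_length)
        if len(raw) < self.record_length:
            return None
        return raw.decode('ascii')  # assumes the file holds only ASCII


class FlatFileLayer:
    """Hypothetical middle layer: turns raw records into record objects (dicts here)."""

    def __init__(self, store, fields):
        self.store = store
        self.fields = fields  # (name, start, end) triples

    def record_at(self, index):
        raw = self.store.read(index)
        if raw is None:
            return None
        return {name: raw[start:end].strip() for name, start, end in self.fields}

    def __iter__(self):
        # A generator, so no separate iterator class and no index to forget to increment.
        index = 0
        while (record := self.record_at(index)) is not None:
            yield record
            index += 1


fields = [('last', 0, 12), ('job', 12, 24)]
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'people')
    with open(path, 'w') as db:
        db.write(f"{'wake':12s}{'boss':12s}{'hill':12s}{'clerk':12s}")
    records = list(FlatFileLayer(RecordStore(path, 24), fields))

print(records)
```

Any buffering we might one day do could live entirely inside RecordStore.read, and nothing above it would need to know.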

Oh! What about the scope of the record in the flat file? We’re really supposed to iterate on element, scope, not just element, which is what our test does now. That needs to be addressed.
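One plausible choice, and it is only my assumption at this point, is to use the record's position in the file as its scope. A tiny sketch of iterating element, scope pairs that way, with records_with_scope as a hypothetical helper:

```python
def records_with_scope(records, first_scope=1):
    """Yield (element, scope) pairs, using the record's position as its scope.

    Numbering from 1 is an arbitrary assumption; nothing in the article settles it.
    """
    for scope, element in enumerate(records, start=first_scope):
        yield element, scope


raw_records = ['jeffries    ron', 'wake        bill', 'hill        geepaw']
pairs = list(records_with_scope(raw_records))
print(pairs)
```

Whether the scope should be the record number, or something else entirely, is exactly the question that needs addressing.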

OK. These observations actually give me a bit of comfort. I am not comfortable with this implementation, but now I know some things that need improvement, and when those are done, I expect to be better able to see whether there is more to be uncomfortable about. I can chalk the current discomfort up to the concrete issues above, leaving the more whifty ones to either dissipate or become more clear.

Summary

This went well. Writing the iterator, even with the help of the internet, was a big chunk, so I’m pleased that I decided to fake the file read and just return a single fixed record. That bridged the test gap, got us to green, and enabled us to commit, giving us a reasonably solid place to stand for our next move.

Instead of going in one big leap, we went in a couple of smaller steps. Always good.

But this code is a crock!

Yes, this code is a crock. It is woefully incomplete. Some of the things it does are done in places they shouldn’t be. It doesn’t implement the full required interface. How dare we commit it!

It’s not daring at all. The code passes its tests, we are on top of what it does and its flaws, and are fully ready to improve it next time.

But what if it’s all wrong?

Well, it isn’t all wrong, but if it were, we have a few tests that encapsulate some of our understanding of what it has to do, so worst case we get to remove these classes and write new ones. Git will take care of removing what we don’t want and adding what we do. There is absolutely no need to have everything perfect before we notch a step forward in our repo. All that is necessary is that all the tests run. (And, of course, it’s nice that there is no production code, untested, that’s going to use this thing we just invented, but there can’t be any now, and after we commit, the team will know not to use the thing until we do our little victory dance.)

It is absolutely true, and I freely declare it, that we were feeling our way forward, figuring out how to write out a record, figuring out how to read randomly, and so on. You would probably not need to figure out so much, because you’ve been reading and writing files all your life. If you had been here, we’d have paired and it would have gone even better than it did. But get real, the only bad thing that happened was that I forgot to increment the record number and the tests looped.

To me, the real lesson is that these small steps work amazingly well. Yes, of course we’re thinking and planning and imagining how things might go and how they might be done. But we don’t have to see it all: we just have to see an approximate next step.

It makes programming a journey of happy discovery rather than a dull slog. I love working this way!

See you next time!