FAFO on GitHub

I’m working on the idea of an XFlatFile that only reads a defined subset of the file. Am I working without a net? Not really.

Hello, friends!

Intro — a particular efficiency

Over the past couple of sessions, we have mentioned that one “promise” of XST is the possibility of performing certain operations more efficiently. An XST proponent would likely say that this is a result of the precision and mathematical character of XST, with its operations that all start with extended sets and produce extended sets, making a sort of algebra of sets. Since everything we produce is a set, everything we produce is able to be acted on with the functions we have in hand, rarely requiring any specialized code.

That’s the story, at least. We are here to implement XST, not necessarily to praise it, and certainly not to sell it.1

Anyway, we have recently been thinking about a vague but interesting possibility for optimization. Suppose that we had a large set, probably file-based. And suppose from time to time, we execute restrict operations on that set, returning records with certain specific desired field values. Such an operation requires a full scan of this large file-based set.

So suppose that instead of restricting from desired values to records, our restrict returned the record numbers of the matching records. Well, on the one hand, we wouldn’t yet have the answer we wanted … but on the other hand, getting that answer would now involve nothing more than randomly accessing those presumably few records.

And our sets are immutable. We could save some structure recording the sets and values involved, and when that query inevitably appeared again, we could find our set of record numbers and again read them randomly, at lower cost than repeating the search.

That would be good.

So that’s what we’re working on. Now shortly we’ll talk about what I did yesterday while you weren’t watching, and then see where we go from here. But fist I want to talk about working in the dark.

Speculation — that’s not design!

All of this, and what we talked about over the past few days, and what I’m about to describe are all just speculation. Surely they are not really designL we don’t really know how we’re going to do any of this. Where are our twenty-seven color glossy pictures with the circles and arrows and a paragraph on the back of each one explaining what each one is?

Shouldn’t we be doing more design? Shouldn’t we have a concrete plan and shouldn’t we be following that plan? My answers are maybe, no, and definitely no.

Now when you work, I recommend that you plan as much as you need, make your plans as concrete as you need, and follow them as carefully as you need. The tricky bit is “as you need”. The needs of the team here chez Ron are not your needs, they are mine. And I have spent over two decades programming with less and less up front planning, fewer and fewer concrete plans, and with less and less commitment to whatever plans I had.

And my programming is not under any kind of deadline.

That does change the “as you need”. Still, I’d bet that most teams could do well with somewhat less planning before programming, with someone less concrete plans, and with somewhat less adherence to the details of the plans they have.

The thing is this: all plans are speculative. We do not know the future, we do not remember the code precisely, we certainly do not remember the code we haven’t even written yet. So we will do best to hang a bit loose. Every team, every individual, gets to decide what that means to them.

Darkness — looking into it

I do not know how we’ll get the record numbers we need. I do not know how we’ll create a set that looks at a file and produces just the corresponding records. I do not know how we can represent saved record number sets, nor how we’ll recognize the ones we can reuse when we need them.

But I have ideas that I think are pretty good:

  1. We can write a restrict that gets record numbers, or we can use scope_set to get the numbers from an existing set;
  2. We have an object that reads all the records. It uses an increasing integer 1, 2, and so on. We can probably have one that works from some other set of integers;
  3. It seems clear that we can build some kind of set or table that records operation and parameters and result. Somehow. We’ve built a lot of tables in our day.
  4. All our operations on data come own to set operations with parameters. Should be possible to look in the table above and find whether the answer is there.

Vague? You betcha. Solid enough to give me confidence? Yes. And if I’m wrong? It seems that I’ll be wrong on details, not the overall possibility. And I’ll work, as much as I can, on the bit most likely to fail, so that if it all falls apart, it’ll be sooner than later.

I have over six decades of programming experience. If I had six years, or six months, or six weeks, I would need to do differently. And, I hope to all the gods that if I only had a little experience, I’d be facing smaller problems with lower risk, and that I’d have trusted colleagues to work with.

How much planning should those imaginary people do? I’m not sure, but I wonder: against this kind of problem, how good are the plans of a team with only six years of experience? Should we really require them to make solid plans and stick to them? Seems unlikely to me.

Every team, I guess, needs to plan as much as they think they need, or a little less, make those plans as concrete as they think they need, or a little less, and stick to their plan as much as they think they need — or, probably, quite a bit less.

Report — what I did yesterday

Recall that our XFlatFile has an iterator, XFlatFileIterator, that look(ed) like this:

class XFlatFileIterator:
    def __init__(self, flat_file):
        self.file = flat_file
        self.scope = 1

    def __iter__(self):
        return self

    def __next__(self):
        element_tuple = (rec := self.file.element_at(self.scope), self.scope)
        self.scope += 1
        if rec is None:
            raise StopIteration
        else:
            return element_tuple

That just starts at scope 1 (because ordered n-tuples start at 1) and keeps incrementing and reading the next record every time someone calls next.

This morning, it looks like this.

class XFlatFileIterator:
    def __init__(self, flat_file):
        def lots():
            n = 1
            while True:
                yield n
                n += 1
        self.file = flat_file
        self.scope_gen = lots()

    def __iter__(self):
        return self

    def __next__(self):
        scope = next(self.scope_gen)
        element_tuple = (rec := self.file.element_at(scope), scope)
        if rec is None:
            raise StopIteration
        else:
            return element_tuple

What’s with lots, and what else is different?

The function lots:

        def lots():
            n = 1
            while True:
                yield n
                n += 1

That’s a “generator function” and in use it will produce sequential integers, starting with 1, until hell freezes over.

We save it in the member variable scope_gen. And then we refer to it in __next__, asking it for the next scope to look for (1, 2, …) and then attempt the read just as we used to.

This passes all the tests. I do suspect that I can write a test that it will not pass, but I am also hopeful that I can fix it. If not, well, I’m not sure what we’ll do. Probably not give up. I find iterators weird and it may take a few tries.

Why did you do this? It was working fine.

Yes, it was working fine. I did it because I’m thinking of a generator kind of thing that produces, not the integers from 1 to whenever, but just the integers corresponding to the records we want. So I’m working toward being able to pass in other kinds of generators or iterators, to read a few records from a large flat file-based set.

Can you break it?

I think a generator function can only be used once. So I think we can write a test that will fail.

We have this test:

    def test_x_flatfile(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        assert ff.record_length == 44
        ff_set = XSet(ff)
        count = 0
        record_number_sum = 0
        for record, record_number in ff_set:
            count += 1
            record_number_sum += record_number
        assert count == 1000
        assert record_number_sum == 500500

I suspect that if we try to iterate twice, it’ll fail. New test:

    def test_iterate_twice(self):
        path = '~/Desktop/job_db'
        fields = XFlat.fields(('last', 12, 'first', 12, 'job', 12, 'pay', 8))
        ff = XFlatFile(path, fields)
        assert ff.record_length == 44
        ff_set = XSet(ff)
        count = 0
        for _e, _s in ff_set:
            count += 1
        assert count == 1000
        count = 0
        for _e, _s in ff_set:
            count += 1
        assert count == 1000

The test passes! I am surprised. But let’s think: we produce a new iterator on each iteration of the XSet:

class XSet:
    def __iter__(self):
        return self.implementation.__iter__()

class XFlatFile(XImplementation):
    def __iter__(self):
        return XFlatFileIterator(self)

We never reuse the iterator! Perfect! There’s one concern we can set aside.

Now let’s see if we can write a test where we use a different way of generating the record numbers. We’ll proceed in tiny steps, first using a large ball-peen hammer and then smaller ones. First commit: test can iterate twice.

Summary

This is a good breaking point. What follows in the next article was written contemporaneously and at the same time as what is above. Well, all in one session anyway. In the next article we’ll accomplish what we’ve been aiming at, the ability to read a subset of the records of a flat file, controlled by a scope set of integer record numbers.

With what’s above, we were just shining a bit of light right around us, learning a bit about how to plug in a generator, and guessing as to where it might go. Our guess was pretty good as you’ll see.

Here, the point is, if we have one, that small steps and a reasonable but tentative plan for the next few steps can work out just fine. We’ll see that happen next time.

What does that mean for you, my reader, if you exist? You get to decide. I’d recommend that you experiment with flexible plans and see what happens. If they always work out, perhaps dial it back a bit more. But it’s all up to you.

See you next time for the thrilling outcome of our attempt to read random records!



  1. That said, the truth is that I think XST is an approach that has not received the attention it deserved. The reasons for that are lost in time, but have, in part, to do with the way it has been presented to the potential market. Anyway, we’re here now, not back then.