FAFO 4

Too far is far enough. Not quite too far isn’t.

As of yesterday afternoon’s short note, I have a “database” folder containing 1000 tagged files, with cryptic file names like t-20240129083200.010Z_author-ron_student-alice_topic-code.curry. I propose to rewrite it this morning, for the not very good reason that I want to put the file name in the file, just in case we ever look at it.

Here’s the test that writes the folder:

    def test_make_database(self):
        # assert False
        database = expanduser("~/programming/database")
        if exists(database):
            print("not writing")
        else:
            print("writing")
            names = self.make_filenames()
            os.mkdir(database)
            for name in names:
                lorem = TextLorem(srange=(7,10), prange=(4, 8), trange=(5, 9))
                # srange = number of words in sentence
                # prange = number of sentences in para
                # trange = number of paras in text
                doc = lorem.text()
                full_name = f"{database}/{name}"
                with open(full_name, "w") as db_file:
                    db_file.write(doc)  # doc is a single string, so write() is the direct call
        # assert False

The assert at the front is there to stop the test from running while I type, because it only writes the file if it doesn’t exist, so I’ve found it useful to take more control of when it runs. The assert at the end is there to let me check results if I want to. That is my standard trick for a test written so that I can observe the results rather than check them with code.

I think I want the elapsed time.

    def test_make_database(self):
        # assert False
        start = datetime.now()
        database = expanduser("~/programming/database")
        if exists(database):
            print("not writing")
        else:
            print("writing")
            names = self.make_filenames()
            os.mkdir(database)
            for name in names:
                lorem = TextLorem(srange=(7,10), prange=(4, 8), trange=(5, 9))
                # srange = number of words in sentence
                # prange = number of sentences in para
                # trange = number of paras in text
                doc = name + "\n" + lorem.text()
                full_name = f"{database}/{name}"
                with open(full_name, "w") as db_file:
                    db_file.write(doc)  # doc is a single string, so write() is the direct call
        elapsed = datetime.now() - start
        print(elapsed)
        assert False

Run time: 0:00:00.279751

Quarter of a second to write 1000 files of about 3K each. That’ll do, pig.

OK, commit: tweaking DB contents. timing. < 0.3 sec.

Reflection

Tonight, Tuesday, is FGNO¹, so I want to be ready to present what I have to the gang. I think there’s just one more important thing to do, and that will be to use some “user-supplied” tags to select the file(s) that match that set of tags. The main things I want from that are just two: it works, and it takes very little time. We’ll drive that with a test.

But I also want my own learning organized, and to be prepared to draw conclusions. Recall that the sole point of this weird idea, really, was to simplify the document / tag storage issue to the point that we could stop thinking about what database to use and instead think about what kinds of storing and retrieving logic we want. So I need to summarize that a bit.

Let’s do some retrieval.

Retrieval

What is the problem? The problem is, given some name-value tags, determine the names of all the files matching those tags. Now recall what the file names look like:

t-20240129083200.010Z_author-ron_student-alice_topic-code.curry

Arguably the simplest thing here would be a simple string search. If that’s the case, have I gone too far with my experiment with tag sets? Quite possibly yes. Is that bad? Quite probably no.
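For comparison, here is a minimal sketch of the other end of the scale: parsing a file name back into name-value tags and matching on those. It assumes names are joined with underscores, that each tag is name-hyphen-value, and that the extension follows the last dot. The function names `parse_tags` and `matches` are made up for this illustration and are not the tag-set code from the earlier articles.

```python
import os


def parse_tags(file_name):
    """Split a tagged file name into a dict of tag name -> tag value."""
    # splitext cuts at the last dot, so the dot inside the timestamp survives.
    stem, _extension = os.path.splitext(file_name)
    tags = {}
    for pair in stem.split("_"):
        # Split on the first hyphen only, in case a value contains one.
        name, value = pair.split("-", 1)
        tags[name] = value
    return tags


def matches(file_name, wanted):
    """True if every wanted name-value pair appears exactly in the file name."""
    tags = parse_tags(file_name)
    return all(tags.get(name) == value for name, value in wanted.items())


example = "t-20240129083200.010Z_author-ron_student-alice_topic-code.curry"
print(parse_tags(example))
print(matches(example, {"author": "ron", "topic": "code"}))
```

The trade-off is the usual one. A plain substring test is less code, but asking for "author-ron" would also accept a hypothetical "author-ronald", while the parsed version compares whole values.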

Let’s do some tests. At this instant I don’t even know how to get the names of the files in a folder.

class TestDatabase:
    def test_listdir(self):
        database = expanduser("~/programming/database")
        list = os.listdir(database)
        print(list)
        assert False

That works as expected, so I recast it thus:

class TestDatabase:
    def test_listdir(self):
        database = expanduser("~/programming/database")
        all_files = os.listdir(database)
        assert len(all_files) == 1000

I think I’m working toward a small class named Database, or maybe DocumentBase, but probably not DBase.

Let’s do some simple string searches.

    def test_find_hill_math(self):
        all_names = self.get_all_filenames()
        hill = [name for name in all_names if "author-geepaw" in name]
        math_hill = [name for name in hill if "topic-math" in name]
        hector = [name for name in math_hill if "student-hector" in name]
        assert len(hill) == 100
        assert len(math_hill) == 10
        assert len(hector) == 1

    def get_all_filenames(self):
        database = expanduser("~/programming/database")
        all_files = os.listdir(database)
        return all_files

Grab the names, select author, topic, and student. (There is an issue here.) By construction, there are ten of each element so we start with 1000, winnow down to 100, 10, and 1.
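The 1000, 100, 10, 1 arithmetic can be checked without touching the disk. This is a sketch, not the code that built the real folder: `make_filenames` isn't shown here, so the filler names and the single shared timestamp below are assumptions. Only geepaw, ron, hector, alice, math, and code come from the text.

```python
from itertools import product

# Two known values of each kind, plus eight made-up fillers, for ten of each.
authors = ["ron", "geepaw"] + [f"writer{i}" for i in range(3, 11)]
students = ["alice", "hector"] + [f"pupil{i}" for i in range(3, 11)]
topics = ["code", "math"] + [f"subject{i}" for i in range(3, 11)]

# One name per (author, student, topic) combination: 10 * 10 * 10 = 1000.
all_names = [
    f"t-20240129083200.010Z_author-{a}_student-{s}_topic-{t}.curry"
    for a, s, t in product(authors, students, topics)
]

# The same three substring filters as the test above.
hill = [name for name in all_names if "author-geepaw" in name]
math_hill = [name for name in hill if "topic-math" in name]
hector = [name for name in math_hill if "student-hector" in name]

print(len(all_names), len(hill), len(math_hill), len(hector))  # 1000 100 10 1
```

Each filter fixes one of the three tags, so each pass keeps one tenth of what it was given.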

How long does this take, counting the time to read the list? Let’s find out.

    def test_selections_per_second(self):
        start = datetime.now()
        for i in range(1500):
            all_names = self.get_all_filenames()
            hill = [name for name in all_names if "author-geepaw" in name]
            math_hill = [name for name in hill if "topic-math" in name]
            hector = [name for name in math_hill if "student-hector" in name]
        elapsed = datetime.now() - start
        print(elapsed, elapsed.seconds, elapsed.microseconds)
        seconds = elapsed.total_seconds()
        assert seconds < 1.0

I left the test at 1500 selections in less than a second. It can actually do 1800 but not 2000.

So if 2000 students all hit enter at once, our response time will be about one second. Close enough for now.
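Here is the back-of-the-envelope arithmetic behind that, as a small sketch. It assumes the requests are handled one at a time, like the loop in the test, and that the measured rate of 1800 per second holds up under load, which nothing here has verified.

```python
selections_per_second = 1800   # measured rate reported above
simultaneous_requests = 2000   # the "everyone hits enter at once" case

# Cost of one directory read plus three filters, in milliseconds.
ms_per_selection = 1000 / selections_per_second

# If requests queue up and run serially, the last one waits for all of them.
worst_case_wait = simultaneous_requests / selections_per_second

print(f"{ms_per_selection:.2f} ms per selection")                # 0.56 ms per selection
print(f"{worst_case_wait:.2f} s for the last request in line")   # 1.11 s for the last request in line
```

So "about one second" is the wait for the unluckiest student, and the first in line sees well under a millisecond.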

I am inclined to stop here, sum up, and then think a bit before our Zoom meeting.

Summary

It seems semi-clear that the main “information” here is that we can write 1000 weirdly named files in less than 0.3 seconds, and we can read from a directory and search 1000 file names at a rate of at least 1800 per second. So that could have been figured out in a morning, but we have four FAFO articles. We must have learned more than that.

Did we go too far?

I delved clear down into something approaching set theory, a trap into which I sometimes fall. But out of it, I have a start at an abstraction for a name+value tag, and some operations that can process them. Still, it may have been too far.

How far is too far?

Good question. In advance, we cannot know. In practice, we can time-box efforts like this. I had a built-in time box of “ready with something by FGNO”, and anyway I work in short sessions, so I can’t really fall too deeply into a rat hole. But I would argue that we can’t know how far is too far unless we go too far. Otherwise, we only know that we haven’t gone too far, but we don’t know if we have gone far enough.

Always done.

This is why we work in small time periods, and why we work to commit our code many times during those periods. In a sense, we are always done. We have always taken the best small step toward our goal that we were able to think of. Some of those steps, in retrospect, won’t be the best, but retrospect is easy. No points for that, but do try to remember for next time.

Not far enough.

There is one notion that is coming up for me. I’m starting to wonder about updating. Suppose we wanted to add a new student, Alex. We would need to identify the items they should be signed up for, each of which would have lots of tags. Unlike our example tests, we’d have a number of documents with lots of students attached to each one. And for each of those, we’d want to add the tag “student-alex”. Should we remove the old records that didn’t have alex? The new ones would have newer timestamps, and presumably everything we would generally do would return the newest timestamp (in the normal case: doubtless there’d be ways to get older versions).

But if we did that, we’d write some large number of copies of whole documents, unchanged, back to the database. There’d be all the ones without Alex, and then the same number with Alex, each pair containing the same text.

So possibly, we want a two-level store, with each version of a given document saved in one collection, time-stamped and given some single document name, and then a second level of store that associates lists of tags with documents.
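To make the two-level idea concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: the class name `TwoLevelStore`, its methods, and the document name "loops-intro" are invented for illustration, and it does not record when tags changed, which a real version would probably want.

```python
from dataclasses import dataclass, field


@dataclass
class TwoLevelStore:
    # Level one: document name -> list of (timestamp, text), oldest first.
    versions: dict = field(default_factory=dict)
    # Level two: document name -> set of (tag name, tag value) pairs.
    tags: dict = field(default_factory=dict)

    def save_version(self, doc_name, timestamp, text):
        self.versions.setdefault(doc_name, []).append((timestamp, text))
        self.tags.setdefault(doc_name, set())

    def add_tag(self, doc_name, name, value):
        # Only the tag level changes; no document text is copied.
        self.tags.setdefault(doc_name, set()).add((name, value))

    def find(self, wanted):
        """Names of documents carrying every wanted (name, value) pair."""
        return sorted(doc for doc, tags in self.tags.items() if wanted <= tags)

    def latest_text(self, doc_name):
        return self.versions[doc_name][-1][1]


store = TwoLevelStore()
store.save_version("loops-intro", "20240129083200.010Z", "Lorem ipsum ...")
store.add_tag("loops-intro", "topic", "code")
store.add_tag("loops-intro", "student", "alice")
store.add_tag("loops-intro", "student", "alex")   # sign Alex up

print(store.find({("student", "alex")}))      # ['loops-intro']
print(len(store.versions["loops-intro"]))     # 1
```

Signing Alex up touches only the tag level, so the document still has exactly one stored version.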

I don’t know the actual application, though I know enough to imagine parts of it. “Enough to be dangerous” we call that. But what I’d want to do, over time, is at least two things:

Store things so that we can represent any state we need to represent;
Define information-related notions, data and behavior, independent of the storage.

There may be work to be done on each of those, and certainly there is work to be done on the second. I’d want to begin to figure out sets of tags, configurations that represent the state of the curriculum as a set of topics, and as a set of assignments to students, and as a particular student’s status along their path through the assignments. We have a lot of modules. The 2024 Sophomore Cohort works on these modules. Alex has completed these five and can select, as their next task, any of these three of the remaining seventeen.

What are the datasets and configurations that represent this situation? How do we represent the situation, how do we manipulate it? We have not gone far enough in working on those issues.

Too far, and not far enough.

There’s nothing about what we’ve done here that requires files, other than today’s timing experiment. We could do just as well, for now, with collections in memory. It’s the abstraction of storage and behavior that we need to work out. So we have gone too far … and not far enough.

Fascinating. I look forward to this evening’s chat.

And I’ll hope to see you next time, when I’ll report on what happened, and decide what to do next.

¹ Friday Geeks Night Out. Held every Tuesday evening. As one does.