FAFO on GitHub

A Pythonista experiment makes me think. Time permitting, I’ll build some files.

Last night, while somewhat watching the Lions do really well in the first half, I wrote the following code, in my iPad’s Pythonista app:

import pytest
from collections import namedtuple

def test_tagset_exists():
    ts = TagSet()
    
def test_add():
    ts = TagSet()
    ts.add_at("ron", "author")
    assert ts.has_at("ron", "author")
    
def test_subset():
    s1 = TagSet()
    s1.add_at("ron", "author")
    s2 = TagSet()
    s2.add_at("bill", "author")
    s2.add_at("ron", "author")
    
    assert s1.is_subset(s2)
    assert not s2.is_subset(s1)


Atom = namedtuple("Atom", ["value", "name"])


class TagSet:
    def __init__(self):
        self._cont = set()
        
    def add_at(self, value, name):
        self._cont.add(Atom(value, name))
        
    def has_at(self, v, n):
        return (v, n) in self._cont
        #return any([value == v and name == n for value, name in self._cont])
        
    def is_subset(self, s):
        #return all([s.has_at(v, n) for v, n in self._cont])
        return self._cont.issubset(s._cont)

I have a love/hate relationship with Pythonista. I fear that its creator has nearly abandoned it, as its forum no longer allows login, and there hasn’t been a new version for a while. But it’s an amazing Python 3.10, running on an iPad, with quite a few libraries built in.

I have considerable trouble convincing it to run pytest without throwing an error, and more commonly, when I do things that should work, it collects the tests and then does not run them. I have a workaround that runs them and then pops up an error saying that it found no tests.

Even worse, of course, is that typing code with an iPad in one hand or on my lap is tedious. I really miss a real keyboard and miss PyCharm even more, but Pythonista is still a nice program and I get some good use out of it. (Yes, I know iPads have real keyboards, but it doesn’t work well in my TV room situation.)

Reflection (on the above code)

I’m a bit torn about what kind of abstractions to present. I started out thinking that there would be a TagSet, which would work like my Extended Set Theory stuff. XST represents sets with elements and “exponents”, for mathematical reasons, such as

{x^1, joe^name}

In XST, the thing on the bottom is called “element” and the thing on the top is called “scope”. In the code above, I was thinking “name and value”, as those were the words I had in mind for tags in my experiment with the curriculum database.

What I learned from the little experiment above is interesting (to me, at least).

Two Python tuples (a, b) and (c, d) compare equal exactly when a == c and b == d. And a Python set of tuples works exactly as one would expect it to. Furthermore, Python sets include a number of useful methods, including difference, intersection, isdisjoint, issubset, symmetric_difference, and union. I have not timed these, and do not really plan to, but as they are built in, I suspect they include some decent support from functions written in C.
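Here is a minimal sketch of that behavior, using plain tuples in built-in sets. The tag values are made up for illustration.

```python
# Plain two-tuples in built-in sets; the sample values are illustrative only.
doc_tags = {("ron", "author"), ("math", "topic"), ("alice", "student")}
query = {("ron", "author"), ("math", "topic")}
other = {("bill", "author"), ("math", "topic")}

# Tuples compare element by element, so membership tests just work.
has_ron = ("ron", "author") in doc_tags

# A few of the built-in set methods mentioned above.
is_sub = query.issubset(doc_tags)
common = doc_tags.intersection(other)
only_doc = doc_tags.difference(other)
disjoint = doc_tags.isdisjoint({("lisp", "topic")})
```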

So … for operations in memory, I think we’ll find that sets of two-tuples will be quite nice. Generally, I think it’s better not to surface native types, so I’ll probably want to cover these objects with classes or something similar.

Python, meanwhile, has actually gone out of its way to make it easy to subclass native classes like string, set, and tuple. When you do that, you gain the advantages of native types. Of course, wrapping isn’t all that much slower. Didn’t I figure out the other day that a Python method call is about 5 nanoseconds on my laptop?
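For comparison, here is a sketch of what inheriting from the native type might look like. This variant is hypothetical: the TagSet in the listing above wraps a set rather than inheriting from one.

```python
class TagSet(set):
    """Hypothetical variant: TagSet as a direct subclass of the built-in set."""

    def add_at(self, value, name):
        self.add((value, name))

    def has_at(self, value, name):
        return (value, name) in self


s1 = TagSet()
s1.add_at("ron", "author")
s2 = TagSet()
s2.add_at("bill", "author")
s2.add_at("ron", "author")

# issubset is inherited from set, so no wrapper method is needed.
subset_ok = s1.issubset(s2)
superset_not_subset = not s2.issubset(s1)
```

One thing to weigh before going this way: in CPython, inherited operations such as union hand back a plain set rather than a TagSet, so results would need re-wrapping. Inheriting also surfaces every set method to callers, which is exactly what covering the native type was meant to avoid.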

I should mention this line in particular:

Atom = namedtuple("Atom", ["value", "name"])

I think that’s the last thing I did last night. The upshot of that code is that Atom becomes a subclass of tuple, and the first two elements of the tuple can be accessed by t.value and t.name—and also by t[0] and t[1]. With all the built-in tuple methods, whatever those may be.
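A small sketch of what that buys us. It also shows why has_at can test a plain (v, n) tuple against a set of Atoms: an Atom compares and hashes like the tuple it is.

```python
from collections import namedtuple

Atom = namedtuple("Atom", ["value", "name"])

a = Atom("ron", "author")

# Fields are reachable by name and by position.
by_name = (a.value, a.name)
by_index = (a[0], a[1])

# An Atom is a tuple, so it compares and hashes like a plain tuple.
same = a == ("ron", "author")
found = ("ron", "author") in {a}

# Inherited tuple methods are available too.
count = a.count("ron")
```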

Reflection (what it may mean)

So, with those thoughts brought into my mind by the code I wrote last night, what about our actual problem?

Well. Imagine a bunch of documents, each with a bunch of tags associated. Tags are pairs of name and value, both strings. In our curriculum program, we want some particular documents. Maybe we want all the ones with student=james and topic=python.

If we consider the tags for the documents to be a set of tuples of name=value, then we want all the documents where the set

{ (james, student), (python, topic) }

is a subset of the document’s tags. That is, each of those tuples is also in the doc’s tuples. issubset is the whole thing.
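As a sketch of that query, assuming the documents’ tags are already in memory as sets of (value, name) tuples. The dictionary layout, the document ids, and the select function are made up for illustration.

```python
# Hypothetical in-memory index: document id -> set of (value, name) tags.
documents = {
    "doc1": {("james", "student"), ("python", "topic"), ("ron", "author")},
    "doc2": {("james", "student"), ("lisp", "topic"), ("ron", "author")},
    "doc3": {("alice", "student"), ("python", "topic"), ("bill", "author")},
}


def select(docs, wanted):
    """Return the ids of documents whose tags include every wanted tag."""
    return sorted(doc_id for doc_id, tags in docs.items() if wanted.issubset(tags))


wanted = {("james", "student"), ("python", "topic")}
matches = select(documents, wanted)
```

Only doc1 carries both tags, so it is the only match. An empty query set is a subset of everything, so it selects every document.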

What About the Files?

My original weird suggestion was to have the tags in the file name and the document in the file of that name. That has one slightly undesirable property, namely that the same document might appear under two completely different sets of tags, at least in principle. I think we don’t want that, because a common action for a curriculum creator will be to edit a document, intending to change it everywhere it appears. (Another common action will probably be to edit it intending to make a new version, like a second edition, and not to update the old ones. Maybe the creator is changing the exercises and wants existing students to keep the current ones. (In fact, paranoid creators will probably change the exercises for each cohort.))

I think we can assume that when a creator edits a document, the curriculum program has the document plus all its current keys. So I feel sure that we’ll have all the information we could possibly need to do whatever the creators might want to do. Some of those operations include:

  1. Add or remove tags from the document, intending to edit the entire situation. Perhaps we assigned the document to student=jones and meant student=jonas.
  2. Edit the document, intending that it should appear in edited form for all existing tags. Fixing a typo.
  3. Edit the document, intending it to appear only for some new set of tags. New edition.

These are “domain” questions and we certainly need to work out all the things creators can do, and the system operations that carry out those things.

One powerful technique, I feel sure, will be to associate a unique timestamp with each file. Time in milliseconds seems sufficient to me, if we are careful. We could write 1000 files per second if we waited for the next time tick for each write. Or, I don’t care, get time in microseconds. I think we will want a unique increasing key for each write. Maybe it’s just a growing integer?

I’m going to assume that it is an object that can come down to a guaranteed unique date-time. That will facilitate the very common case of updating a document, leaving all the user-defined tags the same: each time you save, you get the date and time automatically, and you can find last Thursday’s version readily.
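One possible shape for such an object, as a sketch only. UniqueClock, next_stamp, and format_stamp are hypothetical names. Instead of waiting for the next tick, this version bumps the stamp by one millisecond when the clock has not advanced, and it only guarantees uniqueness within a single process.

```python
import time
from datetime import datetime, timezone


class UniqueClock:
    """Hypothetical helper: hands out strictly increasing millisecond stamps."""

    def __init__(self, now_ms=lambda: time.time_ns() // 1_000_000):
        self._now_ms = now_ms
        self._last = -1

    def next_stamp(self):
        # If the clock has not advanced since the last call, bump by one
        # millisecond rather than waiting for the next tick.
        stamp = max(self._now_ms(), self._last + 1)
        self._last = stamp
        return stamp


def format_stamp(ms):
    """Render epoch milliseconds as UTC text like 20240129092005.000Z."""
    whole = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return f"{whole.strftime('%Y%m%d%H%M%S')}.{ms % 1000:03d}Z"


clock = UniqueClock()
stamps = [clock.next_stamp() for _ in range(5)]
```

Because the formatted text is fixed-width, it sorts in the same order as the underlying integers, which would make “last Thursday’s version” a simple range search.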

Wow, 132 (long) lines of thinking. That’s a lot.

Do Something

I was thinking we should make a bunch of files in a folder. Maybe a thousand of them. With keys.

I propose to create files with tags separated by underscore and tag elements separated by minus sign. I think those will be legal everywhere. I propose to prefix each file, automatically, with the time stamp. Let’s assume three tags, author, topic, and student, each with ten elements to be defined. Our file names will look like this:

t-20240129092005.000Z_author-ron_topic-math_student-jack.curry
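Reading such a name back into tags could look like the sketch below. parse_file_name is a hypothetical helper, and it assumes that tag names and values never contain an underscore and that names never contain a minus sign.

```python
def parse_file_name(file_name, suffix=".curry"):
    """Hypothetical inverse of the naming scheme: file name -> set of (value, name)."""
    stem = file_name.removesuffix(suffix)  # str.removesuffix needs Python 3.9+
    tags = set()
    for part in stem.split("_"):
        # Split on the first minus sign only, so a value may contain "-".
        name, _, value = part.partition("-")
        tags.add((value, name))
    return tags


tags = parse_file_name(
    "t-20240129092005.000Z_author-ron_topic-math_student-jack.curry"
)
```

The result is an ordinary set of (value, name) tuples, so the subset query works on it directly.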

Let’s generate some file names.

    def test_make_file_names(self):
        authors = ["ron", "bill", "geepaw", "chet", "sam", "amy", "janet", "susan", "beyonce", "taylor"]
        assert len(authors) == 10
        topics = ["math", "code", "python", "lisp", "fortran", "ethics", "debugging", "security", "mobbing", "pizza"]
        assert len(topics) == 10
        students = ["alice", "bob", "charlie", "dorothy", "eliza", "fred", "geena", "hector", "ida", "justin"]
        assert len(students) == 10
        time = 0
        names = []
        for author in authors:
            for topic in topics:
                for student in students:
                    time_stamp = f"t-20240129083200.{time:03d}Z"
                    line = f"{time_stamp}_author-{author}_topic-{topic}_student-{student}.curry"
                    time += 1
                    names.append(line)
        assert names[0] == "t-20240129083200.000Z_author-ron_topic-math_student-alice.curry"

Green. Commit: test making file names.

OK, but we actually want to make file names from a tag set. Let’s do that as a method on TagSet, at least for now. I’ll just use the one from Pythonista for now.

Ah! I got nearly done and realized that I have a concern.

    def test_make_file_names(self):
        authors = ["ron", "bill", "geepaw", "chet", "sam", "amy", "janet", "susan", "beyonce", "taylor"]
        assert len(authors) == 10
        topics = ["math", "code", "python", "lisp", "fortran", "ethics", "debugging", "security", "mobbing", "pizza"]
        assert len(topics) == 10
        students = ["alice", "bob", "charlie", "dorothy", "eliza", "fred", "geena", "hector", "ida", "justin"]
        assert len(students) == 10
        time = 0
        names = []
        for author in authors:
            for topic in topics:
                for student in students:
                    time_stamp = f"t-20240129083200.{time:03d}Z"
                    ts = TagSet()
                    ts.add_at(time_stamp, "t")
                    ts.add_at(author, "author")
                    ts.add_at(topic, "topic")
                    ts.add_at(student, "student")
                    line = ts.get_file_name()
                    time += 1
                    names.append(line)
        assert names[0] == "t-20240129083200.000Z_author-ron_topic-math_student-alice.curry"

The TagSet is a set. So it will not necessarily produce the file name in that order. Do we care? Interesting question.

They’ll still be unique. They’ll just be terrible to read, where up until a moment ago, they were at least possible to read.

New rule: the file name has the timestamp on the front and all the other keys in alphabetical order by name, then value.

Change the test:

        assert names[0] == "t-20240129083200.000Z_author-ron_student-alice_topic-math.curry"

In writing get_file_name, I found a bug in the test, which is now:


    def test_make_file_names(self):
        authors = ["ron", "bill", "geepaw", "chet", "sam", "amy", "janet", "susan", "beyonce", "taylor"]
        assert len(authors) == 10
        topics = ["math", "code", "python", "lisp", "fortran", "ethics", "debugging", "security", "mobbing", "pizza"]
        assert len(topics) == 10
        students = ["alice", "bob", "charlie", "dorothy", "eliza", "fred", "geena", "hector", "ida", "justin"]
        assert len(students) == 10
        time = 0
        names = []
        for author in authors:
            for topic in topics:
                for student in students:
                    time_stamp = f"20240129083200.{time:03d}Z"
                    ts = TagSet()
                    ts.add_at(time_stamp, "t")
                    ts.add_at(author, "author")
                    ts.add_at(topic, "topic")
                    ts.add_at(student, "student")
                    line = ts.get_file_name()
                    time += 1
                    names.append(line)
        assert names[0] == "t-20240129083200.000Z_author-ron_student-alice_topic-math.curry"

And get_file_name, working if not lovely, is this:

    def get_file_name(self):
        ts = next(filter(lambda pair: pair.name == "t", self._cont))
        remainder = self._cont.copy()
        remainder.discard(ts)
        tags = [pair for pair in remainder]
        tags.sort(key=lambda p: (p.name, p.value))
        tags.insert(0, ts)
        strings = [f"{t.name}-{t.value}" for t in tags]
        return "_".join(strings) + ".curry"
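A possibly tidier alternative, offered as a sketch rather than as what the code above does: fold the “timestamp first” rule into the sort key itself. The class here is a cut-down copy of TagSet so that the example runs on its own.

```python
from collections import namedtuple

Atom = namedtuple("Atom", ["value", "name"])


class TagSet:
    def __init__(self):
        self._cont = set()

    def add_at(self, value, name):
        self._cont.add(Atom(value, name))

    def get_file_name(self):
        # False sorts before True, so the "t" tag lands first;
        # everything else follows in (name, value) order.
        ordered = sorted(self._cont, key=lambda p: (p.name != "t", p.name, p.value))
        return "_".join(f"{p.name}-{p.value}" for p in ordered) + ".curry"


ts = TagSet()
ts.add_at("20240129083200.000Z", "t")
ts.add_at("ron", "author")
ts.add_at("math", "topic")
ts.add_at("alice", "student")
name = ts.get_file_name()
```

One behavioral difference to be aware of: the version above raises StopIteration when there is no t tag, while this sketch quietly produces a name without a timestamp.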

We’ve gone on long enough, and that is ragged enough to call for a break. So let’s reflect.

Reflection

I’ve learned a bit from last night’s Pythonista exercise, and this morning I think I have learned that I want the time stamp to be provided at the time the file is created, and while it will be provided in the tags as we read them in, it will be replaced when we write things out. Also I think it would be really nice if we were to name it something like aardvark so that it would come first in the tags.

Probably we’ll call it time and force it to the front of the file name string. This is all temporary anyway.

I am tempted to reverse the order of name and value in the tuple, so that I don’t have to hack the sort. On the other hand, now the sort says what it means and doesn’t know what the order actually is. A win for namedtuple there.

Clearly I can write a thousand files, and next time I think I’ll do it.

Bonus:

I found a lorem generator for python. Here’s a test:


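from lorem.text import TextLorem  # assumes the "lorem" package from PyPI


def test_lorem():
    thing = TextLorem(srange=(3,3), prange=(2,2), trange=(4,4))
    # srange = number of words in sentence
    # prange = number of sentences in para
    # trange = number of paras in text
    text = thing.text()
    print(text)
    assert False  # fail on purpose so pytest shows the printed text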

That produced this:

Quisquam sed voluptatem. Porro quisquam voluptatem.

Adipisci sed labore. Voluptatem est ipsum.

Quisquam non velit. Numquam dolore quaerat.

Sed labore sit. Tempora est aliquam.

That’s how I figured out what the parameters meant. I think that will be handy for producing documents.

See you next time!