The Robot World Repo on GitHub
The Forth Repo on GitHub

#StopTheCoup! Some thoughts on using compiled words for things like looping. Our next priority is loading definitions from files. How might we do that?

Branching Between Words

I’ve sketched, a couple of times, a Forth-language definition that could almost do the complex loop-ending that is our *LOOP operation:

class Forth:
    def star_loop(self):
        # Loop controls live on the return stack: index on top, limit below.
        new_index = self.return_stack.pop() + 1
        limit = self.return_stack.pop()
        beginning_of_do_loop = self.next_word()
        if new_index < limit:
            # Not done yet: restore limit and index, then branch back
            # to the top of the DO ... LOOP body.
            self.return_stack.push(limit)
            self.return_stack.push(new_index)
            self.active_word.branch(beginning_of_do_loop)

All of that could be done in a Forth word, with one question arising: how could we do the branch at the end? As I was trying not to wake up this morning, I gained a little clarity on why that’s difficult.

In our Forth, each compiled word is a separate callable object, or, looking at it another way, each compiled word is a separate function. Each one has its own program counter, allowing branching within that definition. Since Forth conditionals and loops can only be used in definitions, and the whole construct must appear in a single definition, we can branch within a word by changing the word’s program counter, and in fact that is just what we do:

class Word:
    def branch(self, word_address):
        self.pc = word_address

In a classical Forth, words are all compiled into a single long array of cells. Those cells include word names, typically threaded together, the code that starts the word running (there are a few ways to do that), and, in the case of a word made up of other words, the addresses or indices of those words. A compiled word’s starting code pushes a return address onto the return stack and the code at the end pops the stack, so that the code returns to wherever it came from.
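
For a concrete, if much simplified, picture of that layout, here is a toy sketch. It is illustrative only, not our implementation: all the “cells” live in one flat Python list, a word’s header links back to the previous word’s header, and a compiled word’s body is just the indices of the words it calls.

memory = []          # the whole Forth lives in this one array of cells
latest = -1          # index of the most recently defined word's header

def define(name, body_addresses):
    global latest
    header = len(memory)
    memory.append(latest)            # link cell, threading the names together
    memory.append(name)              # name cell
    memory.append('DOCOLON')         # code cell: starts this word running
    memory.extend(body_addresses)    # body: addresses (indices) of other words
    memory.append('EXIT')            # pops the return stack to get back
    latest = header
    return header

DUP = define('DUP', [])
PLUS = define('+', [])
DOUBLE = define('DOUBLE', [DUP, PLUS])   # : DOUBLE DUP + ;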

So, in principle, a compiled word can manipulate the return stack in such a way as to cause a loop back into another word entirely. I would have to draw a number of eight by ten color glossy diagrams to figure out just how to handle the return stack in such an instance, and I have not done that work.

Now, also in principle, our Forth could manipulate the active_words stack to make the same thing happen, but, where the classical Forth’s address space is continuous, ours is not, so it isn’t at all clear where we would find the address to which we were to branch.

I am confident that with sufficient work, we could modify our Forth to permit branching words to be written as compiled Forth words. Right now, we have exactly one such word, *LOOP, and the effort is not justified in my sole and ruling view. I might become fascinated by the problem and decide to deal with it on that basis, but I’d rather get this thing useful, and maybe even get it hooked up to the bot world someday.

Branches

One way to approach the issue of branches between words would be to refactor our program to keep all the word definitions in one big list. Then all the program counter values would be indices into that one list instead of having only the local context that they have now.
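
A rough sketch of that idea, with invented names: all the bodies live end to end in one shared list, the program counter is an index into it, and a branch, even into another word’s body, is just an assignment.

shared_code = []   # every definition's body, compiled end to end
pc = 0             # one program counter: an index into shared_code

def branch(address):
    global pc
    pc = address   # works even if address lies inside some other word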

Threaded Code

A classical Forth’s code is threaded. It branches everywhere, and some of the places it branches to may save an address on a stack and other places it branches to may pop those values. When a word reaches its end, it branches to one of a few known locations, depending on what is to be done next. Some of the places that are branched to just drop into other places that might be branched to.

We do not have a goto capability in Python. We might be able to simulate it somehow, since there are only a finite number of places to which Forth will branch inside itself. Words would still have to be called, I think, unless we did something more weird than I can presently envision. Be that as it may, it might make certain Forth things easier if we hewed closer to the threaded code line.
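
One way to fake it, sketched here with made-up labels: since there are only a finite number of branch targets, each handler can return the label of the next one, and a dispatch loop does the jumping.

def do_next():
    # fetch the next address and run its code... here we just stop
    return 'DONE'

HANDLERS = {'NEXT': do_next}

def inner_interpreter(label='NEXT'):
    while label != 'DONE':        # "goto LABEL" becomes "return 'LABEL'"
        label = HANDLERS[label]()

inner_interpreter()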

Globals

Classical Forth is basically just one not very large assembly-language program. As such, all the interesting locations and data structures are globally accessible to the whole program. In our Forth, we have a Forth class and make Forth instances, and all those would-be globals are members of the class and accessed accordingly.

It would be possible to make Forth with pure functions, not methods of a class, and with all the stacks and such global variables. This idea flies in the face of everything I’ve learned about object-oriented programming, but it would be, in some important sense, simpler, and it would surely be “closer” to the classical Forth style.
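
In that style the sketch is almost embarrassingly plain, just module-level variables and free functions:

stack = []           # module-level globals instead of instance members
return_stack = []
lexicon = []

def push(value):
    stack.append(value)

def pop():
    return stack.pop()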

We have no compelling reason to make our Forth more like a classical one, other than the sheer exercise of doing it. I doubt that we’ll move further in that direction, but I do value continuously considering alternative ways of doing what I’m doing. It’s fun, and once in a while I get an idea that makes things better, or that is just so much fun that I can’t resist.

For now, we’ll stick with what we’ve been doing.

Loading from Files

Once one has the basics in place — and I think we’re just about there — using Forth consists of defining words for solving the particular problem one has in hand, whether that is controlling some machine with a tiny microprocessor, or controlling some robots in a virtual world, or, well, anything at all.

Rather than type those words in every morning, which is tedious and error-prone, we’d like to have one or more Forth words for reading Forth from files. The Forth way of doing this is that the files just contain plain old Forth, either words to be executed or definitions via colon or CREATE. As the file is processed, definitions are built up and when the file is done and the Forth prompt appears again, Forth now knows all the words for solving the problem, whatever it is. Or, optionally, the last word just starts the program running, controlling the machine or the robots, and a Forth prompt never comes out again.

I think the standard word for reading from a file is INCLUDE. If it isn’t, it should be. And, we can see easily that an INCLUDE file might want to do another INCLUDE, so we’ll need to allow for some amount of nesting. Probably not too much, lest something bad happen. Probably adjustable. I don’t know, just musing.
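
I haven’t written any of this yet, but INCLUDE might come down to popping a file name, reading the file, and feeding its lines through compile, with a counter to bound the nesting. A speculative sketch, invented details and all:

MAX_INCLUDE_DEPTH = 8   # arbitrary; "probably adjustable"

def _include(forth):
    forth.include_depth = getattr(forth, 'include_depth', 0) + 1
    try:
        if forth.include_depth > MAX_INCLUDE_DEPTH:
            raise ValueError('INCLUDE nested too deeply')
        file_name = forth.stack.pop()   # assumes strings go on the stack
        with open(file_name) as source:
            for line in source:
                forth.compile(line)
    finally:
        forth.include_depth -= 1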

A file name is a string. So far we don’t have strings. A common approach in Forth is to have a free-standing double quote mark (followed by a space) signal that a string follows, and the string consists of everything from the next non-blank character to a closing quote. It is common for a string to be represented on the stack by an address and a count of the number of characters in the string.

We are not limited to that notion. Our stack could quite easily hold a string, since our stack is actually a stack of pointers to objects, all of which so far happen to be numbers. But they could be strings. I think we’ll assume that we’ll just put the string object on the stack. If that’s not good, we’ll refactor.
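
When we get there, the scan might amount to something like this hypothetical helper, not yet written: everything up to the closing quote, spacing intact, becomes one string object.

def split_off_string(line):
    # assumes the opening quote token has already been consumed
    end = line.find('"')
    if end == -1:
        raise ValueError('Unterminated string')
    return line[:end], line[end + 1:]

text, rest = split_off_string('The answer is:" .')
assert text == 'The answer is:'
assert rest == ' .'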

We have a bit of an issue with tokens. Currently we take a line and split it into tokens. Consider this line of Forth:

2 3 + " The answer is:" ." .

That will produce this list of string tokens:

['2', '3', '+', '"', 'The', 'answer', 'is:"', '."', '.']

And we will have lost the spacing between the words, which need not have consisted of single spaces.
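
A quick illustration of the loss:

tokens = '" The  answer   is:"'.split()
assert tokens == ['"', 'The', 'answer', 'is:"']   # the spacing is gone for good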

In a classical Forth, a line of text is presented, with an index, and Forth ticks along looking at characters until it encounters a space and voila! that’s a token. I think our best move will be to use that approach.

We have our first task:

Tokens

Here’s our current code. We call compile with input lines. That’s where everything starts.

    def compile(self, text):
        try:
            self.process_line(text)
        except Exception as e:
            msg = str(e)
            if msg == 'Unexpected end of input':
                return '...'
            else:
                self.abend()
                return f'{e} ?'
        return 'ok'

    def process_line(self, text):
        new_text = re.sub(r'\(.*?\)', ' ', text)
        self.tokens = new_text.split()
        while self.tokens:
            self.process_token(self.next_token())
        if self.compilation_state:
            raise ValueError('Unexpected end of input')

    def process_token(self, token):
        definition = self.get_definition(token)
        if not self.compilation_state or definition.immediate:
            definition(self)
        else:
            self.word_list.append(definition)

It’s worth mentioning that get_definition may return a defined word or a literal, or raise an exception if it finds neither. I imagine that we’ll be changing that to deal with strings as another option. But for now, we want to change to a format that leaves the input line untouched, so that our string finder can see it in its pristine form.

I think we’ll do our work in process_line, though we might wind up rearranging responsibilities later.

I’m not quite sure how this code will want to look, so let’s write some tests and see what we think.

After a fair amount of messing about, I have this sketch:

class TestTokenizing:
    def test_get_token(self):
        line = '  abc '
        index = 0
        skipping = True
        token = ''
        while index < len(line):
            char = line[index]
            if char == ' ' and skipping:
                index += 1
            else:
                skipping = False
                if char != ' ':
                    token += char
                    index += 1
                else:
                    break
        assert token == 'abc'
        assert index == 5

I am not fond of this code, because when I read it I’m not sure that it works as intended: I can’t take it in as a whole.

Furthermore, just because some naive assembly version of Forth would do it that way isn’t a good reason why someone with Python would do it that way.

Let’s restate the problem:

Given a string, return the first blank-separated substring, and the remaining string, ignoring leading blanks.

    def test_split_off_token(self):
        line = 'abc d ef g hij'
        token, line = split_off_token(line)
        assert token == 'abc'
        token, line = split_off_token(line)
        assert token == 'd'
        token, line = split_off_token(line)
        assert token == 'ef'
        token, line = split_off_token(line)
        assert token == 'g'
        token, line = split_off_token(line)
        assert token == 'hij'

Seems like a decent test and a sensible way for the function to work. And:

def split_off_token(line):
    trimmed = line.lstrip()
    index = trimmed.find(' ')
    if index == -1:
        return trimmed, ''
    else:
        return trimmed[:index], trimmed[index:]

Ah, I think I like that one. We’ll want to deal with other white space, I suppose. Let’s burn that bridge when we come to it, however. If there’s a weird character in there, we’ll get a weird token and won’t understand it.
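
For what it’s worth, if other whitespace ever does matter, Python’s split with a maxsplit of one would handle the whole job, treating any whitespace run as a separator, something like:

def split_off_token(line):
    parts = line.split(None, 1)   # at most one split, on any whitespace
    if not parts:
        return '', ''             # line was empty or all whitespace
    if len(parts) == 1:
        return parts[0], ''
    return parts[0], parts[1]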

I’ll move that function into the Forth file, make it a method there. It’ll be static soon enough.

But there’s more to deal with. We currently have a method next_token that is used not just in the process_line method but also in two other places:

    def _define_colon_semi(self):
        def _colon(forth):
            forth.word_list.clear()
            forth.compile_stack.push(forth.next_token())
            forth.compilation_state = True

    def _define_create_does(self):
        def _create(forth):
            address = forth.heap.next_available()
            literal = forth.find_word('*#')
            name = forth.next_token()
            word = Word(name, [literal, address])
            forth.lexicon.append(word)

So I think we need to promote the line to a member variable and use next_token. We’ll have next_token call the current split_off_token at first but then I think we’ll combine them.

This takes me longer than I’m proud of, but here we are, green:

    def process_line(self, text):
        self.input_line = re.sub(r'\(.*?\)', ' ', text)
        while self.input_line:
            self.process_token(self.next_token())
        if self.compilation_state:
            raise ValueError('Unexpected end of input')

    def next_token(self):
        token, self.input_line = self.split_off_token(self.input_line)
        return token

    @staticmethod
    def split_off_token(line):
        trimmed = line.strip()
        index = trimmed.find(' ')
        if index == -1:
            return trimmed.upper(), ''
        else:
            return trimmed[:index].upper(), trimmed[index:].strip()

I made at least these mistakes before getting here:

  • I forgot the self on the call to split_off_token, so I was calling the one in the test and editing the one in Forth. Not very effective.

  • The original version was not converting tokens to upper case, which is required.

  • The original version could return an empty string when the input string ended with a space (demonstrated just below). I think that could have been fixed by skipping the space on the return, but I fixed it by stripping over and over.
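
To see that last failure concretely, here is the original lstrip-only version again, given a trailing space: the remainder stays truthy, so the loop goes around once more and produces an empty token.

def split_off_token(line):          # the original, lstrip-only version
    trimmed = line.lstrip()
    index = trimmed.find(' ')
    if index == -1:
        return trimmed, ''
    return trimmed[:index], trimmed[index:]

token, rest = split_off_token('abc ')
assert (token, rest) == ('abc', ' ')   # rest is truthy, so we loop again
token, rest = split_off_token(rest)
assert (token, rest) == ('', '')       # and process an empty token: boom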

We are green, therefore commit: tokens now read one at a time from the input line.

We can clean up some things. I think we no longer use the tokens member on Forth. That is the case. Remove it, commit.

Let’s inline split_off_token into next_token. PyCharm does it oddly:

    def next_token(self):
        trimmed = self.input_line.strip()
        index = trimmed.find(' ')
        if index == -1:
            result = trimmed.upper(), ''
        else:
            result = trimmed[:index].upper(), trimmed[index:].strip()
        token, self.input_line = result
        return token

Let’s do it this way:

    def next_token(self):
        trimmed = self.input_line.strip()
        index = trimmed.find(' ')
        if index == -1:
            token, self.input_line = trimmed.upper(), ''
        else:
            token, self.input_line = trimmed[:index].upper(), trimmed[index+1:].strip()
        return token

I tried some options like lstrip at the top and rstrip at the bottom, which might be a bit more efficient, but it seemed less clear: we have to think about what’s going on and with strip we need to think less.
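
As a quick check of the final version, assuming here that Forth() builds with no arguments: tokens come back upper-cased, one at a time, until the line is exhausted.

forth = Forth()
forth.input_line = '  2 3 dup  + '
tokens = []
while forth.input_line:
    tokens.append(forth.next_token())
assert tokens == ['2', '3', 'DUP', '+']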

Commit: tidying. And that’ll do for now.

Summary

We need to make parsing quoted strings possible, ideally “easy”. To do that, it seemed to me that rather than splitting the input into tokens based on space, we needed to split off a single token at a time. I tried that longhand, with an index and all that, and it just got tedious and tricky. Doing it with sensible string operations seemed like a better idea. I think that turned out to be the case.

I had forgotten that there were multiple users of next_token, so after installing the new token code in process_line, 66 tests broke. I wonder what the deal was with the other 40 that didn’t break. Anyway, that required me to redo next_token and use it. I fumbled with that a bit, first referencing the wrong function, so that my edits seemed to have no effect, and then forgetting the need to upper-case the tokens.

After that fumbling, it all went well and aside from a moment’s confusion, everything was always well in hand and we were only a few lines away from a clean rollback or a commit.

Next time, probably, we’ll address string literals, leading to file names, leading to INCLUDE.

#StopTheCoup! See you next time!