Munching Forth

The Robot World Repo on GitHub
The Forth Repo on GitHub

It is literally oh-dark-thirty, and I am awake. Time to work on Forth. Is there a larger lesson? Yes, I think so. Things go well!

Now I freely grant that I write these programs and articles for my own purposes, keeping my mind functioning and keeping the demons out of my head. But there is a general point that I continually discover and want to bring across:

When our design is not what we now know that it “should be” or “could have been”, we can very likely improve it substantially in very small steps, without the need to rewrite or to pause progress while we “refactor”.

How likely is “very likely”? My best guess is that it’s “nearly always”. To get where we want to be might take very many small steps, and they might need to be taken over an extended period of time, punctuated by a lt of “feature” work, but I think it’s nearly always possible.

I could be wrong …¹

Today

I am up before the crack of dawn because I woke up and got to thinking, and when it became clear that I wasn’t going back to sleep, I decided to get up and work on our little Forth system.

The code and I are in a strange situation. I don’t like the code. It is jaggy and has lots of ifs and nooks and crannies. However, I do not see a better design than what I have. Sometimes, one does see it, “Oh, what we need here is” kind of thing. In this case, I do not have a grand idea for an overall better design. What I do have is some notions, and the ability to refactor.

Here are a few of those notions:

We have a new state, compilation_mode, True/False. Making better use of this state will probably improve things.
I feel like Forth’s “normal” mode is interpreting. Fetch a word, do it. Compilation mode is kind of secondary.
We can probably do more things with Words, and fewer with special cases or exceptions.
We will likely benefit from moving responsibilities around among the main program, the compile and unsafe_compile methods, and the compile_a_word method. It is this possibility that is most likely to lead to a design that is visibly better.

Let’s get started. I want to begin by reviewing compile_a_word, yet again, with an eye toward better using the compilation mode. And I’m going to commit what we have. It’s working, it’s a bit better, and I just made the point that I believe I can “always” get where I need to go, so if this isn’t the right path, we’ll manage somehow.

    def unsafe_compile(self, text):
        new_text = re.sub(r'\(.*?\)', ' ', text)
        self.tokens = new_text.split()
        self.token_index = 0
        while self.token_index < len(self.tokens):
            word = self.compile_a_word()
            word(self)

    def compile_a_word(self):
        while True:
            token = self.next_token()
            if token is None:
                raise ValueError('Unexpected end of input')
            if (definition := self.find_word(token)) is not None:
                if definition.immediate:
                    definition(self)
                else:
                    self.word_list.append(definition)
            elif (num := self.parse_number(token)) is not None:
                self.compile_literal(num, self.word_list)
            else:
                raise SyntaxError(f'Syntax error: "{token}" unrecognized')
            if not self.compilation_state:
                break
        word = Word('nameless', self.word_list[:])
        self.word_list.clear()
        return word

Our “main loop”, there in unsafe_compile, just repeatedly calls compile_a_word and executes the result.

I have a bit of discomfort with this setup. I have just a vague feeling that the responsibilities aren’t allocated quite right between these two methods. We’ll hold onto that discomfort and proceed.

Let’s consider the two states, compilation state and, um, not. Ha! That suggests that our compilation_state boolean isn’t quite the thing. Another item to hold on to. We’re after bigger game just now. Now that there are two things to remember, I write them on a card.

Here’s how I think it should work, not how it does work:

If we have a token and we are not in compilation state, and we find a word for that token, we should return it immediately and execute it. Or just execute it, which is part of why I think responsibilities aren’t quite right here. (This may be suggesting that we should inline the code and refactor. It might come to that.)

If we find a word and we are in compilation state, we should add the word to the word being compiled. As things stand, we cannot yet return to unsafe_compile.

It seems to me that unsafe_compile should be named interpret_a_token or something like that. It doesn’t always compile. Ditto for the compile method above it, not shown, which really just fields exceptions. I’m going to hold off on renames, still thinking. Note made.

If we do not find a word, we determine whether we have a number. If we have a number, build the word that pushes the number onto the stack. If we are not in compilation state, return it. If we are in compilation state, add it to the word being compiled.

If we do not find a number, we have an error. We could return a word that prints the error in this case, or continue to raise an exception.

So what if we had a method that converts a token to a word unconditionally, with the third possibility being a word that emits an error or raises an exception, whichever seems best?

Similarly, why do we raise an exception when we find no token? Why aren’t we just returning a word that says we reached the end of the line or hit a carriage return or whatever we call it?

It might be named “ok”. And why are we looping in compile_a_word? Why not retain the compilation state and let unsafe_compile, which is paying attention to the length of the token list, keep calling us?

Removing that loop would be kind of a big deal.

I feel as if we could try one or all of these ideas:

Execute words directly inside compile_a_word when we find them.
Have unsafe_compile pass a token to compile_a_word. What would it do upon the loop ending? Maybe nothing. It does nothing now.
Have compile_a_word just accumulate or execute as it sees fit. No one needs to know.

Let’s try it. This may be a spike. The change seems larger than I’d like. Maybe if we split compile_a_word at the top based on state. We can do that first, I think.

… a bit of fumbling happens …

I’ve gone further than I predicted above, not quite as predicted. Here’s a modified compile_a_word, working:

    def compile_a_word(self):
        while True:
            token = self.next_token()
            if token is None:
                raise ValueError('Unexpected end of input')
            found_word = self.find_word(token)
            if found_word is None:
                num = self.parse_number(token)
                if num is not None:
                    literal = self.find_word('*#')
                    found_word = Word('', [literal, num])
                else:
                    raise SyntaxError(f'Syntax error: "{token}" unrecognized')
            if not self.compilation_state:
                return found_word
            if (definition := found_word) is not None:
                if definition.immediate:
                    definition(self)
                else:
                    self.word_list.append(found_word)
            if not self.compilation_state:
                break
        word = Word('nameless', self.word_list[:])
        self.word_list.clear()
        return word

Don’t try too hard to understand that, we’ll improve it in a moment.

In this new scheme, we added code that unconditionally produces a word, found_word, unless we raise the syntax exception. ‘found_word’ is either a named word or a newly created literal word. Then, if we are not compiling, we return what we found, because we’re interpreting, so we return it to be called.

If we are compiling and it’s an immediate word, we do it. Otherwise we add it to the list. A bit further along in the article, we’ll return it rather than execute it. I have changed a bit more than I like to just to get here. But we’re green, so it’s OK.

We still have the infinite loop. Let’s refactor a bit, pulling out the word creation bit.

    def compile_a_word(self):
        while True:
            token = self.next_token()
            if token is None:
                raise ValueError('Unexpected end of input')
            found_word = self.find_word_or_literal(token)
            if not self.compilation_state:
                return found_word
            if (definition := found_word) is not None:
                if definition.immediate:
                    definition(self)
                else:
                    self.word_list.append(found_word)
            if not self.compilation_state:
                break
        word = Word('nameless', self.word_list[:])
        self.word_list.clear()
        return word

    def find_word_or_literal(self, token):
        found_word = self.find_word(token)
        if found_word is None:
            num = self.parse_number(token)
            if num is not None:
                literal = self.find_word('*#')
                found_word = Word('', [literal, num])
            else:
                raise SyntaxError(f'Syntax error: "{token}" unrecognized')
        return found_word

I’m going to commit and keep committing frequently. I think this is going well. (I really could be wrong, but I’m not.) Commit: refactoring

Clean up find_word_or_literal, just because we can:

    def find_word_or_literal(self, token):
        found_word = self.find_word(token)
        if found_word:
            return found_word
        num = self.parse_number(token)
        if num is not None:  # might be zero
            literal = self.find_word('*#')
            return Word('', [literal, num])
        else:
            raise SyntaxError(f'Syntax error: "{token}" unrecognized')

A bit better. Commit. Now clean up compile_a_word a bit. It starts like this:

    def compile_a_word(self):
        while True:
            token = self.next_token()
            if token is None:
                raise ValueError('Unexpected end of input')
            found_word = self.find_word_or_literal(token)
            if not self.compilation_state:
                return found_word
            if (definition := found_word) is not None:
                if definition.immediate:
                    definition(self)
                else:
                    self.word_list.append(found_word)
            if not self.compilation_state:
                break
        word = Word('nameless', self.word_list[:])
        self.word_list.clear()
        return word

Substitute found_word for definition:

    def compile_a_word(self):
        while True:
            token = self.next_token()
            if token is None:
                raise ValueError('Unexpected end of input')
            found_word = self.find_word_or_literal(token)
            if not self.compilation_state:
                return found_word
            if (found_word) is not None:
                if found_word.immediate:
                    found_word(self)
                else:
                    self.word_list.append(found_word)
            if not self.compilation_state:
                break
        word = Word('nameless', self.word_list[:])
        self.word_list.clear()
        return word

The found_word cannot be None, so:

    def compile_a_word(self):
        while True:
            token = self.next_token()
            if token is None:
                raise ValueError('Unexpected end of input')
            found_word = self.find_word_or_literal(token)
            if not self.compilation_state:
                return found_word
            if found_word.immediate:
                found_word(self)
            else:
                self.word_list.append(found_word)
            if not self.compilation_state:
                break
        word = Word('nameless', self.word_list[:])
        self.word_list.clear()
        return word

Getting there, I think. Can word_list have anything in it at this point? I would think not.

    def compile_a_word(self):
        while True:
            token = self.next_token()
            if token is None:
                raise ValueError('Unexpected end of input')
            found_word = self.find_word_or_literal(token)
            if not self.compilation_state:
                return found_word
            if found_word.immediate:
                found_word(self)  # <=== 
            else:
                self.word_list.append(found_word)
            if not self.compilation_state:
                return Word('no-op', [])

We have remained green throughout. Commit.

Now what about that call to execute found_word? What would happen if we were to return it? It would get executed and we’d be called back.

    def compile_a_word(self):
        while True:
            token = self.next_token()
            if token is None:
                raise ValueError('Unexpected end of input')
            found_word = self.find_word_or_literal(token)
            if not self.compilation_state:
                return found_word
            if found_word.immediate:
                return found_word  # <===
            else:
                self.word_list.append(found_word)
            if not self.compilation_state:
                return Word('no-op', [])

Still green. Commit.

I’d like to get rid of the while True. Just removing it breaks 49 tests. This way breaks just two:

    def compile_a_word(self):
        token = self.next_token()
        if token is None:
            raise ValueError('Unexpected end of input')
        found_word = self.find_word_or_literal(token)
        if not self.compilation_state:
            return found_word
        if found_word.immediate:
            return found_word
        else:
            self.word_list.append(found_word)
        return Word('no-op', [])

The two are both dealing with error conditions:

    def test_partial_definition(self):
        f = Forth()
        with pytest.raises(ValueError) as e:
            f.unsafe_compile(': FOO 444 222 +')
        assert str(e.value) == 'Unexpected end of input'

    def test_safe_compile_needs_more_input(self):
        f = Forth()
        result = f.compile(': FOO 42 ')  # no semicolon
        assert result == '...'

I think we fix the first one up in unsafe_compile. In fact, this fixes both:

    def unsafe_compile(self, text):
        new_text = re.sub(r'\(.*?\)', ' ', text)
        self.tokens = new_text.split()
        self.token_index = 0
        while self.token_index < len(self.tokens):
            word = self.compile_a_word()
            word(self)
        if self.compilation_state:
            raise ValueError('Unexpected end of input')

Ha! Now we can pull the token in unsafe_compile and pass it to compile_a_word. And we know we will never pass a None because unsafe_compile is checking the length, so we can remove that check as well.

    def unsafe_compile(self, text):
        new_text = re.sub(r'\(.*?\)', ' ', text)
        self.tokens = new_text.split()
        self.token_index = 0
        while self.token_index < len(self.tokens):
            token = self.next_token()
            word = self.compile_a_word(token)
            word(self)
        if self.compilation_state:
            raise ValueError('Unexpected end of input')

    def compile_a_word(self, token):
        found_word = self.find_word_or_literal(token)
        if not self.compilation_state:
            return found_word
        if found_word.immediate:
            return found_word
        else:
            self.word_list.append(found_word)
        return Word('no-op', [])

Refactor the latter method to remove the duplicated return statements:

    def compile_a_word(self, token):
        found_word = self.find_word_or_literal(token)
        if not self.compilation_state or found_word.immediate:
            return found_word
        else:
            self.word_list.append(found_word)
            return Word('no-op', [])

Reflection

This is much improved. We’ve reduced the ifs quite a bit, and removed the irritating while True. We’re a bit closer to the classical Forth behavior which is that it is always interpreting a token at a time, even if those tokens happen to be being compiled.

I think there are things that could be better. We could build a “NO_OP” word and use it instead of creating a new one every time. We could cache it rather than look it up. Renaming and inverting compilation_state to interpretation_state would avoid a not. Making it a enum might improve readability.

I think we’ll hold off on those util next time. This is feeling like a good place to stop.

We could benefit from some renaming as well. compile_a_word is really only compiling sometimes. Same with unsafe_compile and compile.

Since both unsafe_compile and compile_a_word are referring to commpiation_state, I suspect that a different allocation of responsibilities might be called for.

Bottom line, I suspect that we can make further improvements.

We should also consider returning words that either look like errors or raise exceptions, depending on whether we’re running headless. Forth generally executes words to decide what to do: we might benefit from following that approach.

Summary

\The thing that unblocked the changes was to insert code that always found a word, whether it was from the lexicon or a number, and then processed it. Then a series of small changes, referring to the found one, using it, gradually unwound things.

I do think it could have been done more cleverly, but I never got in enough trouble to really back out, and the result seems much improved and a good platform for further improvement, so I am pleased with the morning’s work. It was worth getting up early, because my early thoughts led to a better place.

I do foresee that we’ll find something even better, perhaps by inlining and then extracting differently. We’ll fine out … next time. See you then!

I know two possible endings to this phrase. “I frequently am”, which reflects reality. As a programmer, we make a lot of mistakes and we might as well face it. And, quoting Eagles: “But I’m not”. In this case, the former surely applies, because it always does, and I’m actually rather sure about the latter. In any case, you get to do whatever you wish, in the light of your own experience. I’m just sayin’. ↩