FAFO on GitHub

Spikes for the expression parsing have taught us enough. Let’s see how we can best move from the spike code into decent production objects. There’s one somewhat large issue: symbols. No, two: data types.

Hello, Friends!

I’d be within my rights to throw away the experimental code for lexing, parsing, and executing expressions, and surely what we wind up with needs to be quite different. But we’ll keep the test_functions code around, for reference, and possibly for reuse via refactoring or simple copy-pasta. Today, we’ll start building a more grown-up version of parsing and executing expressions.

There are at least two details that I’m concerned about.

One is “symbols”, the scope names that can appear in an expression:

gross_pay = salary + bonus

We do know how to get a value from a field, but in our expressions so far, we have never done it. So there could be dragons or possibly sea monsters lurking there, but I don’t expect any big issues. It’s just not done yet.

The second issue is data types. In access to real-world data, we are most likely to encounter only alphanumeric fields. But there are exceptions, such as the possible COBOL file containing packed decimal or double precision floating point. Our XSets can contain fields of any hashable type, and we actually have some tests where we use numbers, at least integers. Our expression tests of the past day or so even include a 37.5 if I’m not mistaken.

However, that 37.5 is presented as a string. If I am not mistaken, which assumes facts not in evidence, the current RPN experiment converts values from string to numeric, and pushes them back on the stack as numbers. I kind of wonder how that even works. I’ve made a note to look into that. I think we need to make a reasonable decision here, recognizing that on some future day, and that day may never come, we’ll be called upon to do something more robust. My current guess is that we’ll assume alpha in and alpha out, and try to deal with numbers in, still with alpha out.

Neither of these issues is serious but we do need to keep them in mind so that they’re dealt with, and in a reasonable fashion.

Tentative Plan

I think we need an Expression object that contains a parsed expression such that we can pass a record in and get a new element and scope out. That will mean that our expression parsing will need to deal with the gross_pay= part of things, which it currently does not.

Our lexing and parsing object will be responsible for producing an Expression. We’ll know more about that after we work out what the Expression needs to be. We’ll TDD that and then reassess.

But first, let’s get started by figuring out how numbers are going in and out of this thing.

That answer turns out to be easy! I love it when things are easy! Let me describe the code and then we’ll look at it.

The interpreter receives a list of strings representing the parsed expression converted to Reverse Polish Notation, such as

'21', '2', '*'

It runs that list through its make_executable function, which converts that list to a series of executable steps. The executable step for an operation, like ‘*’, is to pop two items off the interpreter’s stack and multiply them., pushing the result onto the stack. The executable step for an operand, like ‘21’, is to convert it to a number and push it onto the stack. After all is said and done, there is one number left on the top of the stack, and it’s the answer.

What makes it work when we push a number back onto the stack instead of a string is that our to_number function converts numbers back to themselves:

def to_number(string):
    try:
        return int(string)
    except ValueError:
        return float(string)

It turns out that int(42) is 42, int(37.5) raises an exception, and float(37.5) is 37.5. Yes, I have tests for that.

Nonetheless, I think our rule for the new Expression will be that it converts to number, does the math, and converts the result back to string, so that our records will be all string fields until further notice.

Expression Object

Let’s drive out an Expression object based on the code we have now, which looks like this:

We convert the input line to tokens with lex:

def lex(expr):
    no_space = expr.replace(' ', '')
    rx = '([^a-zA-Z0-9.])'
    return re.split(rx, no_space)

This rips out spaces, which we do not like, and then splits the line on non-alphanumeric values, capturing them. I freely grant that I found this on stack overflow or some such place, but it is tested and does what we need.

We then parse that list to RPN using Dijkstra’s shunting algorithm:

def dijkstra(lexed: list):
    priority = {'+': 1, '-': 1, '*': 2}
    stack = []
    output = []
    expr = lexed[::-1]
    while expr:
        token = expr.pop()
        if token[0].isalpha():
            output.append(token)
        elif token[0].isnumeric():
            output.append(token)
        else:
            o1 = token
            p1 = priority[o1]
            while stack and stack[-1] != '(' and priority[stack[-1]] > p1:
                o2 = stack.pop()
                output.append(o2)
            stack.append(o1)
    while stack:
        output.append(stack.pop())
    return output

This presently handles plus, minus, and times, because that was enough to satisfy my testing. I don’t remember where I stole that, probably Wikipedia. It will need enhancement for parens and slash and perhaps other extensions later.

The output of dijkstra is the input expression converted to RPN.

['Pay' '*' '0.10']

Will be turned into:

['Pay', '0.10', '*']

Our interpreter reverses the list, converts the list to executable, and simply executes the result in order:

def interpret(rpn):
    stack = []
    r = rpn[::-1]
    executable_r = make_executable(r)
    while executable_r:
        executable_r.pop()(stack)
    return stack.pop()

We reverse the list because we need to process it from left to right, first pushing the values and then executing the operations. Reversing it lets us process it by popping it.

We’ll skip looking at make_executable until later. Our focus now is on creating an Expression object that we can live with.

Expression via TDD

We’ll write tests and create the class and we allow ourselves the option of writing, copying, or refactoring to get our code. Right off the bat, it seems clear that the Expression should include only the RPN that has been converted to make_executable form, because that is not a function of the input to our Expression, which is only an XSet representing the record containing the input values.

Begin with a test, and I think we’ll do a new test file, called test_expressions. I do think we’ll include all of the lexing and parsing in there but we’ll see.

class TextExpressions:
    def test_create_Expression(self):
        expr = Expression('gross_pay', [])

This is enough to fail, since there is no Expression. I am already concerned about a couple of things: First, what is that list? It’s supposed to be a list of functions for the Expression to call. Should it be a named object? And second, how are we going to populate it without first doing the parsing?

The next thing I find out is that Expression is a well-known name in Python, part os _ast, which unless I miss my guess, stands for Abstract Syntax Tree. We may need to rename it. For now, we’ll go with it.

class Expression:
    def __init__(self, scope, operations):
        self._scope = scope
        self._operations = operations

Green. Commit: initial Expression test and class.

Now we know that the rules are that the Expression is going to return a result, so how about a test that returns a constant?

I’m not sure if we can manage to work without an actual parser and such, but we’ll try. If we need to do the other things first, we can start on them any time we need to. I’d just prefer to keep fewer balls in the air.

    def test_returns_constant(self):
        op = lambda stack: stack.append(42)
        record = None
        expression = Expression('Answer', [op])
        assert expression.scope() == 'Answer'
        assert expression.result(record) == 42

I decided to test returning the scope too. Yeah, I know, one assert. YMMV. When I think of a thing, I might type it in lest I forget. Test fails, has no such methods:

    def scope(self):
        return self._scope

    def result(self, record):
        return 0

Fails as expected:

Expected :42
Actual   :0

Fix it:

    def result(self, record):
        stack = []
        self._operations[0](stack)
        return stack.pop()

Passes. I realize that we really want to pass the record to the operations, not just the stack. Let’s fix our test to do that.

    def test_returns_constant(self):
        op = lambda stack, record: stack.append(42)
        record = None
        expression = Expression('Answer', [op])
        assert expression.scope() == 'Answer'
        assert expression.result(record) == 42

And:

    def result(self, record):
        stack = []
        self._operations[0](stack, record)
        return stack.pop()

OK we need a list to test, and we need a decision. What order is the list to be processed in, inside Expression? I think from high to low, via pop, so the RPN we provide has to be reversed. Who does the reversing? I think that should be left up to Expression, and we should provide real RPN in our tests and from our parser. I also note that I passed in and tested a number, not a string. I’ll change the test as a reminder that we’re assuming string inputs.

    def test_rpn(self):
        op21 = lambda stack, record: stack.append('21')
        op2 = lambda stack, record: stack.append('2')
        def plus(stack, number):
            op1 = to_number(stack.pop())
            op2 = to_number(stack.pop())
            stack.append(str(op1 + op2))

        operations = [op21, op2, plus]
        expression = Expression('Answer', operations)
        assert expression.result(None) == '42'

I think this is valid, let’s see if we can make it work. We cannot, because the op can’t find a to_number to execute. We’re going to need to pass self in to all our operations, to provide access to things like that. Change the code, fix the tests.

Note
It isn’t valid. 21 plus 2 is not 42. 21 times 2 is. We’ll find that out below.

After fixing the + to * and a few other things, we have these tests:

    def test_returns_constant(self):
        op = lambda self, stack, record: stack.append('42')
        record = None
        expression = Expression('Answer', [op])
        assert expression.scope() == 'Answer'
        assert expression.result(record) == '42'

    def test_rpn(self):
        op21 = lambda self, stack, record: stack.append('21')
        op2 = lambda self, stack, record: stack.append('2')
        def times(self, stack, number):
            op1 = self.to_number(stack.pop())
            op2 = self.to_number(stack.pop())
            stack.append(str(op1 * op2))

        operations = [op21, op2, times]
        expression = Expression('Answer', operations)
        assert expression.result(None) == '42'

And this class:

class Expression:
    def __init__(self, scope, operations):
        self._scope = scope
        self._operations = operations[::-1]

    def result(self, record):
        def to_number(string):
            return int(string)

        stack = []
        while self._operations:
            self._operations.pop()(self, stack, record)
        return stack.pop()

    def scope(self):
        return self._scope

    def to_number(self, string):
        try:
            return int(string)
        except ValueError:
            return float(string)

I think we could have been committing here and there but we can surely commit now: Expression class can execute valid RPN if you can find some.

Time to Reflect

The commit message is pretty much on point: if we can get some executable RPN in there, the expression will do what has to be done, so far so good. I am a bit concerned about where the code will reside that creates the actual functions that go into the operations list: it will have to know a lot about what the Expression can do. That said, every compiler has to understand its target machine somewhere, so it’s not like it’s a new problem in the universe.

We need to address fetching values from an XSet “record”, but I think it’s time to get the lexer and parser involved in this enterprise.

Let’s move Expression over to the src side. I’ve been coding it in the test class, as one does. Green. Commit move Expression to src tree.

I think this article is long enough. I might take a break, might carry on immediately. For this one, let’s sum up:

Summary

We’re not really refactoring from what we have to our new stuff. We are more using it as a sketch to guide the implementation. Other than to_number, we haven’t really copied any code from the experiments of the past few sessions, but we certainly have looked at them for ideas and understanding. That’s why we did them, after all.

We TDD’d a new object, Expression, that knows the name of a desired output field, and that executes a series of functions that operate on a stack, given an input record and the Expression object itself, for access to utility functions like to_number. The Expression object has no idea what is going on: it just does the functions in it list.

There is an issue with that, and we see it in the second test:

    def test_rpn(self):
        op21 = lambda self, stack, record: stack.append('21')
        op2 = lambda self, stack, record: stack.append('2')
        def times(self, stack, number):
            op1 = self.to_number(stack.pop())
            op2 = self.to_number(stack.pop())
            stack.append(str(op1 * op2))

        operations = [op21, op2, times]
        expression = Expression('Answer', operations)
        assert expression.result(None) == '42'

As things stand, each function will have to pop its parameters, convert them if needed, execute its operation, and convert and stack the result. The only difference between those functions will be the operation, plus, times, minus, divide. This is what we in my line of work call “duplication” and we are, as a rule, against it.

We could change the definition of the operations, of course, and one interesting possibility is to make them tiny objects containing functions, instead of just functions, and the objects could include the warm up and cool down aspects of the various operations. That would certainly break the current tests, but we’ll not worry about that. By the time we can see what’s needed, we’ll have plenty of tests and anyway, if some need improvement, we’ll improve them.

I am starting to suspect that my assumptions about what Expression will receive and how it will work are not entirely correct. This is no surprise: my assumptions about the future are almost always incorrect. We’re not trying to call our shot days out from now: we are trying to take small steps in approximately the right direction.

And that, I feel sure, we have done. Even if the details aren’t quite right, those two tests of Expression are quite nice. They feel to me like decent tests for a small interpreter.

Next time, we’ll probably proceed with lexing and parsing, so as to work out the relationship between the parsing side and the doing side. Along the way, we’ll pick up pulling values out of records, but that’s a solved problem and only the details need to be sorted.

Things are going nicely.

See you next time!