Fantastic day! Completely augured in. Recovered. Brilliant.

It all started perfectly reasonably …

Let’s pretend that we’ve decided to make our FSM into a reasonably general-purpose Finite State Machine tool, for use in all our many products here chez Ron. With that in mind, we might choose to provide for all the callbacks that might be needed, and we’d want the code to be as bright and habitable as possible.

So let’s do that. What we won’t do is pre-process the input table for faster execution. We have no reason to fear the current speed, so we’ll save that concern for version 2. or something.

We’ll follow the lead of the other Lua versions, and the JavaScript version, in naming the callbacks. If they’re consistent. Let’s look.

Undefined (Daniel Perez Alvarez):

Four types of callback are available by attaching methods to your state machine, using the following naming conventions (where and get replaced with your different event and state names):

  • on_before_ - fired before the event
  • on_leave_ - fired when leaving the old state
  • on_enter_ - fired when entering the new state
  • on_after_ - fired after the event

For convenience, the 2 most useful callbacks can be shortened:

  • on_ - convenience shorthand for on_after_
  • on_ - convenience shorthand for on_enter_

In addition, four general-purpose callbacks can be used to capture all event and state changes:

  • on_before_event - fired before any event
  • on_leave_state - fired when leaving any state
  • on_enter_state - fired when entering any state
  • on_after_event - fired after any event

All callbacks will be passed the same arguments:

  • self (the finite-state machine that generated the transition)
  • event name
  • from state
  • to state

(followed by any arguments you passed into the original event method)

Kyle Conroy

4 callbacks are available if your state machine has methods using the following naming conventions:

  • onbeforeevent - fired before the event
  • onleavestate - fired when leaving the old state
  • onenterstate - fired when entering the new state
  • onafterevent - fired after the event

You can affect the event in 3 ways:

  • return false from an onbeforeevent handler to cancel the event.
  • return false from an onleavestate handler to cancel the event.
  • return ASYNC from an onleavestate or onenterstate handler to perform an asynchronous state transition (see next section)

For convenience, the 2 most useful callbacks can be shortened:

  • onevent - convenience shorthand for onafterevent
  • onstate - convenience shorthand for onenterstate

In addition, a generic onstatechange() callback can be used to call a single function for all state changes:

Jake Gordon (JavaScript):

Lifecycle Events

In order to track or perform an action when a transition occurs, five general-purpose lifecycle events can be observed:

  • onBeforeTransition - fired before any transition
  • onLeaveState - fired when leaving any state
  • onTransition - fired during any transition
  • onEnterState - fired when entering any state
  • onAfterTransition - fired after any transition

In addition to the general-purpose events, transitions can be observed using your specific transition and state names:

  • onBefore - fired before a specific TRANSITION begins
  • onLeave - fired when leaving a specific STATE
  • onEnter - fired when entering a specific STATE
  • onAfter - fired after a specific TRANSITION completes

For convenience, the 2 most useful events can be shortened:

  • on - convenience shorthand for onAfter
  • on - convenience shorthand for onEnter

Consistency Was Too Much To Ask For

We continued along smoothly …

OK. They all seem to follow much the same pattern. We currently only allow checking on state, not on event, and, where XXX is the name of a specific state, our names are:

  • on_leave_state
  • on_leave_XXX
  • on_enter_state
  • on_enter_XXX

I think we are checking these in the wrong order. Here’s the code for that:

function FSM:event(event, ...)
    local args = {...}
    return self:performAccepted(event, function(trans)
        if false == self:doCallback("on_leave_state", event, trans, args) then
            return
        end
        if false == self:doCallback("on_leave_"..self.fsm_state, event, trans, args) then
            return
        end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end)
end

We check on_leave_state before checking the specific one. Arguably we should check specifics before generals. In addition, and I got this idea from one of the other versions, we should probably run both of those conditioned ones before exiting. As a rule, I think I’d want to say that all the relevant callbacks will always be executed.

Let’s fix that right now.

function FSM:event(event, ...)
    local args = {...}
    return self:performAccepted(event, function(trans)
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if specific or generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end)
end

Now it’s specific before generic, and all relevant are executed. Commit: all relevant state callbacks are always called, specific first.

Now what about those names? Our usual naming convention here is camelCase. Because these aren’t methods, we can make a case that they should be different. And plenty of Lua is written in snake_case, perhaps even the bulk of it. We’ll leave them.

What about the event-focused callbacks? I came in here planning to cater to them. But we have no need of them.

In favor of doing it is that we are fresh on the FSM and so this is a good time to do something like the event callbacks. If a different team, or even our team, works on extending FSM later, they may not be as fresh and might not do it as well as we can.

Yeah, no. Not gonna do it. FSM Release One has state callbacks. Doesn’t have event callbacks. Get in touch if you need them, we’ll give you two-day turnaround on a new version.

Plan Changed

We changed direction a bit, but that’s OK …

The plan changed. Is that bad? I don’t think so. With deeper consideration and a closer connection to the code and to what’s out there, we made a different decision than the one we made earlier. Why should we stick with an inferior plan when we now have a better one?

The main reason, it seems to me, is that “they” don’t like surprises. If that’s true, and if it matters, it seems to me that we’re faced with two choices:

  1. Stick with inferior plans because “they” don’t like change.
  2. Don’t tell “them” so much about what you’re doing. Focus on results, not details.

Here’s Release One. It has this stuff. If you need more, let us know and we’ll put it in the priority list wherever you want it.

A Code Issue

We want to improve the code now that we (think) we have a clean commit.

There is a code issue that I thought we should look at. We’ve looked at the method in question already, but here it is with a bit more context:

function FSM:event(event, ...)
    local args = {...}
    return self:performAccepted(event, function(trans)
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if specific or generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end)
end

function FSM:performAccepted(event, f)
    for i,trans in ipairs(self.tab.events) do
        if self:fromMatches(trans) and trans.event == event then
            if f then f(trans) end
            return true
        end
    end
    return false
end

function FSM:doCallback(callbackName, event, trans, args)
    local cb = self.tab.callbacks
    if cb[callbackName] then
        return cb[callbackName](self, event, trans.from, trans.to, table.unpack(args))
    else
        return true
    end
end

The method performAccepted is passed an event and a function, and if there is a valid transition for that event, it executes the function, passing it the transition. It returns true or false, depending whether it found a usable transition, and the event method returns that value, so that, in principle, we can check whether the event was handled.

(Some of the FSMs we studied throw exceptions when events can’t be accepted. We don’t like exceptions, and so we simply return a boolean result. If you ignore that, an event that isn’t accepted just does nothing.)

The question in my mind is whether that performAccepted method is too tricky, with the embedded function. We could probably just as well return a transition or nothing and write the rest of that method inside a simple if, instead of as an embedded function.

Well, I don’t know. Let’s not talk about it, let’s try it and see what we think.

Sensibly, we try it rather than debate it …

The main method, first pass, would look like this:

function FSM:event(event, ...)
    local args = {...}
    local trans = self:acceptedTransition(event)
    if not trans then
        return false
    else
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if specific or generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
        return true
    end
end

And the new acceptedTransition:

function FSM:acceptedTransition(event)
    for i,trans in ipairs(self.tab.events) do
        if self:fromMatches(trans) and trans.event == event then
            return trans
        end
    end
    return nil
end

That’s exactly the same shape as perform was.

AIEEE, crash and burn!!!

Well. I certainly thought that was going to work, but no. All the FSM tests fail, and some of the NPC ones as well.

Begin to find out how bad it is.

Wait! Revert!

Somehow I failed to run the tests for the previous change. And the tests are all failing, and I know why:

function FSM:event(event, ...)
    local args = {...}
    return self:performAccepted(event, function(trans)
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if specific or generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end)
end

That if statement is backward, and needs to return false:

        if not specific or not generic then return false end

That fixes all but one:

4: Callbacks  -- Actual: false, Expected: true

I am sore tempted to revert the entire morning. In fact, I’ll check out last night, just to see if it even worked then.

Once I allow Working Copy time to do the checkout, I find that the code worked yesterday.

The issue relates to something I did above. Fine. Flush today’s work and start over, it was just a few lines. I think the issue will come from the reordering of the callbacks, but I don’t see why.

Brilliant recovery decision: start over.

Start over, smaller steps. Checkout today’s work. Undo. Revert to last night. Run tests. Green.

We believe that we should reorder the specific-generic here:

function FSM:event(event, ...)
    local args = {...}
    return self:performAccepted(event, function(trans)
        if false == self:doCallback("on_leave_state", event, trans, args) then
            return
        end
        if false == self:doCallback("on_leave_"..self.fsm_state, event, trans, args) then
            return
        end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end)
end

This is a behavior change, though in my mind I feel it’s harmless, but we’ll go slower:

function FSM:event(event, ...)
    local args = {...}
    return self:performAccepted(event, function(trans)
        if false == self:doCallback("on_leave_"..self.fsm_state, event, trans, args) then
            return
        end
        if false == self:doCallback("on_leave_state", event, trans, args) then
            return
        end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end)
end

Test now. I expect to see the failure. I do not.

Arrange for both to be called:

function FSM:event(event, ...)
    local args = {...}
    return self:performAccepted(event, function(trans)
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if not specific or not generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end)
end

Test. Fails.

4: Callbacks  -- Actual: false, Expected: true

Now let’s look at the test.

        _:test("Callbacks", function()
            local entered = false
            local left = false
            local tab = {
                initial = "s1",
                events = {
                    { event="go", from="s1", to="s2" },
                    { event="back", from="s2", to="s1" },
                },
                callbacks = {
                    on_enter_state = function(fsm,event,to,from)
                        entered = true
                    end,
                    on_leave_state = function(fsm,event,to,from)
                        left = true
                    end
                }
            }
            fsm = FSM(tab)
            _:expect(entered).is(false)
            _:expect(left).is(false)
            fsm:event("go")
            _:expect(entered).is(true)
            _:expect(left).is(true)
        end)

Well, those expects could be more communicative. But I’m still not clear on what is really broken.

Make the test more clear:

            _:expect(entered,"enter 1").is(false)
            _:expect(left,"left 1").is(false)
            fsm:event("go")
            _:expect(entered,"enter 2").is(true)
            _:expect(left,"left 2").is(true)

Test.

4: Callbacks enter 2 -- Actual: false, Expected: true

Working in smaller steps, we finally see the issue

Ah. I know what the bug is. We returned a falsey, and we’re allowing that. Only a false return is supposed to stop the flow:

function FSM:event(event, ...)
    local args = {...}
    return self:performAccepted(event, function(trans)
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if false==specific or false==generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end)
end

I expect this to make the test run. And it does.

Let’s reflect on this, see if we’ve learned anything at all.

Reflection

We’re back on solid footing.

One learning, and it’s a repeated one, is that sometimes I forget to run the tests. When I made that final change last time, I made it doubly wrong, skipping out on true and not on falsey, when it should be skipping out on an actual false only.

Had I run the tests, I’d have seen the failure sooner.

The other learning, and I think it’s more important, is that the tests worked. Even after the big failure was “fixed”, there was still one failing, and that led to the discovery that we needed to check explicitly for false.

You could also make the case that requiring an actual false return is a bit naff, but when a method returns without specifying, you get a nil, which is falsey, and we don’t want to require all callbacks to explicitly return true or false: that’s too much of a burden on all callbacks.

Anyway, we are solid now.

Commit: on_leave callbacks can return false to refuse the transition. callbacks are called specific then general.

Decision

Should we call it a day, or proceed?

Now we need to decide whether to do the refactoring to remove that nested function.

Arguing “against” is that I’m clearly not at my best, at least if we’re measuring by results.

Arguing “for” is that our tests are clearly pretty solid, and we’ve done it before and got it right, and it will make the code better.

Either decision might have been OK.

I’m going in. Let’s see if it’s a mistake.

Things go well, but with learning.

We’re working on this:

function FSM:event(event, ...)
    local args = {...}
    return self:performAccepted(event, function(trans)
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if false==specific or false==generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end)
end

We want to return either a transaction or nil into this method, so that we can if out the callback stuff instead of embed it in a function.

Here’s performAccepted. I mention it because I’m going to do this same refactoring in a different, and better, order.

function FSM:performAccepted(event, f)
    for i,trans in ipairs(self.tab.events) do
        if self:fromMatches(trans) and trans.event == event then
            if f then f(trans) end
            return true
        end
    end
    return false
end

We want a new method acceptedTransition, returning either the transition we want or nil. Let’s emulate this function to write it:

function FSM:acceptedTransition(event)
    for i,trans in ipairs(self.tab.events) do
        if self:fromMatches(trans) and trans.event == event then
            return trans
        end
    end
    return nil
end

However, as a smaller change, let’s use it in performAccepted:

function FSM:performAccepted(event, f)
    local trans = self:acceptedTransition(event)
    if trans then
        if f then f(trans) end
        return true
    else
        return false
    end
end

This runs. I found that I had to check for f, which tells me that someone is calling performAccepted besides my method above which is definitely passing in a function.

function FSM:can(event)
    return self:performAccepted(event)
end

That seems bad. It reads as if it would execute the transition, although in fact it will not. Anyway, we’ll soon fix that. Here again, a failing test sent me to finding this situation.

Gotta love those tests.

Now that the acceptedTransition function exists, we can use it:

function FSM:can(event)
    return self:acceptedTransition(event) ~= nil
end

Test. Green. Now, finally, our intended change:

function FSM:event(event, ...)
    local args = {...}
    local trans = self:acceptedTransition(event)
    if not trans then
        return false
    else
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if false==specific or false==generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
    end
end

Despite great confidence in this, I get an error:

2: transitions and messages  -- Actual: , Expected: I am your friendly neighborhood Horn Girl!
And do I have a super quest for you!

Again the tests save me. This is an NPC test, however, which means that a problem has escaped the FSM tests. What it is, is that there is no explicit return of true from the good side of the if:

function FSM:event(event, ...)
    local args = {...}
    local trans = self:acceptedTransition(event)
    if not trans then
        return false
    else
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if false==specific or false==generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
        return true
    end
end

Tests are green. Commit: refactor FSM not to require embedded local function to execute callbacks.

However. Why didn’t our tests detect that event must always return true or false and not nil?

Let’s look at the one that failed, over in NPC:

        _:test("transitions and messages", function()
            local msg
            local npc = NPC(FakeTile())
            local fsm = npc.fsm
            msg = npc:query()
            _:expect(msg).is(npc.fsm.tab.data.messages[1])
        end)

So we found that an empty message was returned, not the expected one:

Check her code:

function NPC:query()
    local worked = self.fsm:event("query")
    if worked then
        return self.msg or ""
    else
        return ""
    end
end

Well, we happen to know she hit the else. If we weren’t sure we’d change the or message or determine otherwise. Anyway, we know for sure now that we need to be sure that FSM;event always returns explicit true or false, never nil.

More accurately, we want a test such that if a transition does occur, it returns true.

We can check this in just one or a few places, because we’re really doing white-box testing here, and we know that the code only has two branches.

Here’s a useful test to modify:

        _:test("FSM changes state", function()
            _:expect(fsm:state()).is("s1")
            fsm:event("go")
            _:expect(fsm:state()).is("s2")
            fsm:event("back")
            _:expect(fsm:state()).is("s1")
            fsm:event("unknown")
            _:expect(fsm:state()).is("s1")
        end)

Extend it:

        _:test("FSM changes state", function()
            local worked
            _:expect(fsm:state()).is("s1")
            worked = fsm:event("go")
            _:expect(worked).is(true)
            _:expect(fsm:state()).is("s2")
            worked = fsm:event("back")
            _:expect(worked).is(true)
            _:expect(fsm:state()).is("s1")
            worked = fsm:event("unknown")
            _:expect(worked).is(false)
            _:expect(fsm:state()).is("s1")
        end)

Comment out our return bug fix:

function FSM:event(event, ...)
    local args = {...}
    local trans = self:acceptedTransition(event)
    if not trans then
        return false
    else
        local specific = self:doCallback("on_leave_"..self.fsm_state, event, trans, args)
        local generic = self:doCallback("on_leave_state", event, trans, args)
        if false==specific or false==generic then return end
        self:doCallback("on_enter_"..trans.to, event, trans, args)
        self:doCallback("on_enter_state", event, trans, args)
        self.fsm_state = trans.to
        --return true
    end
end

Test, expecting failure:

3: FSM changes state  -- Actual: nil, Expected: true

We get the NPC failure as well, of course. Put the line back in.

Tests all green: Commit: add checks in FSM for event always returning true or false.

Let’s sum up.

Summary

Today is one of those days where another author would have scrapped today’s chapter and done it again tomorrow. I do not do that, because I’m at least as interested in things that go wrong as I am in things that go right.

We were changing behavior intentionally, reordering the callback order, and unintentionally, relating to when we return true, false, or nil. The intentional changes broke things, and somehow, I failed to run the tests before shipping.

Chet Hendrickson, talking about tests, describes a horrible conversation one might have with one’s manager:

Manager
How did that terrible error get released? I thought you have all those great tests you keep talking about?
Me
Um, well, er, we didn’t run them.
Manager
#$%$#@!!

Yep, I did that. Had that conversation, with myself. Forgave myself, but recognized that the system, our procedures, let me down, because it’s possible to commit the code without running the tests and having them green.

I do not have an automated fix for that. I can’t even run Codea in the iPad automation stuff, or if I can, I’ve not figured out how. I will redouble my efforts to find a way.

Meanwhile I can just post signs all over the office saying “Ron, run the #$%$!! tests!”.

All that aside, some really good things to think about here:

First of all, we all, with whatever process we may have, can occasionally crash and burn. If we have a pair working, or a mob, it’s much harder to auger in. It would have taken two people, or five, to forget to run the tests. Surely there’d be some obsessive person in the room who always shouts “run the tests”.

Even so, some days, bad stuff happens. When bad things occur, we need to rally around, gather up the shards, and decide what to do.

If the problem is old, we may have to push forward and fix it. If it’s recent, we may do better to revert and start over.

Either way, we know that either we are unstable, or our cleats are worn, or the field is particularly slippery, so that we should proceed in even smaller steps than usual.

This is a common and recurring thing, in my experience. As we begin to feel confident in ourselves and the code, we take larger and larger bites, we feel that we’re “on a roll” and we delay testing and committing. Soon enough, we go too far and BLAMMO explode, crash, burn.

And go back to smaller steps.

But the Tests!!

But wow, didn’t our tests really help us today, especially when we actually ran them. They broke when they should, and they pointed fairly directly to the time we broke them and to the specifics of what was wrong.

They helped us. And, in return, when we noticed a case where a secondary test in NPC caught a problem in FSM, we added a test to FSM.

Bottom Line

It happens to everyone, or at least I tell myself that. Sometimes we’re not at the peak of our game, or something distracts us (darn cat) or something is more tricky than we thought. However it happens, sometimes the bear bites us, and when that happens, we need to regroup and recover.

The form of that recovery seems always to be much the same:

  • More and better tests;
  • More frequent testing;
  • Smaller steps.

Very interesting! See you next time!


D2.zip