Dungeon 167 - BUSTED!
I shipped a defect yesterday. And there was a test for it! What happened?
Last night I wanted to try an idea that Bruce Onder suggested, which was to turn off the crawl and see how the game feels without it. I was sure it would be horrid, and it was. However, I also discovered a serious defect.
Whenever a monster other than Mimic gets in a battle, the game crashes!
It took me a while at my backup iPad to find the problem, and it was this:
function Animator:defaultAnimation()
return self.moving
end
That should have said:
function Animator:defaultAnimation()
return self.animations.moving
end
When I fixed the problem, three test checks failed:
_:test("Animator defaults", function()
local entry = {
dead={"dead"},
hit={"hit"},
moving={"moving"},
}
local animator = Animator(entry)
_:expect(animator:getAnimation("attack")).is(moving)
_:expect(animator:getAnimation("hide")).is(moving)
_:expect(animator:getAnimation("idle")).is(moving)
end)
That should have said something like this:
_:test("Animator defaults", function()
local mm = {"moving"}
local entry = {
dead={"dead"},
hit={"hit"},
moving=mm,
}
local animator = Animator(entry)
_:expect(animator:getAnimation("attack")).is(mm)
_:expect(animator:getAnimation("hide")).is(mm)
_:expect(animator:getAnimation("idle")).is(mm)
end)
The defect in the test is that when it asks for moving
, it’s referring to an undefined global variable, i.e. nil, and the code was referring to Animator.moving, which is also nil, so the test ran fine. I must never have seen it fail.
If I recall, I wrote the default method first and then wrote the test for it second. The test worked, which didn’t surprise me because I mistakenly thought the code was obviously correct.
I have a better idea this morning for that test.
_:test("Animator defaults", function()
local entry = {
dead={"dead"},
hit={"hit"},
moving={"moving"},
}
local animator = Animator(entry)
_:expect(animator:getAnimation("attack")).is(entry.moving)
_:expect(animator:getAnimation("hide")).is(entry.moving)
_:expect(animator:getAnimation("idle")).is(entry.moving)
end)
Still works, and it’s a bit more clear about what we expect to get back.
Commit: fix defect and improve test for Animator:defaultAnimation.
I already fixed the zip file, last night, so that my users, if any, wouldn’t encounter the defect.
Let’s reflect.
Reflection
If I recall correctly, I probably had a test for getAnimation
, and when I wrote it, I realized there was the possibility that the animation wasn’t present and that there should be a default, namely the “moving” animation, which every monster is required to have. Now that I think of it, let’s look at the code involved:
function Animator:init(mtEntry)
self.animations = {}
self.animations.dead = mtEntry.dead or error("need dead animation")
self.animations.hit = mtEntry.hit or error("need hit animation")
self.animations.moving = mtEntry.moving or error("need moving animation")
self.animations.attack = mtEntry.attack
self.animations.hide = mtEntry.hide
self.animations.idle = mtEntry.idle
self.animationTimes = mtEntry.animationTimes or { 0.23, 0.27 }
self.animation = Animation()
end
function Animator:getAnimation(name)
return self.animations[name] or self:defaultAnimation()
end
function Animator:defaultAnimation()
return self.animations.moving
end
This is too clever. In init
, we know we have an animation named “moving”, because we just checked it. There is no reason in the world to blindly store the next three, and then later, in a lazy fashion, deal with them being missing.
Here’s what would be better:
function Animator:init(mtEntry)
self.animations = {}
self.animations.dead = mtEntry.dead or error("need dead animation")
self.animations.hit = mtEntry.hit or error("need hit animation")
self.animations.moving = mtEntry.moving or error("need moving animation")
self.animations.attack = mtEntry.attack or mtEntry.moving
self.animations.hide = mtEntry.hide or mtEntry.moving
self.animations.idle = mtEntry.idle or mtEntry.moving
self.animationTimes = mtEntry.animationTimes or { 0.23, 0.27 }
self.animation = Animation()
end
function Animator:getAnimation(name)
return self.animations[name]
end
Now this does feel a bit more scary. “What if there is no animation by that name. The “or default” thing is comforting.
But making sure during init
that everything is proper is better.
I think what I’m going to do is do both. Correctly initialize, but also provide for the default.
function Animator:getAnimation(name)
return self.animations[name] or self:defaultAnimation()
end
Belt and suspenders? Or needless and confusing duplication of effort? Honestly, I’m not sure.
What I am sure of is that I put a defect into the code right here, and even though I wrote a test for it, the test, and the code, were both wrong.
Usually, when both the code and test are wrong, I find that the test breaks. Here, in Lua, this may not hold true as often, because of Lua’s willingness to provide a nil whenever you reference something no one has ever heard of. I keep saying that there are ways to change that behavior, but so far I haven’t done it.
One advantage to writing an article every time I code is that I can get a clear view of history. What happened was this:
I implemented a test for these defaulted animations back when there were set up in Monster. And that test was wrong. It did fail, looking at first for defaultAnimation
. I implemented that method, also incorrectly, as we saw above. But the test passed, because they both had similar mistakes, so I wound up comparing a nil to a nil.
After that, as I installed Animator/Animation, I relied on the test, assumed that the code was good, and left off the or
clauses on the initialization, as redundant. Except they weren’t, they were the only reason the code could work.
But the tests ran. And, as one should, I believed them.
But this time, they lied.
I think I’d say that I followed our standards of care pretty well here. I wrote the test, I made it work, and I relied on it later while moving the code around.
Does that mean we have nothing to learn? Well, not quite.
The tests included
something.is(moving)
Where moving was an undefined global. If Lua didn’t allow access to undefined globals, the test would have errored, we’d have seen the flaw, we’d have fixed it, then the test would have failed because the code was (also) wrong, and we’d have fixed that.
Lua is to blame: it allows access to undefined globals.
I believe there is a well-known way to change Lua so that this cannot happen. Let’s try it.
Deny Access to Undefined Globals
There is a nice page on the Lua Wiki. The solutions offered there seem more invasive than I’d like. In particular, Our code does define some global variables during the run, such as Runner
, which contains the GameRunner instance. There may be others.
We could argue that we shouldn’t do that, and that if we want some well-known objects, we should encapsulate them in a single global that contains them as fields.
So maybe we should try this thing and see what happens.
But there are other options to read about.
On Lua.org, there’s a page about Defining Global Variables. It provides a very simple patch of code:
setmetatable(_G, {
__newindex = function (_, n)
error("attempt to write to undeclared variable "..n, 2)
end,
__index = function (_, n)
error("attempt to read undeclared variable "..n, 2)
end,
})
Let’s paste that into our setup and see what happens.
function setup()
if CodeaUnit then
runCodeaUnitTests()
CodeaTestsVisible = true
end
local seed = math.random(1234567)
print("seed=",seed)
math.randomseed(seed)
showKeyboard()
TileLock = false
Bus = EventBus()
Runner = GameRunner()
Runner:createLevel(12)
TileLock = true
if false then
sprite(xxx)
end
setmetatable(_G, {
__newindex = function (_, n)
error("attempt to write to undeclared variable "..n, 2)
end,
__index = function (_, n)
error("attempt to read undeclared variable "..n, 2)
end,
})
end
Well, the main thing that happens is that Lua crashes hard, right out to the desktop. So that won’t do, will it?
I’ve filed a report to the Codea forum. Didn’t I try something like that once before?
Yes. There is a global variable collide
, which is undefined but repeatedly accessed by Codea in the draw loop. Setting it to false in my code fixes the immediate crash, but I’ve seen other crashes.
I managed to use the __index
part of the patch above long enough to find a couple of problems in my code. Here’s one:
function Tile:drawLargeSprite(center)
-- we have to draw something: we don't clear background. Maybe we should.
local sp = self:getSprite(self:pos(), tiny)
pushMatrix()
pushStyle()
textMode(CENTER)
translate(center.x,center.y)
if not self.currentlyVisible then tint(0) end
sp:draw()
--[[
if self.roomNumber then
fill(255)
text(tostring(self.roomNumber), 0,0)
end
--]]
popStyle()
popMatrix()
end
The reference to tiny
at the top of that method is global. Not good. Changed it to false
.
There’s another global in TileArbiter.
TileArbiter = class()
TA = TileArbiter
TA_table = nil
This table is filled in with a lazy init:
function TileArbiter:createTable()
-- table is [resident][mover]
if TA_table then return end
local t = {}
t[Chest] = {}
...
TA_table = t
end
Whenever we create a TileArbiter, we call createTable, but it only executes once. Unfortunately, the code references the global. We could fix that with a single “top global” that we’d initialize and then use for things like this. We might wind up wanting to protect it against references too … but less often.
I can sort of fix this problem. Let’s try, just to see what else we can find. The concern is that we can’t just define a global variable as nil
because that basically undefines it. We have to set it to something.
Let’s do this:
function TileArbiter:createTable()
-- table is [resident][mover]
if TA_table[Chest] then return end
local t = {}
...
And define TA_table:
TA_table = {}
And run some more. Even if we can’t leave this check in the system, it’s finding interesting things.
I observe that with the patch where it is now, the tests are not protected against reference to globals. That means that, so far, we would not detect the defect we fixed this morning.
With the TA_table
change in place, I can play the game and get no more diagnostics. I think I’ll plug in the setting half just to see what it finds. It finds nothing. I don’t think I dare leave it in, it’s too likely to turn up something and crash a user, but it’s worth at least turning on for testing sometimes.
I’ll comment it out but save it. We may be able to make good use of it. Commit: include commented-out code to protect against undeclared globals. And TileArbiter fix global not to be nil.
I suppose if I were a fanatic, I’d have done that in two separate commits. But I’m not that kind of fanatic.
Let’s sum up.
Summary
We reported on a defect from yesterday, and improved the short-term fix I made to the tests. The problem was exacerbated by access to an undefined global.
We tested some Lua code to prevent access to undefined globals, and found that Codea still accesses the undefined variable collide
. We dare not leave that code in, lest it break something for a user, but running it we did find two issues involving globals. One was a reference to tiny
which probably resulted from a “refactoring” that moved some code out from under a parameter. The other was a lazy-initialized global table that was intentionally defined as nil to start. That’s incompatible with the global checker code, so as a simple fix we initialized it to an empty table and checked it for an entry.
We would do well to define a single “top” global and keep all our other globals in it. This would help insure that we don’t define any late in execution, which would make it safer to have something like the checker code we looked at today in place all the time.
Optionally, we could allow the reference but log an error. In a real product, we might well want to do that. Both of our accesses to globals were in fact “OK”, in that neither one of them broke anything. A global checker finding those two probably shouldn’t bring down the system.
Error logging is pretty much beyond the scope of a game with somewhere in the range of zero users, although it could be an interesting exercise. But where would I log the errors? Send myself emails? Put Oracle up on my site? Get an S3 instance for error logging? None of these appeals to me much, I have to admit.
Bottom line, we found and fixed a bug, we improved some code, and we learned a bit more about checking for improper global access. Along the way, we get a good idea about avoiding them as much as possible with a “top” global.
All in all, not bad for Sunday before brekkers.
See you next time!