Found It!
I’ve been chasing a serious intermittent defect for five or six hours. This is almost unheard of, the way I work. But I finally figured it out.
For the gory details of my mostly fruitless search, you could take a look at NO! A CRASH!. Here I’ll summarize what happened, what I did, what I think I’ve figured out, and then implement the fix. I am not sure even now whether I can write a test for the problem.
It all started, doctor, when I ran the main with an experimental idea of randomizing cell selection during flood, and the program raised an exception and crashed. This is of course a serious problem, so I dropped the morning’s plan in order to find and fix the problem, and to figure out where my tests were so inadequate as to permit this.
What followed was the longest debugging session I have had for very many years. There are probably a few reasons for that, perhaps some of them to my credit:
- My code is generally very simple, so crashes, if they occur, always occur and are therefore easy to find. This defect is very rare, so rare that although it has been in there a while I had never seen it crash before.
- Working in small steps, with many small tests, means that most defects turn up during testing and are therefore easy to find and fix.
- The problems I use for these articles are generally small, and do not involve threads, asynchronous behavior, or much user input that could break things.
The crash itself stemmed from this code:
class DungeonLayout:
    def _straight_path_room(self):
        cells = self._straight_path_cells()
        if cells:
            room = Room(cells, 'path')
            return room
        else:
            return None
The None return was never supposed to happen. I’m not even sure why I checked for the case. In play, the situation is quite rare, but when we get no cells back from _straight_path_cells, we return a None and the program crashes thereafter trying to send a Room message to None.
A quick hack that avoids the crash is to change the add_room method to ignore a None:
class DungeonLayout:
    def add_room(self, room):
        if room:
            self.rooms.append(room)
But while that avoids the crash, I just knew there was something important going on, so rather than just submit the fix and move on, I felt that I needed to figure out what was really going on. And thus started five or six hours of frustration.
I spent too much time putting in prints and then running the game until it crashed, so as to get whatever bit of info I may have wanted.
- Note
- It is my almost inviolable practice never to use the debugger. I prefer to work with prints, and consider even that to be a bit of a defeat. I avoid the debugger because, for me, it tends to devolve into long periods of testing, interspersed with a curse word and starting over because now I want to see something that happened three steps ago.
- I found that when debugging I was more just blindly stepping and hoping that I’d see something. With prints, I tend to want to be finding out something.
Each new print cost me probably one or two minutes of restarting the game until it crashed, although I finally found a key-sequence that wasn’t too hard to type to run run run until the crash display came out.
Early on I found a hack that made the program keep running, basically just not stuffing a None in where a Room belonged. I was sure that that hack avoided the crash, but that it happened at all seemed wrong, so I wanted to be sure. And I still couldn’t write a test to show the defect, because I didn’t know the situation that caused the empty cell list.
I spent too long running until I got the crash, and finally got the idea of priming the random number generator and then changing the priming value until I reliably got a crash. I started priming at 1, and got a reliable crash at seed 35. Now we’re cooking!
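The seed hunt described above can be sketched roughly like this. It is a minimal illustration, not the project's code: `find_crashing_seed` and `flaky_build` are hypothetical names, `flaky_build` is only a stand-in for whatever builds the dungeon, and the sketch assumes the program draws its randomness from Python's global `random` generator.

```python
import random


def find_crashing_seed(build, max_seed=1000):
    """Try seeds 1..max_seed; return (seed, exception) for the first crash, else None."""
    for seed in range(1, max_seed + 1):
        random.seed(seed)  # pin the global generator so this run is repeatable
        try:
            build()
        except Exception as error:
            return seed, error
    return None


def flaky_build():
    """Hypothetical stand-in for the real build: crashes on a rare random roll."""
    if random.random() < 0.02:
        raise AttributeError("'NoneType' object has no attribute 'cells'")


result = find_crashing_seed(flaky_build)
if result is not None:
    seed, error = result
    print(f'crash at seed {seed}: {error}')
```

Once a crashing seed is known, calling `random.seed(seed)` before the build replays the same sequence of choices, so every targeted print looks at the same situation.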
After too long, I realized that that list of cells could only be empty if the start and target cell of a would-be path were adjacent, so the path itself would be zero length. But if two rooms are directly adjacent, they should be found to be a “Suite” and we would not draw a path between them.
The code for all this is obviously correct, and besides that, it actually is correct. I began to print the path length and quickly found the zero-length path coming out:
Room(path: 4378954016):Cell(15, 50) - 0 - Room(path: 4378954016):Cell(15, 50)
By that time I was printing the id of the objects, so as to tell them apart, and curiously enough, we are clearly trying to draw a path between a cell and itself. But why and how?
The path-drawing code generates a new set of Suite instances, where a Suite consists of all the rooms, starting in a given room, that can be walked to without stepping outside into available space: all the rooms contiguous to the one where we start. We then create the next Suite by starting in a room which has not already gone into the preceding suites:
def define_suites(self):
    suites: list[Suite] = []
    unexplored = self.rooms.copy()
    while unexplored:
        suite: Suite = self.find_suite(unexplored[0])
        suites.append(suite)
        unexplored = [room for room in unexplored if room not in suite.room_set]
    return suites
Further printing told me that right before the crash, we were down to two suites, and that they both contained the same room, a path room.
I suspect that if I had taken a break right then, I might have found the issue sooner, but, with my brain buzzing with ideas, I thrashed around a bit. In particular, when the process started, the existing path room was in there, but it was removed from the unexplored list, yet found by the second time through the loop.
At this point I was thinking that I would need to change the find_suite to consider only rooms in the unexplored list: it was accepting any adjacent room.
But the thing is, that should be impossible anyway. If the path were adjacent to the rooms in the first Suite and adjacent to the cells in the second Suite, then, as a room adjacent to each, all the rooms in question should be in one single Suite, not two.
This caused me to focus on the notion that possibly drawing the first path caused it to be adjacent to a room that wasn’t the target, but coincidentally made a connection. Trying to imagine how that could happen had me chasing my tail for a while. I still suppose it might be possible but a look at the map drawn made me sure that it wasn’t what was happening.
Somewhere in there, I added code to tag each cell with an indication of what room it was in, and then trimmed that to do that only for the path cells. Here is the picture. The tiny number in each path cell is the last two digits of its id.

We can easily (OK, expand the picture and it’s at least possible to) see that all the path cells have the same id. That is not terribly surprising because the cell computation shown above returns all the path cells in a single room. I have a card in front of me that asks whether paths should be in one room or many, and its box is checked, meaning that at some point I must have thought about it and decided it was OK. The implication of this escaped me for quite a while.
I remained daunted, until later in the day, in another world, I mentioned the problem, explained it to my friends, then went on to think mostly about other things.
Then a glimmer. Suppose that before the picture above was drawn, we had no path cells where the numbers are. And suppose we found those two rooms at the top were isolated, and drew that little path between them. And then later in the same loop, we found the rooms down at the bottom that were separated, and drew the paths down there. All those paths are in the same Room.
And suppose we were not still connected. When we define suites, we get a Suite at the top right with those two rooms, plus the path as the Suite. Then, looking at the bottom area, we build a big Suite and again we find the path down there. It’s the same room and so now we have two Suite instances containing the same room.
Then, when we come around again, to try to join some two Suite instances that have the same room in them, we look for the two closest cells to join and sure enough, we return a cell from the room they hold in common for each end of the path. Cell(15,50) is in both Suites and is therefore the closest pair. We draw a zero-length path and crash.
There’s yer problem right there, bud, there’s yer problem right there.
So over a period of gently musing about the situation, I finally latched onto what would happen if the same room was in two separate suites, and after realizing that the picture above made that possible, I believed (and still do believe) that I understood the defect entirely, for large values of entirely.
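The mechanism is easier to see in a toy model. What follows is a sketch under loud assumptions, not the real classes: rooms are plain named sets of `(x, y)` cells, `find_suite` floods cell by cell, and `closest_cells` picks the pair with the smallest Manhattan distance. The real `find_suite` and `find_path_cells` may work differently; all names and coordinates here are made up for illustration.

```python
from itertools import product

# Hypothetical layout: two separate pairs of rooms, each pair joined by a path chunk.
rooms = {
    'A': {(0, 0)},
    'B': {(4, 0)},
    'C': {(0, 5)},
    'D': {(4, 5)},
    # A single 'path' room holding two disconnected chunks, as in the original code.
    'path': {(1, 0), (2, 0), (3, 0), (1, 5), (2, 5), (3, 5)},
}
room_of = {cell: name for name, cells in rooms.items() for cell in cells}


def find_suite(start):
    """Flood cell by cell from one room; return the names of all rooms reached."""
    found = set()
    frontier = list(rooms[start])
    seen = set(frontier)
    while frontier:
        x, y = frontier.pop()
        found.add(room_of[(x, y)])
        for neighbor in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if neighbor in room_of and neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return found


def define_suites():
    """Same shape as the article's loop: start each suite from an unexplored room."""
    suites = []
    unexplored = list(rooms)
    while unexplored:
        suite = find_suite(unexplored[0])
        suites.append(suite)
        unexplored = [name for name in unexplored if name not in suite]
    return suites


def manhattan(pair):
    (x1, y1), (x2, y2) = pair
    return abs(x1 - x2) + abs(y1 - y2)


def closest_cells(suite_1, suite_2):
    """Closest pair of cells between two suites, using every cell of every member room."""
    cells_1 = [cell for name in suite_1 for cell in rooms[name]]
    cells_2 = [cell for name in suite_2 for cell in rooms[name]]
    return min(product(cells_1, cells_2), key=manhattan)


suites = define_suites()
shared = suites[0] & suites[1]          # the path room shows up in both suites
source, target = closest_cells(*suites)  # and supplies both ends of the "path"
print(shared, source == target)
```

In this model the flood from `A` reaches the path only through its top chunk, so `path` is dropped from the unexplored list while `C` and `D` remain, and the flood from `C` then finds the same room again through its bottom chunk. The closest pair between the two suites is a path cell paired with itself: a zero-length path.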
The Fix
The fix will be to create each path chunk as its own Room, which will, I am very certain, cause this particular layout to run with the None check removed. (We may want to leave it in but my gut says we shouldn’t have to check that: people shouldn’t be adding None to the Layout. I don’t generally write defensive code: I write code that doesn’t offend. Usually.)
The change needs to modify this code:
class DungeonLayout:
    def ensure_connected(self):
        count = 0
        while not self.is_fully_connected and count < 10:
            count += 1
            room = self._straight_path_room()
            self.add_room(room)

    def _straight_path_room(self):
        cells = self._straight_path_cells()
        if cells:
            room = Room(cells, 'path')
            return room
        else:
            return None

    def _straight_path_cells(self):
        cells = []
        suites = self.define_suites()
        for s1, s2 in zip(suites, suites[1:]):
            cells.extend(s1.find_path_cells(s2))
        return list(set(cells))
(The actual code right now is 40 lines. I didn’t show my debugging print loops.)
We want to create a separate room for every pair in the zip. As written, ensure_connected assumes that we’ll just create one path room, so we’ll need to do this a bit differently. I was briefly tempted to do something fancy, like yield the cells, but that’s too fancy. I think I’ll just edit this to do what I need: I don’t see tiny steps for it. I’ll commit a save point. The code is full of prints but otherwise works, with all tests passing. Now:
def ensure_connected(self):
    count = 0
    while not self.is_fully_connected and count < 10:
        count += 1
        self._make_path_rooms()

def _make_path_rooms(self):
    suites = self.define_suites()
    # self.check_intersection(suites) # debug method
    for s1, s2 in zip(suites, suites[1:]):
        cells = s1.find_path_cells(s2)
        if cells:
            room = Room(cells, 'path')
            self.add_room(room)
        else:
            # supposed to be impossible
            print(f'found empty path! {len(suites)=}')
            for suite in suites:
                print(suite.room_set)
This is working with the add_room hack removed, so the crash is fixed. One test is failing:
def test_create_path_room(self):
    Cell.create_space(10, 10)
    dungeon = DungeonLayout()
    c_00 = Cell.at(0, 0)
    c_40 = Cell.at(4, 0)
    c_44 = Cell.at(4, 4)
    c_04 = Cell.at(0, 4)
    room_00 = Room([c_00], 'room_00')
    room_40 = Room([c_40], 'room_40')
    room_44 = Room([c_44], 'room_44')
    room_04 = Room([c_04], 'room_04')
    dungeon.add_room(room_00)
    dungeon.add_room(room_40)
    dungeon.add_room(room_44)
    dungeon.add_room(room_04)
    path_room = dungeon._make_path_rooms()
    assert len(path_room.cells) == 9
This is not working because _make_path_rooms no longer returns a room. And the test is assuming that we get only one path between the four corner rooms. We should now have three path rooms of length three.
Let’s see if we can modify the test sensibly:
    path_rooms = [room for room in dungeon.rooms if room.name == 'path']
    assert len(path_rooms) == 3
    for path in path_rooms:
        assert len(path.cells) == 3
We could check the individual paths but that’s enough to convince me. For now. I may be rushing.
Now I’m going to strip out my debug prints and commit this baby for real.
When running to see if all my prints are gone, other than the ones I’ve left in on purpose, I get this report:
found empty path! len(suites)=6
{Room(unknown: 4382402048), Room(diamond: 4332005392), Room(cave: 4382402096), Room(cave: 4382401664), Room(cave: 4382401712), Room(unknown: 4382402240), Room(unknown: 4382401760), Room(unknown: 4382401808), Room(round: 4382401328), Room(unknown: 4382401856), Room(cave: 4382401904), Room(unknown: 4381394368), Room(unknown: 4382402000), Room(cave: 4382401520)}
{Room(diamond: 4382401376)}
{Room(round: 4382401424)}
{Room(cave: 4382401568)}
{Room(cave: 4382402192), Room(unknown: 4382401616)}
{Room(cave: 4382401952)}
Now I do kind of suspect that the adjacency thing could still happen: we are connecting room A with B but our path happens to touch C, resulting in a surprise connection.
I think we can improve that print:
found empty path! len(suites)=6
source=Cell(18, 38), target=Cell(41, 6)
A look at the map and some tedious counting tells me that this is a legitimate empty path, trying to connect from the lower left to the upper right, and there is no path that can get there without crossing another room or path. (Our path between Suites is limited to the source room, the target room, and available cells.)
So the empty path can occur legitimately. Change that code:
def _make_path_rooms(self):
    suites = self.define_suites()
    # self.check_intersection(suites) # debug method
    for s1, s2 in zip(suites, suites[1:]):
        cells = s1.find_path_cells(s2)
        if cells:
            room = Room(cells, 'path')
            self.add_room(room)
        # cells can be empty if no path is possible
        # because we only accept source room,
        # target room, or available cells.
I don’t often comment my code but in view of my recent confusion, this seemed suitable.
I think we’re good. Green, main runs, commit: fix problem creating None as a room, resulting in a crash.
Let’s sum up.
Summary
I am serious when I say it has been years since I’ve had that long and confusing a debug session. (Of course, I am old and I may have forgotten, but they are certainly rare.)
As I mentioned at the top, that’s partly due to the simplicity of the problems I address, but I think it’s fair to say that a lot of it is due to the very small steps I take, almost every one driven by a new test made to work. In that mode, most defects are caught immediately after they are made.
But not every one, as we’ve seen here. I do encounter situations where I don’t realize that something is broken until I run the whole program, which generally shows that there is a test that I could have written but didn’t, for whatever reason: I didn’t think of it, couldn’t quite see how to test it, or was lazy.
That last one, lazy, is quite common. I am not a perfect programmer and I’m not here to tell you that I am, or that you should be. I am here to show you what happens when I work as I do, and to try to draw lessons for myself from it. If you just enjoy the laughs, that’s fine, and if you get ideas about your own process, that’s good. If they sometimes match my ideas, wow, that’s wonderful!
Looking back over this process, the main thing that I wish I had done sooner was to find the reproducible case by pinning the random number generator. I don’t know that that would have saved me tons of time but it would have saved perhaps 15 minutes, 30 at the far outside. It might, however, have given me faster insights had I not had to stop thinking and start clicking for a while.
Even if it saved very little time, it was helpful to have the reproducible version because the same situations arose every time, so that targeted printing was always looking at the same situation. That meant that I didn’t encounter spurious results that weren’t germane to what I had looked at moments before.
The trick of putting the path id into the cells was useful. I probably spent 15 to 30 minutes trying various things until I got the info I needed, after thinking that the last two digits of the id might be a good indicator.
The essential insight is that, as originally done, a single path room could be discontinuous and thus join two sets of rooms into suites without forcing them to be one suite. My implicit assumption was that if there was a room in common the suites should coalesce into one. Since there were two suites, I went down many wrong paths trying to see how there could possibly be a common room.
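Whether a room is discontinuous can be checked mechanically. The following is only a sketch: `is_contiguous` is a hypothetical helper, cells are assumed to be plain `(x, y)` tuples rather than the project's `Cell` objects, and adjacency is assumed to mean the four orthogonal neighbors.

```python
def is_contiguous(cells):
    """Return True if every cell can be reached from any other via 4-neighbor steps."""
    cells = set(cells)
    if not cells:
        return True  # an empty room is treated as trivially contiguous here
    start = next(iter(cells))
    seen = {start}
    frontier = [start]
    while frontier:
        x, y = frontier.pop()
        for neighbor in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if neighbor in cells and neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return seen == cells


one_chunk = [(1, 0), (2, 0), (3, 0)]
two_chunks = one_chunk + [(1, 5), (2, 5), (3, 5)]
print(is_contiguous(one_chunk), is_contiguous(two_chunks))
```

A check along these lines, run against each path room, would flag the kind of room that caused the trouble: a single room whose cells fall into more than one connected chunk.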
I realized just now that at some points I was unaware that the path finding code was only considering the source and target rooms as legitimate travel points, other than freshly available cells. That had me considering whether I need to add the path room to unexplored, which I never did, but thinking along those lines probably cost me some time.
This makes me think about, oh no …
- LLM-“AI”
- With the exception of some simple line completion from PyCharm, I have written every line of this program, and I have read every line, most of them multiple times. I have a very good understanding of how this program works, in detail. My memory is of course limited, but by and large, I know how this program works.
- If I had “vibe-coded” it, or otherwise allowed an “AI” to write it, I would surely not know the program as well. If you think I had trouble with this defect in code that I wrote and understand rather well, imagine my chances with a program that I didn’t understand.
- I feel no particular need to be fair to “AI” but it is only fair to observe that in a team situation, with programmers working alone on various tasks, there might be at most one developer who was qualified to dig into this issue, with the others behind the wave on this topic. In a pair programming or mob programming situation, we’d have more developers who would have a shot at this problem.
- Something to think about.
Would it have helped me to have someone to pair with, even someone who didn’t know how things work, or maybe even especially someone who didn’t know how it works? Almost certainly. Even a rubber duck is a decent help. Writing these articles is a help, because I might explain something to you and thus come to a realization. I suspect that in yesterday’s screed there isn’t as much explanation as simple tracing of my thrashing. So a human asking questions might well have helped.
No such humans were available. Finally, having eliminated the impossible, what remained was the realization that a discontinuous room has weird properties vis-a-vis neighboring rooms. The fix was pretty simple and in fact resulted in fewer lines of code than the original. Curious how that happens.
Lessons for me? I’m not sure. Keep on doing the good stuff, I guess. I really feel that this time it’s just that the bear bit me. I’m not satisfied with that answer but I don’t see a place over the past few days where I went wrong.
Possibly more testing of the suite path making would have helped. The fact that I still can’t think of a decent test makes me think that there must be some fuzz in my concepts there.
A weird few days. I emerge, bruised but victorious. May you do the same, ideally with fewer bruises!
See you next time!