Debugging hard to reproduce bugs

Debugging a game can be quite frustrating at times. What makes it so is that in many cases reproducing a crash seems impossible. How can you pinpoint a memory corruption bug that causes a crash many frames after the incident? How can you replay the exact game up to the point right before the crash, then a bit earlier, until you can see what went wrong? You can never play the same game twice, and even a single frame lasting a little longer can dramatically change the execution and the point of the crash.

The system I use for such cases is a full replay system. To be able to reproduce a bug, you must be able to replay the exact game. And by game I mean the whole session, from the start of execution all the way to the point of the crash. To do so you must think of the game engine as a deterministic black box that has specific nondeterministic inputs. If these nondeterministic inputs are logged, we can reproduce the exact execution as many times as we like.
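To make the black box idea concrete, here is a minimal sketch in Python. It is a toy and not code from any real engine: `step`, `run`, `fingerprint` and the shape of `input_log` are all hypothetical names chosen for illustration. The point is only that when the update depends on nothing but the previous state and the logged inputs, feeding the same log twice gives the same result, even with uneven frame times.

```python
import hashlib

def step(state, frame_inputs):
    """Advance the toy game by one frame; depends only on its arguments."""
    x, y = state
    dx, dy, dt = frame_inputs  # controller axes and frame time, all taken from the log
    return (x + dx * dt, y + dy * dt)

def run(input_log):
    """Play a whole session from the initial state using the logged inputs."""
    state = (0.0, 0.0)
    for frame_inputs in input_log:
        state = step(state, frame_inputs)
    return state

def fingerprint(state):
    """Hash the state so two runs can be compared cheaply."""
    return hashlib.sha256(repr(state).encode()).hexdigest()

# Hypothetical log: (dx, dy, dt) per frame. Note the uneven frame times.
input_log = [(1, 0, 0.016), (1, 1, 0.033), (0, -1, 0.016)]
first = fingerprint(run(input_log))
second = fingerprint(run(input_log))
```

Comparing a fingerprint like this between the recorded run and the replay is one cheap way to notice that some input is still escaping the log.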

The key here is to understand these inputs. The most obvious one is user input, whether that comes from the controller, the mouse, or the keyboard. However, these are not the only inputs that add randomness to the system. A few more things need to be considered inputs in order to get fully deterministic behaviour. If your game uses random numbers, the random number generator is an input. You must either log its output or reuse the seed from the session that crashed. If your game queries the system time (of course it does!) then you have another input! We are basically looking for every point where the game engine reaches outside of itself to get data, or is fed data.
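For the random number generator, reusing the seed is usually the cheaper of the two options. The sketch below assumes the game draws all of its random numbers from one dedicated generator; `new_session_seed`, `make_game_rng` and `roll_loot` are hypothetical helpers, not part of any engine.

```python
import random
import time

def new_session_seed():
    """Pick a seed for a fresh session; this value must be written to the log."""
    return time.time_ns() & 0xFFFFFFFF

def make_game_rng(seed):
    """Dedicated generator, so unrelated code cannot advance the game's sequence."""
    return random.Random(seed)

def roll_loot(rng, count):
    """Hypothetical gameplay code that draws random numbers."""
    return [rng.randint(1, 6) for _ in range(count)]

seed = new_session_seed()                      # recorded along with the session
original = roll_loot(make_game_rng(seed), 5)
replayed = roll_loot(make_game_rng(seed), 5)   # replay: reuse the logged seed
```

This only holds if the replay makes the same calls in the same order. If some code path draws numbers from a shared global generator, the sequence can drift, and logging each returned value is the safer choice.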

So the inputs are:

User controller input
Random number generator
System time

You must hook a logging mechanism into all of these inputs and record every single event. When it is time to replay a sequence, you start the game with a log replay mechanism at these hooks, so that every call is answered with the same value as in the original play. This way you will follow the exact execution path that was followed when recording, but this time it's like you know the future, and you can stop at any time before the crash to investigate.

All of this, of course, holds true in a single-threaded environment. You must not let the nondeterminism of the OS scheduler get in the way. Tests should be done in a single-thread lockdown mode. If the crash is not reproducible at all in single-threaded runs, it is probably a multi-threaded race problem in the first place, and all I can wish you is: God help you! ;)
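The record and replay hooks described above can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: `Recorder`, `Replayer` and `run_game` are hypothetical names, the controller is faked with `random.choice`, and the log is kept as a JSON list rather than whatever format a real engine would use. Everything runs on a single thread.

```python
import json
import random
import time

class Recorder:
    """Answers each hooked call from the live source and logs the answer."""
    def __init__(self):
        self.log = []

    def call(self, name, live_fn):
        value = live_fn()
        self.log.append([name, value])
        return value

class Replayer:
    """Answers each hooked call from the log, in the recorded order."""
    def __init__(self, log):
        self._entries = iter(log)

    def call(self, name, live_fn):
        recorded_name, value = next(self._entries)
        if recorded_name != name:
            # The game asked for something different than it did when recording.
            raise RuntimeError(f"replay diverged: expected {recorded_name}, got {name}")
        return value  # live_fn is never called during replay

def run_game(inputs, frames):
    """Toy game loop; every value from outside goes through inputs.call()."""
    x = 0.0
    for _ in range(frames):
        now = inputs.call("time", time.perf_counter)
        axis = inputs.call("pad", lambda: random.choice([-1, 0, 1]))  # stand-in for a controller
        jitter = inputs.call("rng", random.random)
        x += axis * (1.0 + jitter) + (now % 1.0) * 1e-3
    return x

recorder = Recorder()
original = run_game(recorder, frames=100)
saved = json.dumps(recorder.log)                  # what would be written to disk
replayed = run_game(Replayer(json.loads(saved)), frames=100)
```

Checking the recorded name on every replayed call is a design choice worth keeping: the first mismatch marks the earliest point where the replay left the recorded execution path, which usually means an input that has not been hooked yet.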

#AltDevBlogADay

Charilaos Kalogirou