Everything is going swimmingly, renderer pumping out frames 60 times a second, audio hitting all the right notes, physics keeping solid objects very definitely solid… and then the wheels come off. One second it’s 150kph down the straight, the next it’s a closeup of the side-wall of a ditch and a distinct smell of burning. There’s only one description for it – it’s the “well, I’ve never seen it do that on my machine” moment.

The problem, generally, with such moments is that they rarely have the grace to occur in a vaguely controlled environment, much less one attached to a debugger. Some systems have built-in crash screens, but many don’t and even on those that do the data is often not what you need. And thus, you inevitably have approximately 15 seconds to look at the hideous mess filling the screen and tell your producer that it’s definitely, absolutely a bug that you’ve already fixed but which isn’t in this build yet. As it’s much easier to lie about these things convincingly when you have at least some vague concept of What The Hell Just Happened, a good strategy for such moments of epic fail is a must.

One of the best (for, er, certain values of “best”) solutions I’ve seen to this problem came many years ago on a PS2 project, where the process of printing debug information to the screen was so flakey that we had resorted to changing the TV overscan border colour to represent different types of crash (red = exception, blue = out of memory, etc). Changing palette entries or the border colour on every scanline to generate colour gradients (“raster bars”) was a popular technique in the 16-bit demo era, but by that point something of a fading memory… to all but one programmer who realised that not only could the PS2 do the same, but since in a crashed state the entire undivided power of the CPU could be put on the task, he could swap the border colour fast enough not just to change it per-line, but within a line. And thus, the “huge (somewhat jelly-like due to timing inaccuracies) letters written in the border of the TV image” crash screen was born – it could fit about 8 characters on each side, just enough for an address and a few key notes on what had happened. Crashing had become a spectator sport.

Figuring out that all is not as it should be

Of course, one of the rather awkward things about random crash bugs is that whilst it’s generally obvious to an external observer that something has gone wrong, it’s not necessarily so easy for the game itself to tell. There are a few useful tools that can be deployed here to catch these situations:

Exception handlers

Probably the single most useful tool in the debugging arsenal is the ability to set a custom exception handler. The exact mechanism for doing this varies from system to system, but the general principle is universal – you register a function which will get called if something triggers an exception on the CPU. CPU exceptions trap low-level failures – reading or writing an invalid address, executing a non-existent instruction, that sort of thing, which is usually where you end up after a null pointer dereference or similar bug.

As such, exception handlers are very powerful – set up correctly, they can catch any state which would normally cause CPU execution to halt and give you a chance to do something. The downside is that they can be highly fiddly beasts at times – for example, expect that your exception handler will need to provide its own stack if it does any significant work, and that interrupts will be disabled on entry (quite possibly in a manner such that re-enabling them whilst inside the handler is either impossible or suicidal). Hence, writing anything that looks like “normal” code inside one of these is an awkward business.
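
To make that a bit more concrete, here’s a rough sketch of what registering a handler can look like using POSIX signals – console SDKs each have their own API for this, but the overall shape is much the same, and the separate stack is the detail worth noticing. Everything here is deliberately minimal, and the “reporting” is just a printf.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

// A separate stack for the handler -- as noted above, you cannot assume the
// crashed thread's stack is in a usable state.
static char g_handler_stack[64 * 1024];

static void crash_handler(int sig, siginfo_t* info, void* context)
{
    // Do as little as possible: gather the data, report it, then get out.
    // (printf inside a signal handler is already pushing our luck.)
    printf("Crashed: signal %d, faulting address %p\n", sig, info->si_addr);
    _exit(1);
}

static void install_crash_handler(void)
{
    stack_t ss = {};
    ss.ss_sp   = g_handler_stack;
    ss.ss_size = sizeof(g_handler_stack);
    sigaltstack(&ss, NULL);                 // give the handler its own stack

    struct sigaction sa = {};
    sa.sa_sigaction = crash_handler;
    sa.sa_flags     = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);          // invalid data access
    sigaction(SIGILL,  &sa, NULL);          // invalid instruction
    sigaction(SIGBUS,  &sa, NULL);          // bus error / misaligned access
    sigaction(SIGFPE,  &sa, NULL);          // arithmetic faults
}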

Assertions

I’m going to go out on a limb here and assume that everyone is already using assertions – they can be a very easy-to-tap source of information about what went wrong. When an assertion fires you already have (at a minimum) the file/line number in question and, depending on how you have things set up, potentially a witty message from whoever wrote the system lamenting whatever went wrong. Even better, when you hit an assertion you are probably still in pretty good shape – nothing has gone badly wrong… yet. Which means there’s less need to worry about the system falling apart around you when you try to report this fact…
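
As a sketch of how an assertion can feed the same reporting path as everything else, something along these lines works – report_fatal_error() here is a stand-in for whatever reporting function your codebase actually has:

// Hypothetical reporting function -- substitute your own.
void report_fatal_error(const char* category, const char* file, int line,
                        const char* message);

#define GAME_ASSERT(expr, msg)                                       \
    do                                                               \
    {                                                                \
        if (!(expr))                                                 \
            report_fatal_error("ASSERT", __FILE__, __LINE__,         \
                               #expr " -- " msg);                    \
    } while (0)

// Usage: GAME_ASSERT(player != NULL, "no player bound to the camera");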

Timeouts

The humble timeout routine can be surprisingly useful in a wide range of circumstances – most notably when dealing with hardware devices such as GPUs, network traffic or even simply code which “may not be guaranteed to reach a terminating point within finite time” (everyone has at least one of those functions somewhere… just try to make sure it isn’t called more than once a frame or so and everything will be fine…). Often you can recover from a timeout condition without throwing a fatal error, but sometimes it will just have to be a toys-out-of-the-pram moment.
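
In sketch form, a poll-with-a-deadline loop is about as complicated as this needs to get – gpu_fence_signalled() and time_now_ms() below are placeholders for whatever your platform actually provides:

typedef unsigned int u32;               // assuming the usual 32-bit typedef

u32  time_now_ms(void);                 // placeholder: millisecond timer
bool gpu_fence_signalled(u32 fence);    // placeholder: has the GPU got there yet?

static bool wait_for_gpu(u32 fence, u32 timeout_ms)
{
    const u32 start = time_now_ms();
    while (!gpu_fence_signalled(fence))
    {
        if (time_now_ms() - start > timeout_ms)
            return false;   // caller decides: retry, reset the GPU, or panic
    }
    return true;
}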

Watchdogs

A close relative of the timeout, a watchdog is a separate thread, interrupt or similar which polls a task at regular intervals to make sure it is getting somewhere. The simplest version of this is to have a counter which you increment in the watchdog at set intervals, and reset to 0 at the start of your main loop. Then if the watchdog counter ever hits a reasonably high number (say 10 seconds without the main loop resetting it), it’s a reasonable bet that something has gone into an endless loop. If supported by your architecture, the watchdog may be able to snoop the current state of the main thread and find out exactly where this is, or it may simply have to put up a message saying “something broke” – but at least even that provides some information as to the nature of the problem.
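
A minimal sketch of the counter version described above – the once-a-second tick would normally come from a timer interrupt or a low-priority thread, both of which are platform-specific, so only the counter logic is shown, and report_fatal_error() is the same hypothetical reporting function as in the assertion sketch:

void report_fatal_error(const char* category, const char* file, int line,
                        const char* message);   // hypothetical, as before

static volatile unsigned g_watchdog_ticks = 0;

void watchdog_tick(void)            // called once a second by a timer/thread
{
    if (++g_watchdog_ticks >= 10)   // 10s without the main loop resetting us
        report_fatal_error("WATCHDOG", __FILE__, __LINE__,
                           "main loop has not run for 10 seconds");
}

void main_loop_heartbeat(void)      // called at the start of the main loop
{
    g_watchdog_ticks = 0;
}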

What do you want to get out of this?

So now that you know that things have gone wrong, the next step is to gather as much useful information as possible. Obviously in an ideal world you probably want the entire machine state – and indeed in some cases it is perfectly possible to do this by saving a dump of the contents of RAM and various registers. However, even if you can extract this level of information, a few more digestible facts make for a much easier first look at a problem:

Where

The single most important piece of information you can get for debugging purposes is generally “where did we crash?”. This is even more important than “how did we crash?”, for the simple reason that 90% of the time there is only one way any given line of code can reasonably go wrong, so knowing “where” already answers “how”.

In most cases (assertion failures being the obvious exception, if you’ll pardon the pun), the information you’ll be able to get about the program’s point of failure will be the Program Counter address. This can be converted into a human-readable location in source code by looking it up in the map file for the executable (you do keep copies of the map files and debug symbols for every version you release, don’t you? ^-^). This is usually best done offline (just get the address out of the game and then look it up by hand or with a tool), but it can be integrated into the game itself if you are prepared to spend the time (and don’t mind the lookup code running the risk of crashing itself).
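
How you actually lay hands on the PC is, of course, entirely platform-specific. As one concrete illustration (Linux/x86-64 with glibc – other architectures and SDKs use different structures and register names), the third argument of a SA_SIGINFO signal handler can be cast to a ucontext_t and the register file pulled out of it, extending the handler from the earlier sketch:

#include <signal.h>
#include <stdio.h>
#include <ucontext.h>   // REG_RIP may also require _GNU_SOURCE to be defined
#include <unistd.h>

static void crash_handler(int sig, siginfo_t* info, void* context)
{
    ucontext_t* uc = (ucontext_t*)context;
    // REG_RIP is the x86-64 name; PowerPC, MIPS, ARM and friends all differ.
    unsigned long long pc = (unsigned long long)uc->uc_mcontext.gregs[REG_RIP];
    printf("Crashed at PC 0x%llx -- look this up in the map file\n", pc);
    _exit(1);
}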

Whilst the PC value alone is invaluable, if possible it can also be very beneficial to get a stack trace from the point of the crash. Walking the stack is a bit of a black art, and very platform-specific, but the basic principle works like this:

  1. Start from the current PC address
  2. Walk backwards through the instruction stream, noting any manipulation of the return address register/link register and the stack pointer
  3. When you reach the function entry point, calculate where the return address register value that existed when the function was called currently lies (either in a register or on the stack), and retrieve it
  4. Move up to that call point and repeat from step 2 until every function in the stack has been traversed (or some depth limit is reached)
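
If your compiler keeps frame pointers around there is a much lazier variant than the instruction-scanning approach above: each frame then starts with the saved frame pointer of the caller followed by the return address, so you can simply follow the chain. The layout below is an assumption – check your platform’s ABI, and remember that optimised builds frequently omit frame pointers entirely:

#include <stdint.h>

struct StackFrame            // assumed frame layout -- verify against your ABI
{
    StackFrame* previous;    // saved frame pointer of the calling function
    void*       return_address;
};

static int walk_stack(StackFrame* frame, void** out_addresses, int max_depth)
{
    int depth = 0;
    while (frame && depth < max_depth)
    {
        // Sanity-check before dereferencing -- we only got here because
        // something has already gone badly wrong. Ideally also check that the
        // pointer lies within the thread's known stack range.
        if (((uintptr_t)frame & (sizeof(void*) - 1)) != 0)
            break;
        out_addresses[depth++] = frame->return_address;
        frame = frame->previous;
    }
    return depth;
}

// On GCC-style compilers __builtin_frame_address(0) gives a starting frame for
// the current thread; walking a crashed thread means fishing the frame pointer
// out of the exception context instead.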

On some architecture/compiler combinations, stack walking can be done with almost 100% accuracy; on others it is something of a crapshoot (especially in optimised builds). Link-time code generation, in particular, can muddy the waters even more. Sometimes the SDK for your platform will provide stack walking functions – if you do decide to roll your own, it is wise to take as many precautions as possible to ensure it does not itself crash due to bad data (remember that you got here because something had already gone wrong…).

How

In most instances the answer to “how did we crash” is very short, very simple, and virtually useless. In the case of a CPU exception, there will generally be information passed to the handler about the nature of the exception – invalid data access, invalid instruction and a handful of others are the usual candidates here. For an assertion, timeout or other homemade construct, then you can be a bit more verbose. At any rate, though, there isn’t a lot more you can do other than report what you were told and move on.
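
In sketch form, that reporting often boils down to no more than this – POSIX signal numbers are used purely for illustration, as console SDKs have their own exception cause codes:

#include <signal.h>

static const char* describe_crash(int sig)
{
    switch (sig)
    {
        case SIGSEGV: return "invalid data access";
        case SIGBUS:  return "bus error / misaligned access";
        case SIGILL:  return "invalid instruction";
        case SIGFPE:  return "arithmetic exception";
        default:      return "something else entirely";
    }
}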

Why

If someone could write a function which examined the state of a crashed program and told the user exactly why it crashed, that person would be deified as a programming genius, lynched for putting an entire industry out of work, or both simultaneously. Until that day, though, the best we can do is assemble a collection of clues and try to establish the cause of the crash from those.

This is a pretty application-dependent problem, as the design of other systems will have a big influence on what constitutes “useful information” in debugging any particular issue. However, a few good candidates are:

  • The values in CPU registers and on the stack
  • printf() or similar output leading up to the point of the crash (save everything you print in a ring buffer for exactly this purpose)
  • Used/free memory
  • State of hardware devices such as GPUs
  • Current location and status of threads, and associated mutexes/semaphores/etc

Ideally, you want enough information to be able to discern the problem from it alone, but if that isn’t possible then at least enough to get a sense of the circumstances under which it happened, so you can hopefully repeat it in a more controlled environment.
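
The printf ring buffer mentioned in the list above is well worth the few minutes it takes to write. A minimal sketch – route your logging macro through this, then dump the buffer from the crash handler:

static char     g_log_ring[16 * 1024];
static unsigned g_log_head = 0;

void log_to_ring(const char* text)
{
    while (*text)
    {
        g_log_ring[g_log_head] = *text++;
        g_log_head = (g_log_head + 1) % sizeof(g_log_ring);
    }
    g_log_ring[g_log_head] = '\0';   // makes the wrap point easy to spot in a dump
}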

Message in a bottle

So, you have all the information you want to report… but the problem is that by the time you have it, the game has already crashed and in all probability “normal” methods of displaying data will be flakey at best. It’s time for a bit of lateral thinking.

First of all, start by enumerating all the output devices you have at your disposal. Usual candidates are:

  • Screen display
  • Audio output (often too fiddly to be really worthwhile)
  • Writable storage
  • Network connection
  • Debug-use LEDs
  • Not-debug-use LEDs that are nonetheless hijackable
  • Rumble pack (I’m sure someone somewhere has tried to use one for debug output)

The problem with output devices tends not to be a lack of them, but the complexity of using them. This can vary radically between hardware platforms, but as a general rule the simpler something is to access, the better – that means you need less code, and there’s less chance of something breaking unexpectedly. The best case is a high-bandwidth output (like the display) which can be controlled easily and directly from the CPU.

The key here is to think as dirty as possible. Don’t create a primitive data buffer to hand to your rendering pipeline and GPU, because the chances that something will go wrong along the way are very high. Instead just forcibly terminate any active DMA channels, shut down the GPU and write pixels directly into the framebuffer using memcpy() (assuming you have access to the framebuffer, of course). It’s ugly and stupid but it works (and you’d be surprised how much fun it is to write).
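
As a sketch of just how dumb this can afford to be – the framebuffer address, resolution and pixel format below are placeholders, since those details are entirely platform-specific:

#include <string.h>

typedef unsigned int u32;                             // assuming the usual 32-bit typedef

static u32* const FRAMEBUFFER   = (u32*)0x10000000;   // placeholder address
static const int  SCREEN_WIDTH  = 640;
static const int  SCREEN_HEIGHT = 480;

static void crash_clear_screen(u32 colour)
{
    static u32 line[SCREEN_WIDTH];              // static: the handler's stack may be tiny
    for (int x = 0; x < SCREEN_WIDTH; ++x)
        line[x] = colour;
    for (int y = 0; y < SCREEN_HEIGHT; ++y)     // blat the same line over every row
        memcpy(FRAMEBUFFER + y * SCREEN_WIDTH, line, sizeof(line));
}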

My favourite trick for screen output is to make a 3×5 pixel font and embed it in the executable – you can do this with a simple macro like this:

#define FONT_CHAR(char_idx, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o)            \
    ((char_idx << 16) | a | (b<<1) | (c<<2) | (d<<3) | (e<<4) | (f<<5) |            \
     (g<<6) | (h<<7) | (i<<8) | (j<<9) | (k<<10) | (l<<11) | (m<<12) | (n<<13) | (o<<14))

And then draw a 4-bytes-per-character programmer-art font directly into a CPP file like this:

u32 my_font[] =
{
    FONT_CHAR('A',
              0, 1, 0,
              1, 0, 1,
              1, 1, 1,
              1, 0, 1,
              1, 0, 1),
    …and so on…
};
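
And a rough sketch of getting those glyphs onto the screen, in the same “straight into the framebuffer” spirit as above – the framebuffer pointer and pitch are again placeholders, and u32 is assumed to be the usual 32-bit typedef:

// The top 16 bits of each packed entry hold the character index, the low 15
// bits the 3x5 bitmap (row-major, exactly as laid out by FONT_CHAR).
static u32 find_glyph(char c)
{
    const int count = sizeof(my_font) / sizeof(my_font[0]);
    for (int i = 0; i < count; ++i)
        if ((my_font[i] >> 16) == (u32)(unsigned char)c)
            return my_font[i];
    return 0;   // unknown character: draws as a blank
}

static void debug_draw_char(u32* framebuffer, int pitch_in_pixels,
                            int x, int y, char c, u32 colour)
{
    u32 glyph = find_glyph(c);
    for (int row = 0; row < 5; ++row)
        for (int col = 0; col < 3; ++col)
            if (glyph & (1u << (row * 3 + col)))
                framebuffer[(y + row) * pitch_in_pixels + (x + col)] = colour;
}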

For outputting to more limited devices such as LEDs, you can output several pieces of data by cycling between them (simply use loops with NOPs in them for timing), or you may be able to read pad input (or debug-use buttons) to manually page through the available data. If a filesystem or network device is available, then it may be possible to use that, although on some systems getting such IO to behave itself after a crash can be somewhat fraught (don’t forget to flush your buffers!).
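
For the crudest possible LED version, something like the sketch below will do – the register address and delay constants are invented, and on real hardware the LED is more likely to sit behind an SDK call than a raw register:

typedef unsigned int u32;                       // assuming the usual 32-bit typedef

static volatile u32* const DEBUG_LED = (volatile u32*)0x1F000000;   // placeholder

static void busy_wait(unsigned long loops)
{
    for (volatile unsigned long i = 0; i < loops; ++i)
        ;                                       // burn cycles; calibrate by eye
}

static void blink_value(u32 value)              // clock a 32-bit value out MSB-first
{
    for (int bit = 31; bit >= 0; --bit)
    {
        *DEBUG_LED = 1;                                         // LED on
        busy_wait(((value >> bit) & 1) ? 6000000 : 1500000);    // long flash = 1, short = 0
        *DEBUG_LED = 0;                                         // LED off
        busy_wait(3000000);                                     // gap between bits
    }
}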

It can often be a good strategy to output data in stages – get the things that are most important, and also least risky, out of the way first, and then work on the rest. For example, if you want to do a stack trace, get the current PC value and other data onto the screen or into a file first, and then do the stack walk. That way, if your code crashes for some reason, you at least have the first set of information to work from. If you’re doing external QA testing, then you may also want to consider a “condensed” crash screen – whilst the testers may have an easy way to get screenshots (or a digital camera), if they are transcribing the contents of every dump by hand they will quickly start to resent you for wanting the entire register set of all the SPUs…

And so, the next time your game falls flat on its face, it can leave a note identifying the culprit in bloody handwriting. Just don’t forget to take the code out again for your final build – first-party QA generally find processes that involve forcibly halting the entire system and putting the entire CPU to work drawing crayon-like words down the side of the screen less amusing than you might expect…