Everything’s fine, everything’s dandy, your game is humming along at
60fps, when suddenly (oh no!) it crashes! You attach a debugger but,
oh no, this isn’t right… this isn’t right at all! If you’re lucky
you might have some (albeit wrong) source code information, but more
than likely you’re lost in a sea of assembly.
Don’t panic! Any sudden movement will startle the game, forcing it to
attack. Well, not really, but before you abort and start up that debug
build, take this opportunity to poke around and see what you can find
out first.
Regardless of what platform you’re developing for, you’ll want to do
some reading about its ABI. The ABI (application binary interface)
describes how the stack is laid out, what individual registers are
used for, how parameters are passed to functions, etc. Many are
available free with a bit of digging; for example, Microsoft’s x86 ABI
can be found on MSDN, and the 64-bit PowerPC Linux ABI is also freely
available.
If you’re feeling rldicls, if you’re feeling that your lfsux, and just
want to packuswd and head somewhere warm, you should download the
assembly/architecture reference manuals for the processors you are
developing for.
For the big-ticket consoles you’ll want to check out IBM’s 64-bit
PowerPC programming environment manual.
Sony has some great documentation posted for CellBE development on
their public CellBE site.
For x86/x64, you’ll want part 1 and part 2 of Intel’s architecture
manuals.
ARM’s Infocenter website
has all you’ll need for ARM devices like the Nintendo DS and nearly
every smartphone under the sun.
I should briefly touch on one basic idea behind modern compilers. On
the journey from somewhat legible source code to head-scratching
assembly, the compiler breaks code up into segments called basic
blocks. A basic block is a contiguous run of instructions with exactly
one entry point and exactly one exit point. Many optimizations are
limited to basic blocks, and being able to figure out which assembly
blocks are associated with which sections of code is hugely valuable
when debugging.
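
To make the idea concrete, here is a small sketch of how a simple function might be carved into basic blocks. The function name and the block labels are made up for illustration, and a real compiler is free to merge, reorder, unroll or vectorise these blocks, so treat the comments as a rough map rather than what your compiler will actually emit.

```c
#include <stddef.h>

/* Hypothetical example: counts the positive entries in an array.
   The block labels in the comments are illustrative only. */
int count_positive(const int *values, size_t count)
{
    /* Block A: entry. Initialises locals, then falls into the loop test. */
    int positives = 0;
    size_t i = 0;

    /* Block B: loop test. Ends with a branch to block C or block F. */
    while (i < count) {
        /* Block C: load values[i] and compare it against zero.
           Ends with a branch to block D or block E. */
        if (values[i] > 0) {
            /* Block D: only reached when the value is positive. */
            ++positives;
        }
        /* Block E: bump the index, then jump back to block B. */
        ++i;
    }

    /* Block F: exit. Returns the result. */
    return positives;
}
```

When you are staring at the disassembly, each branch target and each instruction that follows a branch starts a new block, which is often enough to line the assembly back up with the `if` and `while` statements in the source.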
So, now that you have your crash in the debugger and your architecture
manuals by your side, let’s look at what actually went wrong. Most
good debuggers will indicate what caused this mess: if you’ve hit an
access violation/segmentation fault/htab miss, chances are you’re
reading from or writing to a bad memory address. If things just look
off, if your debugger has become lost and is displaying things that
look like nonsense, you might have branched to nowhere as a result of
a corrupted function pointer or vtable entry.
Now that we know what the problem is, let’s figure out how we came to
be in this mess in the first place. Perhaps our code spilled off the
end of an array and trampled over some adjacent data, perhaps some
header files changed and you forgot to rebuild the accompanying
library (not like I’m bitter…). If you have a stomp-detecting
allocator, now would be a good time to turn it on. Virtual memory
access protection can be a huge help: setting blocks of memory to
read-only or write-only, depending on their intended use, can
help flush out errant accesses when they happen rather than millions
of instructions later.
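
If you don’t have a stomp-detecting allocator handy, the core trick can be sketched in a few lines. This is a minimal sketch that assumes a POSIX system with `mmap` and `mprotect` (on Windows the rough equivalents are `VirtualAlloc` and `VirtualProtect`). The names `guarded_alloc` and `guarded_free` are hypothetical, not part of any library.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/* Round 'size' up to a whole number of pages. */
static size_t round_up_to_page(size_t size, size_t page)
{
    return (size + page - 1) / page * page;
}

/* Hypothetical helper: returns 'size' usable bytes that end directly
   against an inaccessible guard page, or NULL on failure.
   Caveat: the returned pointer is only as aligned as 'size' allows. */
void *guarded_alloc(size_t size)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0 || size == 0) {
        return NULL;
    }
    size_t page = (size_t)page_size;
    size_t data_bytes = round_up_to_page(size, page);

    /* Reserve the data pages plus one extra page for the guard. */
    uint8_t *base = mmap(NULL, data_bytes + page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }
    /* Revoke all access to the last page so any touch faults at once. */
    if (mprotect(base + data_bytes, page, PROT_NONE) != 0) {
        munmap(base, data_bytes + page);
        return NULL;
    }
    /* Push the buffer to the end so buffer[size] lands on the guard. */
    return base + data_bytes - size;
}

/* Releases a buffer from guarded_alloc; 'size' must match the request. */
void guarded_free(void *ptr, size_t size)
{
    assert(ptr != NULL && size > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_bytes = round_up_to_page(size, page);
    uint8_t *base = (uint8_t *)ptr + size - data_bytes;
    munmap(base, data_bytes + page);
}
```

With a buffer allocated this way, writing one byte past the end faults on the offending instruction instead of quietly trampling a neighbour. It burns at least two pages per allocation, so it is something to switch on for the suspect allocations only.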
If it was an access violation, look to see what address it was
attempting to load from or store to. Is it zero (or near zero, in the
case of accessing fields of a structure via a null pointer)? Different
address ranges often have their own purposes: did it happen perhaps
near stack memory? Heap memory? Was the function attempting an
operation requiring cacheable memory on non-cached memory, like atomic
updates of video memory? Sometimes a floating point number will end up
in the wrong place and you’ll see values like 0x3f800000 in a general
purpose register.
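
Two of those checks are easy to sketch in code. The helper names below are hypothetical, and the null-pointer test is only a heuristic: the threshold you pass in is an assumption about how large your structures get, not something the platform defines.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "this sketch assumes a 32-bit float");

/* Reinterpret the raw bits from a general purpose register as a
   single-precision float, so 0x3f800000 reads as 1.0f. */
float bits_to_float(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value); /* well-defined type pun */
    return value;
}

/* Hypothetical heuristic: a fault address just above zero is usually a
   null base pointer plus a field offset. 'max_offset' is a guess at the
   largest structure you expect, e.g. 0x1000. */
bool looks_like_null_field_access(uintptr_t fault_address,
                                  uintptr_t max_offset)
{
    return fault_address < max_offset;
}
```

If a register holds 0x3f800000, 0x40000000 or 0xbf800000 (1.0, 2.0 and -1.0 respectively), it is worth asking why a float ended up somewhere an address or an integer was expected.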
Did the crash happen in the prologue of a function? If so, check the
ABI to find out how parameters are passed, look at them, and see if
they make sense. Does that first parameter really look like a pointer
to the structure the function was expecting? If you view the data in a
watch window does it look right? In some cases I’ve had GPUs run all
over my data, writing values that were nonsensical yet followed a
pattern.
Did the crash happen in the middle of a loop? Look for registers that
are used as loop counters. Sometimes these are a bit hard to spot
because of the reordering that happens during optimization, but a good
way to identify them is to look for the code that checks loop end
conditions. Usually a compare and conditional branch pair will point
to which registers are used for indexing.
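
One reason the counter can be hard to spot is that the optimizer may have replaced it entirely. The sketch below shows the same loop twice: once as written, and once in roughly the shape an optimizing compiler commonly produces, where the index is gone and the end-of-loop compare is between two pointers. Both function names are made up for illustration, and what your compiler actually does will vary.

```c
#include <stddef.h>

/* As written in the source: 'i' counts from 0 up to 'count'. */
float sum_indexed(const float *values, size_t count)
{
    float total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

/* Roughly what the optimized loop may look like: no index at all.
   The compare and branch pair tests 'cursor' against 'end'. */
float sum_pointer_walk(const float *values, size_t count)
{
    float total = 0.0f;
    const float *cursor = values;      /* register that advances */
    const float *end = values + count; /* register that stays fixed */
    while (cursor != end) {
        total += *cursor;
        ++cursor;
    }
    return total;
}
```

If you find yourself in the second form, you can recover the original index by subtracting the base address from the advancing register and dividing by the element size, which is what `cursor - values` does in C.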
If you crawl back up the call stack you’ll find that your variable
watch window has become mostly useless. When optimizing, compilers try
to push values into registers and keep them there as long as possible,
while in many cases variable watch information is tied to a stack slot.
There’s a good chance that the value you’re looking for exists in one
of the callee-saved/non-volatile registers, a group of registers that
the ABI says can only be modified by a downstream function if they are
first saved to the stack at known offsets. Because the save slots are
specified by the ABI, the debugger can determine their values precisely
as you move from stack frame to stack frame. The x86 platform throws a
wrench into the works because of its register envy (it has only a
handful of general purpose registers), so it uses the stack quite
often even in optimized builds.
Debugging optimized code is a bit of an art form and will require
patience, but with the right tools and knowledge it becomes much
easier. If you have any tips that you’d like to add, or if you’d like
me to expound on anything I’ve touched on here, please chime in below!