Oftentimes a compiler bug is declared the cause of a seemingly impossible outcome in the code. This is usually followed a few minutes later by the realization of the real issue, and subsequently fixing it. This happens more often to novice programmers, who may be unaware of memory issues and (for C++ at least) the subtle language trickery that can occur unintentionally.
Sometimes though, just sometimes, it really *is* the compiler. If you are working in certain environments you may be lucky enough to deal with various broken and untested compilers, virtual machines and/or languages. Issues in this sort of environment require a different kind of approach from regular debugging.
In an environment where you cannot trust the compiler, normal logical debugging thinking doesn’t quite apply the same way. You need to question whether a given piece of code is actually doing what you think it is. If you have a moderately large codebase running problem-free then you can assume most ‘core’ operations are generating correct code, but just because something worked yesterday you cannot assume that a change elsewhere will not shift the alignment/size/phase of the moon and break it all.
So what sort of things should we be wary of?
Unused variables and arguments are a classic trap. While logically these have no effect, the compiler may strip them in one section of the code yet leave them in the offset calculations, causing, for example, a variable access to be incorrect. If you really need to keep them around, try assigning them dummy values to keep them ‘used’. Be particularly wary of this in shader code, especially with values bound to input/output semantics.
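A minimal sketch of the ‘keep it used’ trick in C++ (the struct and function names here are hypothetical, purely for illustration). The idea is to touch the logically-unused value in a way the optimizer is not allowed to discard:

```cpp
#include <cassert>

// Hypothetical layout: 'pad' is logically unused. A buggy compiler might
// strip it in one place but keep it in offset calculations elsewhere,
// shifting every member after it.
struct Particle {
    float x, y, z;
    float pad;      // logically unused
    float radius;
};

// Dummy use that the compiler cannot treat as dead: a write to a volatile
// local forces a real store, so 'v' stays live.
inline void keep_used(float& v) {
    volatile float sink = v;
    (void)sink;
}

float read_radius(Particle& p) {
    keep_used(p.pad);   // keeps 'pad' referenced
    return p.radius;
}
```

In shader code the equivalent is usually a dummy arithmetic use (e.g. adding the value multiplied by zero into a live output), since volatile is not available there.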
Redundant operations are another: things like self-assignment, or calculations that are overwritten afterward. While again this should logically be irrelevant, usual logic does not apply in the land of broken compilers. Change things around, observe the effects. There can be bugs in register allocation logic, or in the logic deciding whether to recalculate a value or keep it on the stack.
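For illustration, a hypothetical ‘redundant’ function of this shape, and the rearranged version you might swap in to see whether the generated code (and the bug) moves:

```cpp
#include <cassert>

// Logically this computes x * 4, but it contains a dead store and a
// self-assignment -- exactly the kind of 'no-op' code that can tickle
// register-allocation or stack-caching bugs in a broken compiler.
float scale_v1(float x) {
    float t = x * 2.0f;   // calculated...
    t = x * 4.0f;         // ...then immediately overwritten
    x = x;                // self-assignment, logically a no-op
    return t;
}

// The 'changed around' version with the redundancy removed. On a healthy
// toolchain both are equivalent; on a broken one, comparing the two builds
// can tell you a lot.
float scale_v2(float x) {
    return x * 4.0f;
}
```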
Compilers will operate very differently (or simply be broken) at different optimization levels. Change the optimization level and check your problem again. If it simply vanishes, this can be a good indicator of a compiler issue (but not always!). While the typical assumption is that non-optimized builds work, this may not always be the case. Compile with -O1 or -O2 and try again!
Less common operations or type conversions should raise suspicion: things like (int % float), excessive type conversions in complex expressions, and so on.
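One way to handle a suspicious mixed-type expression is to break it into explicit, simply-typed steps so every conversion is visible (and printable). A hedged sketch, with a hypothetical `wrap` function: in C++ proper, `%` does not even apply to floats, so the shader-style `i % f` becomes an explicit `fmod`:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical rewrite of a shader-style 'index % period' expression.
// Each conversion is spelled out so a broken compiler has fewer exotic
// paths to get wrong, and so each intermediate can be inspected or logged.
float wrap(int index, float period) {
    float fi = static_cast<float>(index);   // explicit int -> float
    return std::fmod(fi, period);           // explicit floating-point modulus
}
```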
These issues can span native code, shader code, virtual machines and scripting languages. Each of these environments will have a different set of likely problem areas. Consider the stability of each particular compiler and how correct you expect it to operate. Print everything, check everything, and assume little. Usual application of pure program logic does not apply – focus more on empirical testing and evaluating what /actually happens/ when code changes occur rather than what /should/ happen. Be aware that logging may easily alter the nearby generated code.
These are some great suggestions from Ben Carter after reading the first draft of this article. I really can’t improve on these, so I will post them here verbatim.
- Check the assembler – as long as you don’t get into the realms of “is this a hardware bug”, then investigating the compiler output can tell you a lot about what it is thinking.
- Make sure variables are where you think they are – a lot of “popular” compiler bugs revolve around register allocation or memory layout going wrong, and so ending up with the members of a structure out-by-one or two variables “sharing” a register isn’t uncommon.
- Watch out for control flow – similarly, it’s not uncommon for compilers to cock up control flow, especially “optimising” things in and out of loops or conditions such that they no longer work properly. On $PLATFORM we once had a bug which was down to the compiler taking “if (blah) a = *blah;” and optimising the *blah dereference to happen before the if()…
- The linker is only human too – lots of address calculations and suchlike take place at link time, and if that goes wrong then you can be in for a world of hurt. Multiple copies of global variables, globals sharing the same memory, alignment errors, or simply “getting the address wrong” are all things I’ve seen go wrong there. (a certain compiler which shall remain nameless once had a charming feature whereby if you had a struct that was >64K in size, the calculation of member addresses after the 64K boundary would get progressively more wrong the further it got…)
- If you’re dealing with multithreaded code (or indeed interrupts), be very, very, very, very suspicious about optimisations around mutexes and other synchronisation primitives – it is not uncommon to need to explicitly tell the compiler not to perform certain actions (like rescheduling instructions across these boundaries), especially if you are rolling your own code for this.
- Aliasing. This is an utter minefield for programmers even when the compiler is doing everything 100% by-the-book, but it turns into a total disaster zone if there are bugs around too. Try -fno-strict-aliasing (or your compiler’s equivalent) and see if your problem goes away.
- Debug information. Don’t forget that if the compiler has made a mistake with structure layout or object addressing, that same error has most likely propagated to the debug information too. Sometimes that makes it easier to spot, sometimes it means that the only way to see the problem is to do the math by hand.
- Doubt the debugger – a close relative of the point above, don’t forget that the debugger may well be lying to you. Incorrect values for variables are sufficiently common that most people keep an eye out for them, but the debugger can be the cause of more subtle issues as well. One of the hardest bugs I’ve ever had to track down was a gameplay problem that was for some reason only reproducible on a single programmer’s machine – after hours of tracing through the code, we finally narrowed it down to a single load-constant instruction which was quite clearly disassembled as “load 5 into EAX” but single-stepping over it produced a completely different value in the register view. After a brief phase of wondering if we’d found an incredibly bizarre hardware bug, we realised that the debugger had somehow managed to get a stray breakpoint set… on an address half-way through the instruction. So every time we hit “run” the debugger would patch an “int 3” in, overwriting the immediate constant, and when we stopped it would take it out again, meaning that in the disassembly nothing looked wrong at all. *headdesk*
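To make the control-flow point above concrete, here is a sketch of the `if (blah) a = *blah;` pattern, with a defensive rewrite (the function names are hypothetical). Routing the guarded read through a pointer-to-volatile forbids the optimizer from speculating the load above the null check:

```cpp
#include <cassert>

// The dangerous shape: a broken optimiser hoisted the '*blah' dereference
// above the null check, turning a guarded read into a crash on null.
// One defensive rewrite makes the load an access the compiler must
// perform exactly where written:
int read_guarded(const int* blah, int fallback) {
    int a = fallback;
    if (blah != nullptr) {
        // volatile access: may not be moved or speculated before the check
        a = *static_cast<const volatile int*>(blah);
    }
    return a;
}
```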
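On the aliasing point, a minimal sketch of the classic minefield and its well-defined rewrite (the `float_bits` name is illustrative). Reading a float’s bits through a `uint32_t*` violates strict aliasing; copying the bytes with `memcpy` is the form every compiler must honour:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The minefield version would be:  bits = *(uint32_t*)&f;
// which violates strict aliasing and invites the optimiser (buggy or not)
// to reorder or drop the accesses. The memcpy form below is well-defined.
uint32_t float_bits(float f) {
    uint32_t bits;
    static_assert(sizeof(bits) == sizeof(f), "size mismatch");
    std::memcpy(&bits, &f, sizeof(bits));   // byte copy of the actual object
    return bits;
}
```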
Ideally you want to find a reproduction case, and then do further testing to isolate compiler switches, optimization levels, etc. Anything you check should be re-tested as you proceed – any changes you make may have invalidated previous assumptions. Isolate the problem area, check the assembly, and be certain there is a compiler bug. Try to reproduce in an empty test environment. Send an angry mail^W^W bug report to your compiler vendor.
While most development doesn’t operate in such a wasteland of compiler wreckage, these things are all real possibilities and can happen. You can even have a whole load of them happen in the space of one week and make you late for an altdev blog post.
Till next time!