Following on from my previous post, The unexpected behaviour of debug builds, I thought I'd share some observations on debug performance, again focussing on the effect of the debugger as opposed to the debug build itself.
The Debug Heap (revisited)
This first example shows how the debug heap can really affect your performance. I used two functions for the test, one that allocates a buffer on the stack and another that allocates one on the heap. Each function writes a short string into the buffer using strcpy_s, and takes a template parameter allowing me to vary the size of the buffer:
```cpp
template <int TBufferSize>
void StackStringCopyTest()
{
    for (int i = 0; i < 100000; i++)
    {
        char buffer[TBufferSize];
        strcpy_s(buffer, "hello world");
    }
}

template <int TBufferSize>
void HeapStringCopyTest()
{
    for (int i = 0; i < 100000; i++)
    {
        char *buffer = new char[TBufferSize];
        strcpy_s(buffer, TBufferSize, "hello world");
        delete [] buffer;
    }
}
```
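A minimal harness along these lines is enough to drive the tests (this is a sketch rather than the harness I actually used; std::chrono just keeps the example self-contained, and any millisecond timer would do):

```cpp
#include <chrono>
#include <cstdio>

// Runs a test function and reports elapsed wall-clock time in milliseconds.
// Assumes the two test functions above are in scope.
template <typename TFunc>
void TimeIt(const char *name, TFunc test)
{
    auto start = std::chrono::steady_clock::now();
    test();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::printf("%s: %lld ms\n", name,
                static_cast<long long>(elapsed.count()));
}

int main()
{
    TimeIt("stack", StackStringCopyTest<50000>);
    TimeIt("heap",  HeapStringCopyTest<50000>);
    return 0;
}
```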
When using a 50,000 character buffer, the timings were:
| Build   | Running under debugger | Stack allocation (ms) | Heap allocation (ms) |
|---------|------------------------|-----------------------|----------------------|
| Debug   | Yes                    | 1,303                 | 16,343               |
| Debug   | No                     | 1,295                 | 5,023                |
| Release | Yes                    | 6                     | 11,624               |
| Release | No                     | 6                     | 146                  |
It comes as little surprise that debug is slower than release, and that running under the debugger is slower than running without it. However, what might surprise you is just how significant that difference can be. Because the debug C-runtime fills stack and heap memory on initialisation, the vast majority of the time is spent doing that rather than running the code we've actually written. In some cases the difference between debug and release is more than two orders of magnitude. That's a lot.
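As a quick way to see these fills for yourself, something like the following works (the exact byte values are Windows/MSVC implementation details, not guarantees):

```cpp
#include <cstdio>

int main()
{
    // Deliberately left uninitialised so we can peek at the fill pattern.
    // Reading uninitialised memory is undefined behaviour in general;
    // this is purely a diagnostic peek, not production code.
    unsigned char *buffer = new unsigned char[16];

    // Debug build: the debug C-runtime fills the block with 0xCD bytes.
    // Release build launched under the debugger: the OS debug heap fills
    // it with its own pattern (the infamous 0xBAADF00D) instead.
    std::printf("first heap bytes: %02X %02X %02X %02X\n",
                buffer[0], buffer[1], buffer[2], buffer[3]);

    delete [] buffer;
    return 0;
}
```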
Another surprise for me was the speed of the debug stack allocation compared to the heap allocation when not running under the debugger. Looking down the heap column, it would seem that filling the 50K buffer takes around 5 seconds. In the first row the fill happens three times (once by the C-runtime on initialisation, and twice by the debug heap: on allocation and again on free); in the second row it happens just once (the C-runtime); and in the third row it happens twice (the debug heap). Yet in the stack column the fill triggered by the C-runtime completes in just over one second. I'm afraid I haven't had time to investigate this fully, so it's left as an exercise for the reader.
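Back-of-envelope arithmetic makes the point: 5,023 ms ÷ 1 fill ≈ 5.0 s, 16,343 ms ÷ 3 fills ≈ 5.4 s, and 11,624 ms ÷ 2 fills ≈ 5.8 s, so each complete pass of filling (100,000 iterations × 50,000 bytes) costs roughly five seconds on the heap, while the equivalent single C-runtime fill on the stack takes only about 1.3 seconds.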
Some other strange behaviour came to light when I started adjusting the size of the buffer. I expected a fairly linear performance curve, at least until the buffer became small enough for the cost of the string copy to dominate. Although this held true for the stack allocation, the heap allocation didn't behave at all as I'd expected. Instead, performance stays roughly the same (actually getting faster as the block size increases) up to a threshold at which it suddenly jumps. Moreover, once the performance drops off, it stays bad, even for smaller block sizes. Here's a small selection of example values:
| Buffer size | Time (ms) |
|-------------|-----------|
| 2100        | 1056      |
| 2490        | 1007      |
| 2520        | 1009      |
| 2550        | 970       |
| 2570        | 2250      |
| 2600        | 2204      |
| 2800        | 2194      |
| 128         | 2357      |
| 2550        | 2201      |
As you can see, the execution time is around a second up to roughly 2.5K, decreasing slightly as we approach that limit. But as soon as the boundary is breached, the time jumps up and sticks there, even when we return to smaller block sizes. These figures are taken from a debug build running under the debugger; release behaved much the same, but performance without the debugger attached was much more in line with what I'd expected.
The only real conclusion I’ve drawn from this is to never ever profile code with the debug heap enabled, because it’s really weird.
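(If you do have to run under the debugger, Windows offers a documented escape hatch: setting the _NO_DEBUG_HEAP=1 environment variable before launching the process disables the debug heap entirely. I haven't re-run the numbers above with it, but it's worth knowing about.)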
Exceptions
A similar profiling test can be done with exceptions. I'd be very wary of using exceptions at all in performance-critical code, but for tools development I find them valuable. Having said that, I recently discovered that throwing exceptions excessively can be incredibly expensive when running under the debugger. Take this example:
```cpp
for (int i = 0; i < 10000; i++)
{
    try
    {
        char buffer[64];
        strcpy_s(buffer, "hello world");
        throw 1;
    }
    catch (...)
    {
    }
}
```
It just runs a loop 10,000 times, throwing and catching an exception each time round. The string copy is there to make sure the compiler doesn't get too smart and optimise the loop away. When running under the debugger the exception handling slows down dramatically: with the debugger attached, the time to execute this loop was 3,532 ms in debug and 3,075 ms in release, while without the debugger it was 35 and 34 ms respectively. I tried turning off exception messages in the Visual Studio debug output window, but that made no difference. I also experimented with attaching and detaching the debugger during execution, and unlike the debug heap behaviour, it's definitely the presence of the debugger at the time the exception is thrown that is the deciding factor, presumably because the debugger is notified of every first-chance exception as it occurs.
A 100x slowdown under the debugger may seem like an extreme example, but I stumbled across a case recently that was almost as bad. I'd written a Python script within our tool-chain to scan all objects in the asset database for ones with a particular property set to a particular value; my first attempt caused the script to throw an exception for every asset that didn't have that property. Under the debugger the script took a full 20 seconds to execute, but without the debugger attached it ran in under 500 ms, and that included taking the results, sorting them, and serving them up as a formatted HTML page. That's a 40x speed differential, even with plenty of real code executing.
In Conclusion
What I recommend based on these findings is simple: if you want to profile code and get a realistic result, make sure you aren't running under the debugger. That's it. You could spend a lot of time learning how and why the debug heap makes such a difference, and why exception handling under the Visual Studio debugger is so slow, but that won't ever help you get good profiling results.
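If you want a safety net, one option (a sketch using the Win32 IsDebuggerPresent() API; the warning policy is just an illustration) is to make your profiling builds shout when a debugger is attached, so skewed numbers never slip through unnoticed:

```cpp
#include <windows.h>
#include <cstdio>

// Call this at the start of any profiling run: if a debugger is attached,
// the timings will not be representative, so make that impossible to miss.
void WarnIfDebuggerAttached()
{
    if (IsDebuggerPresent())
    {
        std::printf("WARNING: debugger attached - profiling results "
                    "will not be representative!\n");
    }
}
```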