From the Codeface, a Bug Fixing Story

It was March, 2009, and I was working on a high profile multi platform psychological horror game. We were “in QA”, Milestone 12, and my code was holding up well. I had been at 0-bugs for most of the week, and had spent most of my time testing, and helping whoever I could with debugging whatever problems were bouncing around the team.

One particularly tricky bug from that time stuck in my memory.

It was Wednesday when I got the email from the project’s technical lead.

Our memory profiling system had been spitting out a warning on one of our target platforms that indicated a memory trample. Both the technical lead, and the programmer who implemented the system were snowed under, so I was asked to take a look and find out what the problem was.

Now, at this stage there were two possibilities: either (a) there was a memory trample, but for whatever reason we could only see it on one platform, or (b) the memory profiler code is not working correctly on that platform.

My first step was to put a breakpoint on the line of code that spits out the warnings and run the code in a debugger on the target platform.

Execution stops, and I examine the code in some more detail. The code that fires the warning works as follows: When a pointer to memory is about to be freed it is passed to the memory profiler, which calls a “GetMemSize” function to determine the size of the allocation, then fetches a value from the last 4 bytes of the block and checks if it is the expected value. If the value read isn’t as expected a warning is printed.

The callstack does not make me happy. The code freeing the memory is my code – it’s my generic array class (like std::vector, but awesome). This is code I’ve brought with me and has been in production use on shipping projects for years. It is unlikely, in my opinion, that this code is broken, but not impossible – so I keep looking.

I look at the code using the array – is it doing anything untoward? The array class has assertions on the element access to trap bounds errors, but the iterators do not. The code using the array class does not (that I can see) use iterators. It uses: push_back, pop_back and operator[]. All calls with debug guards to trap misuse.

So I take a step back. The profiling code itself is quite new, so I look at it in more detail. I check where the trapped value is supposed to be set, and find something that might be a lead – a complex nest of preprocessor conditional compilation.

Our memory manager is layered – Micro-Allocator, Turbo-Allocator and then the system (vendor/OS) allocator. The memory profiler is currently turned off for allocations that go via the micro and turbo allocators, so maybe there is a mistake in the way the #ifdefs have been structured that means the value is being checked, but not set up at all.

This seems like a good lead at first, and I follow it for a while, but it turns out that the #ifdefs are all set up correctly. The allocation in question is going via the system allocator, and the profiler is inserting it’s pad value at the end correctly.

Back to square one.

The problem is happening at start up, which is deterministic, and the OS uses a fixed memory model that means allocations always end up at the exact same address run after run. Time for some hardware breakpoints.

I run the game again, find out the address of the value the system is complaining about, and set up a breakpoint-on-write, then restart the run.

I filter my way through the breakpoints hits, checking if what is writing there should be, and find nothing. Nothing relevant at all. Nothing writes to that byte. It is unused. The value is not touched.

What on earth?

So I backtrack. Again. Double check my numbers. Did I have the right address? Yes. Break point the set up function. Breakpoint the creation of the array. Single-step though lots of code. The allocator is setting up the boundary value – but not in the same place as it is being fetched from!

So, a moment of clarity – it isn’t the value at the end of the array getting stomped, but the address the code is getting for the end of the array that is wrong!

Perhaps it’s the value at the start of the array getting stamped?

New hardware breakpoint, at the start of the array this time. Run. Watch were we stop. The breakpoint hits in system malloc and system free. It makes sense. The callstack looks right. Hits in malloc, then free, malloc, free, malloc, free.

Nothing that shouldn’t touch that memory was touching it. I see a bunch of mallocs and frees and then the system falls over.

What on earth is going on?

And then I notice it. The system that is using the array is an XML parser. There isn’t one array – there are loads of them.

This is the mind boggling part: The moment the call to free stamps on the size value of array ‘A’ was not the moment that the memory owned by ‘A’ was freed. It was when some other array was freeing memory.

So: We have a two blocks of memory allocated by the operating system. At the start of each one, we have a 4-byte int for the total size of the block. At the end of each one we have the guard value. We know the location of the guard value by checking the size of the memory block, but when we free ‘B’ the size of ‘A’ changes.

Nice.

Lets have a look at GetMemSize. For the platfrom in question GetMemSize(ptr) returns ptr[-1]-5.

Note the vendor in question does not provide a memstat/memsize function so this is the result of deduction and reverse engineering.

And it’s wrong. The version of the function for another platform by the same vendor has a hack of a work around that does:

1
 
  
if (memSize % 2 != 0) memSize = memSize + 1;

Oh Hello.

So, it seems like the system memory manager uses those first 4 bytes for more than just the size. After all, the memory allocations are all 4 byte aligned, so those first 2 bits aren’t really needed, and it seems like they are used for something else, perhaps to do with the availablity of neighboring blocks for use with realloc?

Who knows.

But the fix now is clear, mask off the bottom bit, and we are good.

1
 
  
UInt32 memSize = (Ptr[-1]-4)&~1;

Perhaps I should’ve masked off the bottom 2 bits?

Who knows, it all seemed to work after that. And the game shipped in December without major bugs.

This story is a rewrite of one I originally posted here: