While doing some research at work into why Firefox sometimes takes a long amount of time to shut down, I discovered that builds of Firefox on my work machine could not even shut down successfully. Every time I tried to shut down, shortly into the shutdown process, I’d get a crash, with a message about heap corruption. At first, I assumed I was dealing with breakage that someone had checked into trunk. When dealing with a large enough software project, the odds are good that pulling down the latest code from trunk will bring you a few bugs to discover. Very often, you can just roll back a few revisions, or find the right person and determine that they’ve already got a fix.
So, I tried those things. I rolled back a few revisions, rebuilt, and it still crashed. I tried rolling back four weeks to a known-good revision, and it still crashed. I went through all the commits over that time period to try and identify any obvious suspects, but didn’t find anything. I tried running the browser with a new, empty profile, on the assumption that perhaps one of my add-ons was causing the problem. Still crashed.
At this point, I could have thrown out my clone of the mozilla source code, along with my build folders, in the hopes that starting over from scratch would fix the problem. I’ve done this before, and sometimes it does make problems go away. But in practice, if everyone does this every time they hit a problem, insidious bugs can creep their way into your build system (or worse, your code base), and when you have a few hundred developers hacking on one project, the time wasted by using that solution adds up. So I decided to try and figure out why FF was crashing on shutdown.
The first thing I needed to figure out was whether the heap was actually corrupted. I started up Firefox inside the Visual Studio debugger and configured it to break on error, so that I could inspect the state of the application. At the point of the crash, here’s what the state looked like, along with some other useful information:
The most immediately obvious piece of information is that the ‘crash’ in this case is actually a breakpoint, triggered by ntdll.dll!_RtlValidateHeap. Furthermore, if we walk up the stack a bit, we can see that this is from msvcr100d.dll!_free_dbg. This tells us that we’re using the debug heap provided by the Visual C++ Debug Runtime, and the validations it performs when freeing a block of memory have failed. This can happen for various reasons, but if it happens, it almost always means something is wrong.
Next, we need to figure out why this is failing. There are a few obvious candidates, so let’s try to rule them out.
Memory corruption would be a likely cause. Corrupting memory that is infrequently used will probably not cause an application to fail, but when you try to free the memory, if the bookkeeping data used by the heap was also corrupted, the free call will fail. To rule that out, I walked up the stack to find free()’s caller and figure out what kind of data I was dealing with:
free(lib->name); lib->name = NULL;
Based on this, we can guess that the data is a name (and if you read more code, it’s clear that it is in fact a single-byte-per-character null terminated string). Finding the pointer being freed within the crashed process and looking at it in the debugger’s Memory window shows a well-formed string (see the Memory 1 and Locals panes in the screenshot above).
So, casual inspection suggests that this is not memory corruption, but it’s still possible. What if the library in question is corrupting memory near the string, and that’s why free is failing when we unload the library?
To rule that out, I climbed further up the stack and discovered that there’s an environment variable you can set that prevents the library from being unloaded. Interestingly enough, setting this variable causes the crash to go away… but one of Firefox’s child processes, plugin-container, then crashes in the same way, unloading a different library. This suggests that it’s at least not specific to one single library.
At this point, I decided I wanted to know where lib->name was coming from. I tried a couple quick searches with DXR (for ->name), but the first didn’t provide any results, and the second provided a bunch of results.
So, instead, I started Firefox up under Heap Profiler, and then attached a debugger and shut it down normally. I followed the above steps to get the address of the pointer being freed, and then opened up the Heap view of the latest snapshot I had taken in Heap Profiler. It was at this point I realized that I had completely failed to anticipate this use case when I was working on Heap Profiler, so I had to manually scroll through the heap, looking for the address I wanted. Whoops. Luckily, the address was near the beginning of the heap, so I found it easily:
Now, given the stack, I knew exactly who was responsible for allocating the string. Closer inspection even revealed why I was getting a crash: The string was being allocated by msvcr100!malloc, but freed by msvcr100d!free. Wait, how is that even possible?
I set breakpoints on the respective calls and started Firefox under the debugger, to ensure that I wasn’t being tricked by DLL load/unload operations that were occurring between the malloc and free calls. After seeing that the data from Heap Profiler matched reality, I decided to do some more digging. I opened up the Modules window to confirm that yes, both the debug and release C++ runtimes were loaded, and used the watch window to find the addresses of free and strdup, respectively (as strdup was the function responsible for calling the release malloc, in this case).
As you can see in the highlighted sections, __imp__free, the import symbol for free, points into MSVCR100D, while __imp___strdup, the import symbol for _strdup, points into MSVCR100. This is pretty confusing, and definitely confirms that the crash we’re seeing is caused by mismatched heaps.
At this point, there are two major questions:
- How are we loading both versions of the runtime in a debug build?
- And, how is this causing strdup and free to get imported from different runtimes?
This is especially troubling, because my local build of Firefox seems to run perfectly, other than this shutdown bug. You would expect mismatched heaps to cause all sorts of trouble. I did some digging around, and arrived at part of the answer to both questions:
This screenshot shows that nspr4.dll is linked against both runtimes, but only imports _putenv and _strdup from the release runtime. Everything else comes from the debug runtime! This explains why the heap mismatch is limited to the one crash I’m seeing, instead of causing widespread problems with Firefox – the effects of this linking problem are (theoretically) limited to code within NSPR, and further limited to code that uses _strdup or _putenv. Very mysterious.
After confirming that NSPR4.dll didn’t link against both libraries when compiled on my home machine, I set out to try and get verbose output from the linker, to identify just why I was getting linked against both runtime libraries. And this is where I made a mistake.
Earlier in the debugging process, I had convinced myself that the problem was reproducible, because checking out various revisions of the Firefox source code and doing full builds had caused the problem to persist. In order to get the linker to actually re-link nspr4.dll from scratch, I decided that I had to delete the contents of my build folder – I couldn’t get it to perform the link operation by simply deleting the output files (probably because I was deleting the wrong ones). So, I deleted my build folder, and performed a full build from scratch. Pouring over the log files for the build, I found the link command in question:
c:\mozilla\nsprpub\config\rules.mk:330:0$ rm -f nspr4.dll c:\mozilla\nsprpub\config\rules.mk:345:0$ link -nologo -DLL -SUBSYSTEM:WINDOWS -DYNAMICBASE -OUT:"nspr4.dll" -DEBUG -MAP advapi32.lib wsock32.lib winmm.lib ./prvrsion.obj io/./prfdcach.obj io/./prmwait.obj io/./prmapopt.obj io/./priometh.obj io/./pripv6.obj io/./prlayer.obj io/./prlog.obj io/./prmmap.obj io/./prpolevt.obj io/./prprf.obj io/./prscanf.obj io/./prstdio.obj threads/./prcmon.obj threads/./prrwlock.obj threads/./prtpd.obj linking/./prlink.obj malloc/./prmalloc.obj malloc/./prmem.obj md/./prosdep.obj memory/./prshm.obj memory/./prshma.obj memory/./prseg.obj misc/./pralarm.obj misc/./pratom.obj misc/./prcountr.obj misc/./prdtoa.obj misc/./prenv.obj misc/./prerr.obj misc/./prerror.obj misc/./prerrortable.obj misc/./prinit.obj misc/./prinrval.obj misc/./pripc.obj misc/./prlog2.obj misc/./prlong.obj misc/./prnetdb.obj misc/./praton.obj misc/./prolock.obj misc/./prrng.obj misc/./prsystem.obj misc/./prthinfo.obj misc/./prtpool.obj misc/./prtrace.obj misc/./prtime.obj io/./prdir.obj io/./prfile.obj io/./prio.obj io/./prsocket.obj misc/./pripcsem.obj threads/./prcthr.obj threads/./prdump.obj threads/./prmon.obj threads/./prsem.obj threads/combined/./prucpu.obj threads/combined/./prucv.obj threads/combined/./prulock.obj threads/combined/./prustack.obj threads/combined/./pruthr.obj md/windows/./ntmisc.obj md/windows/./ntsec.obj md/windows/./ntsem.obj md/windows/./ntinrval.obj md/windows/./ntgc.obj md/windows/./w95thred.obj md/windows/./w95io.obj md/windows/./w95cv.obj md/windows/./w95sock.obj md/windows/./win32_errors.obj md/windows/./w32ipcsem.obj md/windows/./w32poll.obj md/windows/./w32rng.obj md/windows/./w32shm.obj md/windows/./w95dllmain.obj ./nspr.res
I dug around further in the log files to locate the nspr4.dll in question, deleted it, and then ran the link command with the additional -VERBOSE flag to get verbose linker output. Pouring through the verbose output from the linker, I located the two functions in question:
Found __imp__free Referenced in prlog.obj Referenced in prlink.obj Referenced in prmem.obj Referenced in prdtoa.obj Loaded MSVCRTD.lib(MSVCR100D.dll)
Found __imp__strdup Referenced in prlog.obj Referenced in prlink.obj Loaded OLDNAMES.lib(strdup.obi)
Found __imp___strdup Referenced in OLDNAMES.lib(strdup.obi) Loaded MSVCRTD.lib(MSVCR100D.dll)
It was at this point that I realized my mistake: NSPR4.dll was now linking correctly. Somehow, my previous attempts at building older revisions of the code had not fixed the build, but entirely wiping out my build folder had fixed it. Now I felt like a moron.
On the other hand, the mystery grew slightly more mysterious: Somehow, strdup was different from free – the verbose linker output, even from this valid build, was different for the two! strdup, for some reason, was defined in OLDNAMES.lib, while free was defined in MSVCRTD.lib. Why the difference?
A quick search of MSDN reveals the truth: The POSIX strdup function is deprecated beginning in Visual C++ 2005, and you should use the ISO C++ conformant _strdup function instead. strdup is an alias to _strdup! This means that the way the linker treats it is different from the way the linker treats free.
Armed with this knowledge, I stumbled across a discussion forum post where someone encountered the same problem, by accidentally linking against both runtimes, and that post linked to a bug report on Microsoft’s Connect website. The bug report not only confirms that the crash in question is caused by having both heaps loaded, but describes the behavior as by design, and points out that strdup and free are being loaded from different runtimes.
And now, at the end of this long, frustrating journey, I have acquired many valuable things:
- A further distrust for huge build systems, and linkers in particular
- The knowledge that loading both debug and release C++ runtimes is definitely not a good idea
- The knowledge that aliases and exported functions in libraries are treated differently by the linker
- A Heap Profiler to-do item: Add support for searching the heap by address
- And most importantly: Don’t delete your build folder while debugging, even if you think you can reproduce the problem!
(As an aside, while debugging this problem I also discovered that debug builds of Firefox crash on my personal machine but not my work machine – but the crash only occurs if I run firefox.exe from within a bash shell. I kind of hate computers.)
I hope this short tale of woe has given you some new ideas about how to debug the next mysterious crash you encounter. If it hasn’t, I hope it at least entertained you!