It’s 10:30am, you’ve just arrived at work and your nightly build is telling you all is well; you’re ready to start the day’s tasks. Just as you sit down, Bug #497564 comes in. It’s urgent, and it’s new as of the 9:52am build, so you have to get latest (groan). You start to wonder what changes went in since your 4am build started. Did that crazy guy who’s always in at 5am do something radical like switching out the entire math system again? You hover the mouse over the <sync> button and wonder if you can diagnose the issue without syncing. Dammit, no. You click sync and fail to hold back a gasp of horror as you see <THAT FILE> go flying by on the source control TTY, the one everyone winces at because it means a 30-minute build time MINIMUM. Time to go get coffee while muttering “why does that damned build take so long?”
It has always surprised me that many programmers won’t delve into why certain processes take so long, yet we’ll spend weeks or months optimizing the code itself to extract that last 1ms (guilty as charged). What became obvious once I talked to the few who had tried, however, was that the tools available to aid this task are somewhat dire and often struggle with the scale of our projects. So I delved deeper, and this is what I found.
We have 51 targets (libraries, EXEs, DLLs) spread across 11,000+ files; individual files run to 100,000+ lines, and individual targets take anywhere from 5-10 minutes upward to build. A full clean build of all the targets we’re interested in can hit ~3 hours on a single machine.
These aren’t the droids you’re looking for
The assumption from 9 out of 10 programmers I talked to was that file IO is the root cause of a slow build. Ergo, disabling virus checking, getting a faster hard drive, switching to an SSD or using a RAM drive must be the way to speed it up. This sounded fantastically sensible, so I investigated speeding up these devices. I chose our mildest target, which showed a build time of around 7 minutes ±5 seconds. My laptop already has both an SSD and a SATA drive installed, as well as 12GB of RAM, so I spent some time running tests… I won’t bore you with the details, but needless to say I confirmed in each case that the circumstances were well set up, then spent a full day running builds of various forms in various locations: moving around temporary files, object files and source locations, splitting temp vs target across drives, disabling virus checking… it was a very busy day for my poor laptop. Sadly the net effect of all these changes was… ZERO. Build time remained steady at 7m ±10s for ALL but one test: RAM drive for source on a fresh reboot vs SSD for source on a fresh reboot. This is an obvious case, as the RAM drive offsets the Windows file cache, and even then the difference was nominal (7m30s for the SSD, 6m59s for the RAM drive).
It was at this point I realized I’d made the cardinal mistake of optimization: find the problem before attempting a solution.
Open your mind
At this year’s Gamefest, Bruce Dawson gave a great talk on a tool I’ve attempted (and failed) to use previously: XPerf. Thankfully the first five minutes were spent lamenting just HOW difficult the tool is to use when you are not part of Microsoft and can’t simply email the dudes who would know the correct settings. His talk, “How Valve Makes Games Better Using XPerf”, recounted what he did, how he did it, and ultimately what he thought other studios should know about the tool and how we can improve upon it. The net gain was a much greater understanding of the tool AND BATCH FILES!!! Magical batch files that set up all the internal components required by XPerf in order to provide decent information.
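To give a flavour of what such a batch file boils down to, here’s my own minimal sketch (not Bruce’s actual scripts; the kernel flags are real xperf flags, but the exact set and the buffer settings his files use will differ):

```
@rem Start the kernel logger: process/module load, sampled CPU profile,
@rem context switches and file IO, with call stacks on profile samples.
xperf -on PROC_THREAD+LOADER+PROFILE+CSWITCH+FILE_IO+FILE_IO_INIT -stackwalk Profile

@rem ... kick off the build here, e.g. a full rebuild of the solution ...

@rem Stop the logger and merge the result into a single trace file.
xperf -d build_trace.etl

@rem Open the trace in the viewer to dig into file IO and wait times.
xperf build_trace.etl
```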
I used this sorcery to profile our code build; 10 minutes later I had 13GB of data and my mind was open. File IO is the problem, but not in the assumed way. Yes, there is a lot of reading and writing of data, but looking at the profile info it seems that most time is taken up WAITING for a write while another write is in progress. Most notably (~90% of cases) this was with the mspdbsrv process, which seems to be servicing writes from many locations to the same file. My theory is that the more concurrent compiles there are (targets build concurrently at the project level in VS2010), the more the write requests to the PDB back up, until eventually the parent process has to stall, which stalls the entire pipeline. In my test the mspdbsrv process occupied approximately 40% of the CPU time the build used, with ~64 CLs in flight.
Now, I’m obviously a novice with XPerf, so I could easily be reading the information incorrectly.
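For context (this is standard MSVC behaviour rather than something from our trace): mspdbsrv only enters the picture when compiling with /Zi, which routes debug info from every cl.exe instance through a single shared PDB server, whereas /Z7 embeds the debug info in each .obj so there is no shared file to fight over. The file names below are made up for illustration:

```
@rem /Zi: every concurrent cl.exe sends its debug info to one shared PDB
@rem via mspdbsrv.exe -- the serialization point the trace appears to show.
cl /c /Zi /Fdshared.pdb a.cpp b.cpp c.cpp

@rem /Z7: debug info is embedded in each .obj instead, so no mspdbsrv;
@rem the linker assembles the final PDB from the objects at link time.
cl /c /Z7 a.cpp b.cpp c.cpp
```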
I’ll be back
Using this information, one of our engineers spent some time looking into splitting up the PDBs themselves. The setup he wrote essentially creates a PDB for every “N” files, and his cursory tests show a significant increase in build performance from doing this. It has not yet been officially timed. In later parts of this series I’ll detail the effects of this change (and others).
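I don’t have his actual setup to share, but the core idea is easy to sketch with cl’s /Fd switch, which names the PDB a compilation writes into. The file names and group sizes here are hypothetical:

```
@rem Hypothetical sketch: give each bucket of N source files its own PDB so
@rem concurrent compiles contend over several small files rather than one
@rem big one. A real setup would generate these buckets from the project.
cl /c /Zi /Fdgroup0.pdb player.cpp camera.cpp input.cpp
cl /c /Zi /Fdgroup1.pdb physics.cpp collision.cpp raycast.cpp

@rem The linker merges the debug info into one final PDB for the target.
link /DEBUG /PDB:game.pdb /OUT:game.exe *.obj
```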
Where we’re going, we don’t need roads
There are obviously more brute-force methods of optimizing the latency between a code change and a built target; however, I believe that starting as you mean to go on and actually learning the core issues is worthwhile in any process like this. The subjects I’ll be looking into are:
- hierarchical header optimization
- simple header removal
- pro-active dead stripping
- IncrediBuild XGE
Please feel free to share any information you feel might help as we push forward on this project.