I work at Valve (http://www.valvesoftware.com/), focusing on optimization and reliability. Nothing's more fun than making code run 5x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
Posts by Bruce-Dawson

The Windows Performance Toolkit, also known as xperf, is a system-wide profiler for Windows. I've previously written about tracking down slowdowns in PowerPoint by using xperf’s built-in sampling profiler, but that actually understates the true value of xperf. While I think xperf is a better sampling profiler than most of the alternatives (higher frequency, lower overhead, kernel and user mode), xperf is really at its best when it reveals information that other profilers cannot measure at all.

In short, lots of profilers can tell you what your program is doing, but few profilers are excellent at telling you why your program is doing nothing.


Our story so far

Xperf has a steep learning curve, so I highly recommend that you start by reading some previous articles from this series. The entire series can be found here, but the most important ones are:

  • Recording a Trace – how to record traces, including how to add ETW events to your game
  • Xperf Analysis Basics – the essential knowledge of how to navigate the xperf UI, including how to set up symbol paths

The rest of this post assumes that you have installed xperf (preferably the Windows 8 version) and have familiarity with at least Xperf Analysis Basics.

Wait Analysis Victories

I’ve had good luck using Wait Analysis to find many performance problems. Some of these delays were short enough to be difficult to notice, yet long enough to matter. Others were debilitating. All were difficult or impossible to analyze through CPU sampling or other ‘normal’ CPU profilers. Some examples include:

  • Finding the cause of brief startup hangs in Internet Explorer and various games
  • Profiling Luke Stackwalker to find out why it caused frame rate glitches in the game it was profiling
  • Finding the cause of a 10x perf-reduction when upgrading to a newer version of Windows
  • Finding the cause of frame rate hitches during fraps recording
  • Finding the chain of lock contention that caused frame rate hitches on a heavily loaded system
  • Finding the cause of (and a workaround for) repeated 2-6 second hangs in Visual Studio’s output window

The last investigation is the one I want to cover today. It is sufficiently simple and self-contained that I can cover it end-to-end in a single (long) post.

Finding the hang

When profiling a transient problem such as a frame-rate glitch or a temporary hang the first challenge is to locate the hang in the trace. A trace might cover 60 seconds, and a hang might last for 2 seconds or less, so knowing where to look is crucial. There are a number of ways to do this:

  • Find the key stroke that triggered the hang, through logging of input events
  • Use instrumentation in the functions of interest
  • Look for patterns in the CPU usage or other data
  • Use OS hang-detection events

I’ve used all four of these techniques. Our internal trace recording tool has an optional input event logger which puts all keyboard and mouse input into the trace (watch for it). If a hang is triggered by a particular key press or mouse click then finding its start point in the trace is trivial.

Custom instrumentation (emitting ETW events at key points in your game, see the Recording a Trace post) is also a common technique. Emitting an event every frame makes a frame rate hitch plainly visible. However this doesn’t work when investigating performance problems in other people’s code, such as in Visual Studio.

In some cases a hang will be plainly visible in the CPU consumption. One recent hang showed a significant hole in the otherwise consistent CPU usage, plain as day.

A specific event that indicates the time and duration of a hang would be ideal, and Windows 7 actually has such an event. The Microsoft-Windows-Win32k ETW user provider will emit an event whenever a thread resumes pumping messages after a significant delay. Windows Vista and earlier users are out of luck, but on Windows 7 this is often exactly what is needed, and this provider is enabled by my recommended trace recording batch files.

It’s hands on time

I’ve uploaded a .zip file of a sample trace. If you download it then you can follow along. This is by far the best way to learn wait analysis.

This trace covers over ten minutes for some types of data, but the detailed sampling and context switch data only covers 28 seconds, from about 782 to 810 seconds.

Start by selecting the region where we have full data, from 782 to 810 s and cloning this selection.

Our path now depends on whether you are using the new (Windows 8) version of xperfview.exe.

Hands on with old versions of xperfview

While this exact technique is only applicable to (and only works with) old versions of xperfview, the general concept is still applicable and the exploration of generic events is crucial whether looking for your custom events or exploring the built-in events.

Scroll down to the Generic Events table. Right-click the selected region and bring up a summary table. Enable the Process Name column and put it first. Enable the ThreadID column and put it after Field 3. Move the Time (s) column and put it after the ThreadID column. I also hid a couple of columns in order to get my screenshot to fit, but that’s less critical. Now we have all of the information we need in a convenient and easy to read place.

If we drill into the data for devenv.exe and select the MessageCheckDelay Task Name we should see something like this:

The Zen of summary tables is all about looking for data columns that seem useful, enabling them, and fearlessly rearranging columns to group/sort/pivot/spindle the data to answer your question. In this case our question was when does devenv.exe (group by Process Name) hang (group by the Microsoft-Windows-Win32k provider, Task Name equals MessageCheckDelay or InputProcessDelay), and for those events, look at the TimeSinceInputRemoveMs, Thread ID, and Time (s) data.

So now, with relatively little effort, I know that devenv.exe hung (didn’t check for messages) for 5,304 ms, its message pump is running on thread ID 9,536, and the hang ended at 805.666 seconds into the trace.

Cool.

Hands on with the new version of xperfview (6.2.8229)

Microsoft is continuing to develop xperf and if you install the latest version (released Feb 29, 2012, and linked to from here) then there are a couple of options. Wpa.exe has a new UI which shows pretty graphs for UI delays:

I don’t know how to dig in deeper so I can’t tell if it is any use, so that’s all I have to say about it.

The new xperfview.exe has removed the MessageCheckDelay and InputProcessDelay events from Generic Events but has added a new UI Delay Information Graph. If you scroll down to this graph and zoom in around 800 s (in the area where we have full detail) then you should see five reports of hung apps. VTrace.exe (my trace recording application) hung for a while, there are three spurious reports of Internet Explorer hanging, and there is a MsgCheckDelay report for devenv.exe. It’s really too easy.

You can right-click to change the threshold for what delays are reported, or to bring up a summary table of delays. You’ll need to bring up the Delay Summary Table to find out the UI thread ID for devenv.exe.

Select the region around the devenv hang and we’re ready for the next step.

Finding the cause

The MessageCheckDelay is emitted at the end of the hang (805.666 seconds) and it tells us the length of the hang (5.304 s) so we now know the range of the hang quite accurately.

The hang runs from 800.362 to 805.666 seconds so we should zoom in on that area of the graphs in the xperf main window and look at CPU Usage by Process. My system has eight hardware threads (four hyperthreaded cores) so 12.5% utilization represents one busy thread. Even without that context we can see from the graph below that my CPUs are idle for most of the time. There’s a bit of devenv activity (the two blue spikes), but mostly this is an idle hang.
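The arithmetic behind those numbers is simple enough to sketch. This is only a worked example using the values quoted from the sample trace; the variable names are mine and are not anything xperf exposes.

```python
# Values read from the MessageCheckDelay event in the sample trace.
hang_end_s = 805.666     # time at which the event was emitted
hang_length_s = 5.304    # reported delay, converted to seconds

# The event fires at the end of the hang, so subtract to find the start.
hang_start_s = round(hang_end_s - hang_length_s, 3)
print(hang_start_s)  # 800.362

# Share of total CPU that one fully busy hardware thread represents
# on a machine with eight hardware threads.
hardware_threads = 8
one_thread_percent = 100 / hardware_threads
print(one_thread_percent)  # 12.5
```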

When analyzing an idle hang you should select the entire region of the hang, and it is particularly important to select the end of the hang. It is better to select a few extra tens or hundreds of milliseconds at the end rather than risk missing the crucial events that end the hang. This selection can be done with the mouse or by right-clicking and using the Select Interval command. For easy reproducibility I right-clicked and used the Select Interval command to select the region from 800.0 s to 806.0 s. I then used Clone Selection to copy it to all of the graphs.

Who woke whom?

If a thread is not running, and it then starts running, then there was a context switch that started it (the new thread) running. That context switch is recorded in our ETW trace and contains all sorts of useful information. Included in this information is (for the traces recorded with my recommended batch files) the new process name and thread ID, the call stack which the thread woke up on (which is the same one it went to sleep on), the length of time it was not running and, for threads that were waiting on some synchronization primitive, the thread that woke it up.

Ponder that, because it’s crucial. An ETW trace tells you, for each context switch, how long the thread was not running, and who woke it up. That’s why it is important to have the end of the hang selected, because that is (presumably) the time of the context switch that gets the thread running again.

In the main xperf window go to the CPU Scheduling graph (make sure the correct time range is selected), right click on the selection, and select “Summary Table with Ready Thread” to view all context switches for the selected region together with the readying thread information. Make sure the columns are in this order:

  1. NewProcess – this is the process whose thread is being scheduled to run
  2. NewThreadId – this is the thread ID of the thread being scheduled to run
  3. NewThreadStack – this is the stack that the thread will resume running at
  4. ReadyingProcess – this is the process, if any, that readied the new thread
  5. ReadyingThreadId – guess. Go ahead, you can figure it out.
  6. ReadyingThreadStack – this is the stack of the readying thread when it readied the new thread
  7. Orange bar – columns to the left of this are used for grouping, columns to the right are for sorting and data display
  8. Count – how many context switches are summarized by each row
  9. Sum:TimeSinceLast (us) – the time the new thread was not running (time since it last ran) summed across all context switches summarized by each row

There are more columns, and for deeper analysis they can be useful, but we don’t need them today.
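To make the grouping and sorting concrete, here is a minimal Python sketch of what this column arrangement computes. The records are hypothetical (only the 5.202 s gap on thread 9,536 comes from the sample trace) and the tuple layout is my own simplification, not xperf's data format.

```python
from collections import defaultdict

# Hypothetical context-switch records:
# (new_process, new_thread_id, time_since_last_us)
switches = [
    ("devenv.exe", 9536, 5_202_000),  # the long idle gap from the trace
    ("devenv.exe", 9536, 1_500),
    ("devenv.exe", 1234, 40_000),
    ("System", 5880, 300),
]

def summarize(records):
    """Group by (process, thread) and return {key: (count, sum_us)}."""
    totals = defaultdict(lambda: [0, 0])
    for process, thread_id, idle_us in records:
        entry = totals[(process, thread_id)]
        entry[0] += 1          # Count column
        entry[1] += idle_us    # Sum:TimeSinceLast (us) column
    return {key: tuple(value) for key, value in totals.items()}

summary = summarize(switches)

# Sort descending by Sum:TimeSinceLast, like clicking the column header
# until the arrow points down.
rows = sorted(summary.items(), key=lambda kv: kv[1][1], reverse=True)
for key, (count, total_us) in rows:
    print(key, count, total_us)
```

Columns to the left of the orange bar play the role of the dictionary key here, and columns to the right are the aggregated values.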

With our columns thus arranged we can quickly find our problem. Find devenv.exe (be sure to find the correct PID if multiple versions are running) and expand it, find the thread of interest (9,536, from the MessageCheckDelay event), then expand the stack. If you click the “Sum:TimeSinceLast (us)” column so the little arrow is pointing down then as you drill down into the stacks (hint: select the top node and then repeatedly press right-arrow) it will go down the hot call stack. In the sample trace, over the selected region, thread 9,536 starts with a total of about 5.523 s of non-running time over 316 context switches. As we drill down we get to a single context switch that ended an idle gap of 5.202 s. That’s our hang, clear as day.

The NewThreadStack for this 5.202 s call stack starts at _RtlUserThreadStart and winds through a lot of Visual Studio code. Microsoft is kind enough to publish symbols for much of VS, as well as for Windows, and about fifty rows down we get to the interesting details:

It’s a single context switch (‘count’ goes down to one when we got a bit lower in the call stack) that put the Visual Studio UI thread to sleep for 5.202 s. It doesn’t get much clearer than that.

If we go down to the bottom of the stack and expand the next three columns (compressed in the screen shot above for size reasons) then we can see who woke us, which can also be described as “who we were waiting for”:

In this case it was the System process (thread 5880) in an IopfCompleteRequest call that goes through MUP.SYS. If we know what MUP.SYS is then that gives us another clue as to the root cause, but even without that we know that Visual Studio called CreateFileW and it took a long time to return.

What about the other threads?

In our selected region there are context switch events for 11 threads in devenv.exe. For all of those threads the Sum:TimeSinceLast value is greater than for 9,536, the thread we are investigating. So why aren’t we looking at them?

It’s important to understand that Sum:TimeSinceLast just measures how long a thread was idle, and there is nothing wrong with a thread being idle. A thread being idle is only a problem if it is supposed to be doing something and isn’t. In fact, if devenv.exe has 11 threads then they had better be idle most of the time or else my machine, with its eight hardware threads, is going to be constantly busy.

Many of the threads have a Sum:TimeSinceLast time of about 15 s, which is significantly longer than the 6 s time period selected. That’s because this summary table shows all of the context switches that occurred during this time period, and the first context switch for these threads was after they had been idle for a very long time, most of that time outside of the selected region.
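A small sketch shows how one context switch can report more idle time than the selection is long. The numbers are hypothetical, chosen to match the "about 15 s" case, and the helper assumes the switch itself happens inside the selected window.

```python
def idle_inside_window(switch_in_s, time_since_last_s, window_start_s):
    """Split one context switch's idle time into the part inside the
    selected window and the part that happened before it started.
    Assumes switch_in_s is at or after window_start_s."""
    idle_start_s = switch_in_s - time_since_last_s
    outside_s = max(0.0, window_start_s - idle_start_s)
    inside_s = time_since_last_s - outside_s
    return inside_s, outside_s

# Hypothetical thread: switched in at 801.0 s after being idle for 15.0 s,
# with the selection starting at 800.0 s.
inside, outside = idle_inside_window(801.0, 15.0, 800.0)
print(inside, outside)  # 1.0 14.0
```

The whole 15 s is attributed to the context switch, even though only 1 s of it falls inside the selection.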

The reason we are looking at thread 9,536 is because (according to the MessageCheckDelay event) it is the UI thread and it went for 5.304 s without pumping messages. It kept me waiting, and that makes me angry. You wouldn’t like me when I’m angry.

File I/O summary table

Since we know that the hang is related to file I/O we should look at what file I/O is happening during this time period. Luckily this information is recorded by the xperf batch files that I recommend.

On the main xperf window go to the File I/O graph and bring up a summary table, for the same time region we’ve been using so far. Arrange the columns as shown below and drill in as usual. I’m sure this screen shot won’t show up very well, but I can’t shorten it any more. It contains too much glorious information. Click on the image for deeper details:

We can see here that a Create file event, from devenv.exe, thread 9,536, took 5.203 s, trying to open \Device\Mup\…, and that ultimately the network path was not found.

Wow.

It turns out that \Device\Mup, or MUP.sys, means the network. The hang is because Visual Studio tried to open a non-existent network file, and sometimes that causes a 5.2 s network timeout. Hence the hang.

The remainder of the hang is from a few other context switches and CPU time that account for the rest of the 5.304 s, but the one long bit of idle time is all that matters in this case. It’s particularly clean.

What’s the cause?

The file name associated with this hang is quite peculiar. The full name is:

\Device\Mup\perforce\main\src\lib\public\win64\vdebug_tool.lib#227 - opened for edit

That doesn’t look like a file name. That looks more like the output from Perforce. And that’s exactly what it is. At Valve we store build results in Perforce so we have pre-build steps to check these files out. The checkout commands print their results to the Visual Studio output window like this:

//perforce/main/src/lib/public/win64/vdebug_tool.lib#227 - opened for edit

Visual Studio ‘helpfully’ reverses the slashes and decides that this represents a file name on \\perforce\main. Since this whole thing started with me pressing F8 (actually double-clicking the output window in this reenactment) this means that Visual Studio was trying desperately to treat this potential file name as a source-file name associated with an error or warning message.
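Here is a rough Python sketch of that transformation. It only mimics the slash reversal described above; it is my guess at the behavior, not Visual Studio's actual parser, and the function name is made up.

```python
def looks_like_unc(output_line):
    """Reverse the slashes and report whether the result starts like a
    UNC path (two leading backslashes). An assumption about the
    behavior, not Visual Studio's real parsing code."""
    candidate = output_line.replace("/", "\\")
    return candidate, candidate.startswith("\\\\")

line = "//perforce/main/src/lib/public/win64/vdebug_tool.lib#227 - opened for edit"
candidate, is_unc = looks_like_unc(line)
print(candidate)
print(is_unc)  # True
```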

Oops.

That explains the CResultList::AttemptToNavigate entry on the hang call stack – everything makes more sense once you understand the problem.

Conclusion

Once the cause of the hang was understood I modified our pre-build step to pipe the output through sed.exe and had it rewrite the output so that Visual Studio would no longer find it interesting. This avoids the hang, and it also means that F8 takes the selection to interesting errors and warnings, instead of to these mundane progress messages. A little sed magic replaces “//” with the empty string, and “...” with “---”:

sed -e s!//!! -e s!\.\.\.!---!

This changes the hang-prone results before:

to the hang-proof benign text after:
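The same two substitutions can be sketched in Python, applied to the Perforce line quoted earlier. This is an illustration of what the sed command does to one line, not the pre-build step itself; like sed without the g flag, it replaces only the first match of each pattern.

```python
def defang(line):
    """Apply the same two substitutions as the sed command:
    drop the first "//" and turn the first "..." into "---"."""
    return line.replace("//", "", 1).replace("...", "---", 1)

before = "//perforce/main/src/lib/public/win64/vdebug_tool.lib#227 - opened for edit"
after = defang(before)
print(after)
# perforce/main/src/lib/public/win64/vdebug_tool.lib#227 - opened for edit
```

Without the leading double slash the line no longer reverses into something that looks like a network path.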

I also reported the bug to the Visual Studio team. Having a trace is very powerful for this because it meant that I could tell them definitively what the problem was, and I could share the trace in order to let them confirm my findings. Just like minidump files are a powerful way to report crash bugs, xperf traces are a powerful way to report performance bugs. The Visual Studio team has told me that this bug will be fixed in Visual Studio 11 – UNC paths will be ignored by the output window’s parser.

Mup.sys is the driver used for network file I/O. Therefore its presence on the Readying Thread stack was a clue that a network delay was the problem. Doing file I/O on the UI thread is always a bit dodgy if you want to avoid hangs, and doing network file I/O is particularly problematic, so watching for mup.sys is a good idea.

Wait chains

Some wait analysis investigations are more complicated than this one. In several investigations I have found that the main thread of our game was idle for a few hundred milliseconds waiting on a semaphore, critical section, or other synchronization object. In that case the readying thread is critical because that is the thread that released the synchronization object. Once I find out who was holding up the main thread I can move the analysis to that thread and apply either busy-thread analysis (CPU sampling) or idle thread analysis (finding what that thread was waiting on). Usually just one or two levels of hunting is needed to find the culprit, but I did recently trace back across six context switches in four different processes in order to track down an OS scheduling problem.

When following wait chains it is important to understand the order of events. If thread 1234 is readied by thread 5678 at time 10.5 s, then any context switches or CPU activity that happen to thread 5678 after that point are not relevant to the wait chain. Since they happened after thread 1234 was woken they can’t be part of its wait chain.

For CPU activity this is dealt with by selecting the region of interest. For context switches this is dealt with by drilling down all the way and then looking at the SwitchInTime (s) column (which you may want to move to a more convenient location). This column records the time of the context switch.
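The ordering rule can be sketched as a small walk backwards through readying events. The events below are hypothetical, reusing the thread IDs from the example above, and the tuple layout is my own simplification of what the summary table shows. A real investigation also has to decide at each step whether the readying thread was busy or idle.

```python
# Hypothetical readying events: (time_s, readied_thread, readying_thread)
ready_events = [
    (10.2, 5678, 9012),
    (10.5, 1234, 5678),
    (10.9, 5678, 4321),  # after 1234 was woken, so not part of its chain
]

def wait_chain(events, thread_id, before_s):
    """Walk back from thread_id, at each step using only the latest
    readying event that happened strictly before the current time."""
    chain = []
    while True:
        candidates = [e for e in events
                      if e[1] == thread_id and e[0] < before_s]
        if not candidates:
            return chain
        time_s, _, readier = max(candidates)  # latest qualifying event
        chain.append((time_s, thread_id, readier))
        # Continue with the readying thread, ignoring anything later.
        thread_id, before_s = readier, time_s

chain = wait_chain(ready_events, 1234, before_s=10.6)
print(chain)  # [(10.5, 1234, 5678), (10.2, 5678, 9012)]
```

The event at 10.9 s is skipped because it happened after thread 1234 had already been woken.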

It’s worth pointing out that if you busy wait (spinning on some global variable flag) or use your own custom synchronization primitives (CSuperFastCriticalSection) then these techniques will not work. The OS synchronization primitives are instrumented with ETW events that allow, in almost all cases, perfect following of wait chains. Even if your custom synchronization code is faster (and it probably isn’t) it isn’t enough faster to make up for the loss of wait analysis. The ability to profile your code trumps any small performance improvement.

Can’t any profiler do this?

Sampling profilers and instrumented profilers might be able to tell you that your program is idle, and they might even be able to tell you where your program is idle, but they generally can’t tell you why your program is idle. Only by following the chain of readying threads and looking at other information can you be sure to find the cause of your idle stalls.

It’s also convenient that you can leave xperf running in continuous-capture mode, where it is constantly recording all system activity to a circular buffer. When you notice a problem you can just record the buffer to disk, and do some post-mortem profiling.
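The idea of a circular buffer that is only written out on demand can be sketched in a few lines of Python. This is just an illustration of the concept with made-up names; it says nothing about how xperf implements its buffers.

```python
from collections import deque

class CircularTrace:
    """Keep only the most recent `capacity` events, like a
    continuous-capture buffer that overwrites its oldest data."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)  # oldest events fall off

    def record(self, event):
        self._events.append(event)

    def snapshot(self):
        """Return what would be saved when a problem is noticed."""
        return list(self._events)

trace = CircularTrace(capacity=3)
for i in range(5):
    trace.record(f"event {i}")
print(trace.snapshot())  # ['event 2', 'event 3', 'event 4']
```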

It’s not baking

Baking is all about precisely following a recipe – improvisation tends to lead to failure. Wait analysis, on the other hand, is all about creativity, thinking outside the box, and understanding the entire system. You have to understand context switches, you have to think about what idle time is good and what is bad, you have to think about when to look at CPU usage and when to look at idle time, and you often have to invent some new type of analysis or summary table ordering in order to identify the root cause. It’s not easy, but if you master this skill then you can solve problems that most developers cannot.