Use Xperf's Wait Analysis for Application-Performance Troubleshooting

A powerful capability for pinpointing the cause of UI hangs and contention issues in an application

Trey Nash

December 21, 2011

13 Min Read
ITPro Today logo

Xperf has been a common theme in my recent articles, for very good reason. The tool's capabilities, coupled with the Windows operating system'sbuilt-in Event Tracing for Windows (ETW), are seemingly limitless.

Imagine that you have a desktop application that suffers some sort of performance problem in which the UI hangs intermittently. Perhaps the problemhappens only in select environments outside your test lab. You are unable to reproduce the issue and therefore unable to capture the situation in alive debug. What should you do?

Often, if an application's UI hangs, it's because the thread that is responsible for pumping messages has temporarily quit doing so. Perhaps the UIthread is performing some long CPU-bound work. In that case, you can use Xperf as a sample profiler, to find the code that is responsible for the CPUspike. I wrote about this scenario in my previous article, "Windows Application Performance Profiling with Xperf."

Another potential cause of the UI freeze is that the message pump thread has entered a wait state, waiting for some object (e.g., an event) to becomesignaled. In this case, sample profiling with Xperf won't help you isolate the problem. You need Xperf's wait analysis capabilities instead.

What is wait analysis? Essentially, Xperf can use the data in the captured .etl (Event Trace Log) file to match a thread that is waiting on an objectwith the thread that signaled that object. This capability is extremely powerful because it can also help you diagnose general latency and contentionwithin an application. In Xperf and ETW parlance, when a thread is switched out from the running state on a processor and a new thread is switched in,the ETW Kernel Logger can generate a CSwitch event to log the transition. Naturally, a thread that enters a wait state is immediately switched out. Inaddition, when another thread signals an object on which threads are waiting, the signaling thread also places the waiting threads into the readystate. This operation is captured by a ReadyThread event. Xperf, with its amazing analysis capabilities, can match related CSwitch and ReadyThreadevents, allowing you to follow the chains of these events.

The Sample Application

To demonstrate this capability, create a simple MFC application within Visual Studio (VS) 2010. I created a project called StuckUI and used the MFCApplication wizard to create a dialog-based MFC application. Then, I placed one button on the dialog and wired that button to the following handler:

void CStuckUIDlg::OnBnClickedButtonWork(){    HANDLE hEvent = CreateEvent( NULL, TRUE, FALSE, NULL );    QueueUserWorkItem(        WorkFunction,        hEvent,        WT_EXECUTEDEFAULT    );    WaitForSingleObject( hEvent, INFINITE );}

Notice that in this handler, I create an event, hand it off to WorkFunc via the process thread pool, and then wait on it. Since the button handler runsin the context of the main UI thread, I can easily hang the UI in my application by writing WorkFunc like this:

DWORD WorkFunction( void* param ){    HANDLE hEvent = (HANDLE) param;    Sleep( 5000 );    SetEvent( hEvent );    return 0;}

This approach will effectively hang the UI thread of the StuckUI application for roughly 5 seconds. Now, let's use Xperf to diagnose the problem.

Capturing the Trace

Since I'm on an x64 platform, I built my sample application, which I named StuckUI.exe, to match that platform. To capture the trace, I used Xperf:


Xperf -on Base+CSWITCH+DISPATCHER -stackwalk CSwitch+ReadyThread -BufferSize 1024 -MinBuffers 50 -MaxBuffers 50StuckUI.exeXperf -d StuckUI.etl


There are several things to note when capturing the trace in this way. Notice that I used the Base kernel group and added CSWITCH and DISPATCHER. Thelatter two kernel flags capture the CSwitch and ReadyThread events that are necessary to perform wait analysis. Additionally, the CSwitch andReadyThread flags after the stackwalk argument capture the necessary stacks when these events are generated. If you are suspicious that you might alsohave a CPU spike and haven't yet ruled that out, you can replace Base with Latency and add Profile to the stackwalk list. However, if you've alreadyruled out a CPU spike, then it's always best to capture only the necessary data so that your trace file does not grow unnecessarily large.

Since CSWITCH and DISPATCHER can quickly generate many events, I also gave Xperf a hint about how much of a memory buffer to allocate in non-pagedmemory. In this case, I specified a buffer size of 1MB and told Xperf to pre-allocate 50 buffers. I accomplished these tasks by setting MinBuffers andMaxBuffers to the same value -- 50 -- which reserves 50MB of non-paged memory. Depending on your situation, you might need to massage these buffernumbers so as not to lose events. As you can imagine, Xperf is extremely fragile to missed ETW events and will notify you of such events later, whenyou use Xperf to analyze your trace. It is next to impossible to trust the Xperf analysis if events are missing from the trace.

Don't forget that to capture stack traces on x64 platforms, you must disable paging of the executive and restart, if you haven't already done so. Youcan conveniently set the necessary registry setting by using this command line:

REG ADD "HKLMSystemCurrentControlSetControlSession ManagerMemory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f


Analyzing the Trace

Now the fun begins! Open the resulting ETL file in Xperf. You're presented with several graphs, depending on the configuration of Xperf, which isautomatically saved each time you run the tool. (Xperfview.exe is the application that is used for analyzing the ETL. You can launch this applicationimplicitly, simply by passing your ETL file to Xperf.exe.) You should see a graph called Process Lifetimes.

If you don't see the Process Lifetimes graph, you might need to enable it via either the Graphs menu or the fly-out on the left side of the main Xperfwindow. Incidentally, if you don't see the option to turn on a specific graph that you're looking for, it's usually because your trace was not set upto capture the relevant information for that graph.

I started and stopped the sample application during the trace collection, so you should see diamond shapes in the Process Lifetimes chart; these shapesindicate the start and end times of the StuckUI.exe process. Notice that most of the processes that are running on the machine are grouped together inone line, with no specific start and stop times other than the beginning and end of the trace. If you want, you can separate all the processes byright-clicking and choosing Toggle Permanent Process Grouping.

If I know that the trace has captured the start and end times of my process, I like to select that range within the Process Lifetimes graph,right-click it, and select Clone Selection. This action replicates the selection across all the currently visible graphs. 'Notice that the start andstop diamonds act as snap points, making it easy to select the exact range of the StuckUI.exe process lifetime, as I've done in Figure 1.


Figure 1: Process Lifetimes graph



To start the wait analysis, first ensure that you have set up Xperf to point to the Microsoft public symbol store, as well as the path that containsthe private symbols for your application. For example, my symbol path, which includes the directory to the Project Database (PDB) for StuckUI.exe on mymachine, is as follows:

srv*c:symbols*http://msdl.microsoft.com/download/symbols;C:StuckUIx64Debug

To set up the symbol paths, select Trace, Configure Symbol Paths, from the menu. To load the symbols, choose Trace, Load Symbols. You mightoccasionally see the Xperf UI freeze or hang. This issue is caused by symbol loading and could happen at various times, since Xperf loads symbols ondemand. Depending on how many symbols you already have cached and the speed of your connection, these hangs could last for several minutes.

Alternatively, you can set the environment variable _NT_SYMBOL_PATH to set up the symbol path for Xperf. In the Configure Symbol Paths dialog box, youwill also see a value for _NT_SYMCACHE_PATH, which typically defaults to C:symcache. This is an internal, Xperf-formatted symbol cache. Xperf reallyneeds only a subset of the symbol information from the PDB. Therefore, it skims the relevant information from the PDBs and puts it in a file in symbolcache. If your symbol cache is sufficiently complete, you might not even need to set up _NT_SYMBOL_PATH and can simply point Xperf at the symcachepath. This option might help avoid the annoying delays caused by Xperf symbol loading.

Now, enable the CPU Scheduling graph (if you haven't already) and select the appropriate range, which might even be the entire span of the trace (inwhich case no selection is necessary). My wait analysis graph looks like the one in Figure 2, but yours might look much different, depending on theamount of cores and work that are active on the machine during the trace.


Figure 2: CPU Scheduling graph


Next, right-click the CPU Scheduling graph and select Summary Table withReady Thread.

The most powerful component of the summary table is the gold bar, or aggregator bar. Any column that is enabled and is to the left of this bar becomesa sorting column. The ordering of these columns is important. For example, my summary table begins with the following columns, which are all to theleft of the gold bar: NewProcess, NewThreadId, NewThreadStack, Max:Waits (us), ReadyThreadStack, ReadyingProcess, and ReadyingThreadId. So first, Isort by process, then by thread ID, then by thread stack, and so on. If one of these columns is not visible in your chart, then make sure that it isenabled in the menu or the fly-out. To reorder the columns, you can drag and drop them. If you have many columns enabled and cannot see them allwithout horizontal scrolling, you can drag the column names within the fly-out to reorder them.

After locating the StuckUI.exe process in the NewProcess column, click the plus sign next to that process to expand the tree and reveal the threadswithin the process. You will then see the process threads in the NewThreadId column. In the NewThreadStack column, you should see two expandable itemsnamed [Root]. These items are the stacks that were captured during each CSwitch event. As you drill down the stacks and expand them, you will encounter"forks in the road," so to speak, where the call tree branches.

In an analysis like this, the more information you have, the easier the analysis will be. For example, I know that the UI message pump is stuck andthat a UI window (HWND) is associated with one thread at most. Knowing these things, I can use Spy++ to identify the thread that owns the UI. In thisexample, identifying the right thread is easy because StuckUI.exe has only two threads recorded in the ETL. But if I were troubleshooting explorer.exeand an extension that it loaded, then knowing which of the many Explorer threads to focus on would be helpful indeed. In Figure 3, I have drilled downa few levels into the stacks of thread 9888, which happens to be my UI thread.


Figure 3: A portion of the CSwitch stacks for thread 9888



Now, here's the cool part! For each path in the stack tree, the Max:Waits (us) column displays the maximum amount of time, in microseconds, that theparticular code branch has waited. I observed that the UI was stuck for about 5 seconds, so it isn't surprising to find a value close to 5,000,000 inthis column. If you keep drilling down into the stack, following the branch paths of the maximum wait in which you're interested, you will get to thetop of the stack, where the thread was switched out as it entered the wait state. That's when the magic of Xperf's wait analysis reveals itself,because when you reach that point in the NewThreadStack column, you will find some vital information in the ReadyThreadStack column, as Figure 4 shows.


Figure 4: Top of the waiting thread


Notice that I've called out the location in the waiting stack where CStuckUIDlg::OnBnClickedButtonWork has called WaitForSingleObjectEx, leading to thethread's wait state. If you expand the ReadyThreadStack column, you will see the exact call stack that readied the waiting thread. In theReadyingProcess and ReadyingThreadId columns, you'll see values that indicate which thread of which process performed the readying work. If you'rewondering on which kind of object the waiting stack was waiting, the readying stack might contain the answer. For example, the marked area in Figure 5shows the call to SetEvent in the readying stack.


Figure 5: The readying stack

You might have noticed that the other thread within StuckUI.exe also had a max wait that was close to 5 seconds. If you drill down into that stack, asFigure 6 shows, you can see the reason.


Figure 6: The sleeping thread

The waiting stack indicates that this thread was the worker thread that called our WorkFunction and, subsequently, SleepEx. The sleep timeout passedwas 5 seconds, which explains the max wait value. However, take a look at the readying stack and the associated process and thread ID. The thread ID ofzero indicates that it is the idle thread that executes when there is essentially nothing for the processor to do. The idle thread is also responsiblefor putting the processor to sleep momentarily, to conserve energy. But why is the idle thread the readying thread? Because SleepEx is implemented byusing a kernel timer, which fires a Deferred Procedure Call (DPC) to KiTimerExpiration within the kernel at the required time to expire the timer. Inaddition, the idle thread is a common point from which DPCs are executed. To learn more about DPCs, refer to Windows Internals 5th Edition, byMark Russinovich, David Solomon, and Alex Ionescu (O'Reilly Media).

The last thing I want to point out is that the ETL also recorded the exact time that the waiting thread was switched out, that the readying threadreadied the waiting thread, and that the waiting thread was allowed to execute again. You can see this information by arranging the LastSwitchoutTime(s), ReadyTime (s), and SwitchInTime (s) columns just to the right of the gold bar, as shown in Figure 7, where the information is revealed after youexpand the readying stack.


Figure 7: Thread timings



Notice that there is a small amount of time between the thread being readied and the thread actually being switched back onto the processor. This delayis understandable: When a thread is readied, it is placed in the ready queue for the processor. The ready queue is priority-based, so the readiedthread might have to wait behind other ready threads that have a higher priority.

An Invaluable Tool

Yet again, Xperf proves its amazing capabilities to analyze your application's execution. (For more details about Xperf, see the sidebar "Where to GetXperf" below.) In this article, I've demonstrated how you can use Xperf to track down waiting threads and discover how to set them free. However,this is only one use of Xperf's wait analysis. You can also use it to analyze general latency issues, by analyzing which code paths are waiting onwhich threads. Given the advances in parallel development libraries, this capability is invaluable in validating the performance gain of theparallelization of applications. Happy troubleshooting!


Trey Nash ([email protected]) is a senior escalation engineer at Microsoft working on the Windows OSs and various other products. When he is not workingfeverishly within the bowels of the OS, he is delivering training on .NET platform and kernel mode debugging or playing ice hockey.

Where to Get Xperf


Xperf is part of the Windows Performance Tools (WPT) Kit, which ships with the Windows SDK. As of this writing, the most recent release is contained in Windows SDK 7.1. You can learn more about the WPT Kit here.

Read more about:

Microsoft
Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like