Analyzing Wait Events for Faster Database Troubleshooting
March 29, 2017
Most database administrators, regardless of platform, are familiar with the concept of wait events. Wait events are a natural part of a database engine processing requests. When a connection is made to the database engine, a session ID (SPID) is assigned, which allows for requests to be sent; we often just call these queries.
Each request can be in one of three possible states: running, runnable, or suspended. Suspended just means that the SPID is waiting for a resource, such as a page to be read from disk into memory. When a SPID is suspended, the reason it is waiting is logged as a wait event. Runnable means the SPID is waiting for an available scheduler, usually called a processor or CPU. Running means exactly that: the SPID is currently running.
Being a mathematics major back in college, I enjoy putting things into groups. That’s why when I explain the concept of wait events, I like to think about them in groups. The first groups—running, runnable, and suspended—we have already discussed. The next set of groups are the wait events themselves, and I’ll categorize them using the following names: internal, resource, and external.
Let’s take a look at what each of these groups mean to me:
Internal Waits
Internal waits are commonly known as locking, blocking, and deadlocking. The reason locking, blocking, and deadlocking happens is due to transaction isolation, a necessary setting for relational databases as they maintain the ACID properties of a transaction. Over the years, I have been called to investigate performance issues and found that users simply experience the effects of the chosen isolation level—the default for Microsoft SQL Server is READ COMMITTED, and for Microsoft Azure SQL Database, READ COMMITTED SNAPSHOT.
Differences in isolation levels can affect application performance and throughput. This doesn’t mean that if you see locking and blocking, you should alter your transaction isolation level. You will need to investigate why the locking is happening. The point I want to make here is that this group of internal waits will (likely) not be solved by throwing hardware at the problem. While it is possible that the root cause may be a resource constraint (see below), you will want to investigate changes to your code and schema before spending thousands of dollars on new hardware to “fix” locking, blocking, or deadlocking.
Resource Waits
These are the more commonly known waits that we see and hear often, with names like PAGEIOLATCH_EX or CXPACKET. These waits are logged when a session is in the suspended queue as described above. I group these as resource waits because they are tied to one of the four main physical hardware resources: CPU, memory, disk, and network.
Here’s the thing you need to understand about resource waits: there are only two possible solutions. First, use less. Second, buy more. Consider physical memory as an example. If you have a query that consumes your entire buffer pool, causing performance issues for every other query, you would either tune the query or add more memory to the instance. Once you start to think about resource waits in this manner, it becomes easier to decide what actions to take to solve a problem.
External Waits
These waits are logged when a session is running(!) and are easily spotted because they all start with the word PREEMPTIVE in their names. These waits are known as non-cooperative waits, because SQL Server is being forced to give up control of the scheduling for these tasks. This is often the result of SQL Server needing to do an external—thus, my name for this group—call to the operating system for something such as an external stored procedure, or the use of SQL CLR objects. It is interesting to note that as far as SQL Server is concerned, the task is running, but SQL Server has no idea about the status of the remote call. This makes troubleshooting these waits tricky.
To further complicate matters, these waits are not very well documented. The best way to research these waits is to do a search for the name of the wait, but remove “PREEMPTIVE.” For example, for the PREEMPTIVE_OS_GETADDRINFO wait event, go to the Windows Dev Center and search for GETADDRINFO.
In Closing
I find that thinking in groups helps me get to the root cause of issues faster. When it comes to wait events, this thinking along with using wait-based analysis tools has served me well through the years. In the same process as a decision tree, we can take the wait event, assign it a group, and know the tuning options that we have from there. The end result is an efficient use of resources to solve the problem, but more importantly, we will have a faster time to resolution for issues.
About the Author
You May Also Like