Using Index Operational Stats to Drill Into High Latch and Lock Waits

Part Two of a series on dm_db_index_operational_stats

Tim Ford, Owner

August 26, 2016

16 Min Read
ITPro Today logo in a gray background | ITPro Today

In part one of this series on the index operational statistics Dynamic Management Function (sys.dm_db_index_operational_stats) I focused on showing how different query behavior manifested itself in the results returned from querying this Dynamic Management Function (DMF). I also provided you with a generic script to query the full width of columns from that DMF and compared the information that can be derived from this object to that of its companion Dynamic Management View (DMV): sys.dm_db_index_usage_stats. In this next article in the series I’ll be exploring how you identify those objects causing the most issues around lock and latch waits: when you should be concerned enough to look at latching and locking contention as well as what queries to run to identify your main culprits. Finally we’ll look into general points you’ll want to then focus on when it comes to using this information to resolve locking and latches that are unreasonable detractors from a high-performance SQL Server environment.

Recap

For a complete picture of the basics on sys.dm_db_index_operational_stats please review the first article in this series.  As a refresher here is how you query sys.dm_db_index_operational_stats and the high-interest columns you should be concerned with understanding thoughout this series:

SELECT * FROM sys.dm_db_index_operational_stats(    DB_ID(),     ,     ,     );

You’ll note what may be an odd coding construct for you if you’re unfamiliar with Template Parameters in SQL Server Management Studio. If you’re not familiar with the concept of Template Parameters like I used above please consult this article.  The tldr; (too-long-didn’t read) on this is to use the Ctl+Shift+M keyboard shortcut in SQL Server Management Studio (SSMS) when you see the syntax of to replace those placeholders with values for your specific needs.

If left unadulterated that command will provide you with the results covering all objects (indexes and heaps) and any associated indexes without regards to limiting to a specific partition. Of course by doing so you get a great amount of information returned but it has little value because you have no context for the results. This is why I always end up joining the indexing DMOs to other system views that provide that context for the results (as well as filtering down the rows that are returned as well as the columns I care to see.  For that contextual requirement those system views are:

  • sys.indexes – provides information on your SQL Server indexes scoped to the database level, including name, type of index (clustered, non-clustered), uniqueness, and more.

  • sys.objects – while I could use the OBJECT_NAME(object_id) system function to return the name of the table or view associated with the object_id exposed by sys.dm_db_index_operational_stats I also need to filter the results because I’m only interested in user objects – not those system tables and views that are used internally by SQL Server. To do so I need access to the is_ms_shipped column in sys.objects. While I’m there I may as well return the object name (name column) and type of object (type_desc).

That leaves us with a base structure of:

SELECT O.name AS [object_name] , O.type_desc AS object_type, I.name AS index_name, I.type_desc AS index_type, ixO.*FROM sys.dm_db_index_operational_stats    (        DB_ID(),         ,         ,             ) AS ixO    INNER JOIN sys.indexes I         ON ixO.object_id = I.object_id             AND ixO.index_id = I.index_id    INNER JOIN sys.objects AS O        ON O.object_id = ixO.object_idWHERE O.is_ms_shipped = 0;

Of course you’ll want to narrow the column scope down for sys.dm_db_index_operational_stats and also provide some level of filtering through employing search predicates via a WHERE clause but this construct is a good foundation for where we will go from here. For more information on the full range of columns in sys.dm_db_index_operational_stats and their purpose please review that first article in this series.  That being said let’s move on and focus on sys.dm_db_index_operational_stats, locking, and latching.

A Very Brief Discussion on Locks and Latches

One of the most important insights you can collect from sys.dm_db_index_operational_stats is focused upon which objects victim to the most latch and lock wait time. Before I continue to show how to identify these contributors to performance degradation I want to take a few paragraphs of your time to give you a primer on just what latches and locks are in terms of Microsoft SQL Server.

Locking is critical in a relational database to allow for transactions to meet the ACID test:

  • Atomicity: a transaction either needs to complete all its steps or fail to the original state.

  • Consistency: a transaction will always behave in identical fashion.

  • Isolated: transactions are protected from external influence until they complete.

  • Durable: the RDBMS will maintain a record of uncommitted (aka incomplete) transactions in the event of recovery due to failure.

These requirements are met through complex locking processes to ensure that concurrent requests from users do not affect or interact with the outcomes of any other requests. Locks are taken on rows or pages that are involved in open transactions and are released when those transactions are committed or are rolled back. Locks are also taken on objects involved in schema changes to those objects. There is a complex set of rules and logic around locking behavior in Microsoft SQL Server and there are various types of lock modes as well (briefly outlined below):

  • Shared (S) Locks: involved in read-only operations that don’t modify data.

  • Update (U) Locks: used in UPDATE transactions to prevent multiple transactions from attempting to update the same row concurrently.

  • Exclusive (X) Locks: associated with operations that will change data through an INSERT, UPDATE, or DELETE query to ensure more than one transaction does not attempt to change the same row at the same time as another transaction.

  • Schema (Sch) Locks:  granted when an object’s definition is being modified – for instance if a column is being added to a table.

  • Intent (I) Locks: these locks are applied to enforce a lock “pecking order” as it were. Consider this a lock that informs the internal engine that a transaction is queued to eventually take the next lock of a type of IS (intent shared), IX (intent exclusive), or SIX (shared, with intent exclusive).

  • Bulk Update (BU) Locks: used in certain conditions when employing bulk copying and supplying locking hints as part of the operation.

  • Key-Range Locks: These locks come into play when using the most restrictive transaction isolation level (serializable – see brief explanation of transaction isolation levels in the next paragraph.) This type of lock protects a range of rows – not simply a single row involved as the target of a transaction.

Certain locking behavior can change based upon what isolation level a transaction in Microsoft SQL Server is employing. An isolation level identifies the degree to which transactions are isolated from one another; what types of locks prevent other locks from being applied and enforced for example. Transaction isolation is a much larger topic and is not covered additionally here. For more information on isolation levels in Microsoft SQL Server please view the official documentation here.

Latches are often explained as “SQL locks on memory”.  In a general sense this is the behavior that latches manifest, but it’s not an accurate description. Latches are like locks in that they provide controlled access to the object on which they are applied. Locks control access to database rows, pages, views, and tables among other objects. Latches, on the other hand, provide controlled access to SQL Server memory-resident objects. These objects residing in memory fall into two classes: buffer objects and non-buffer objects. Unlike locks, latches are taken and released as needed and not held for the duration of a transaction; multiple latches can also be assessed on the same page.

Like locks, latches have various modes:

  • Destroy (DT) Latch: employed when destroying and evicting a buffer from cache.

  • Exclusive (EX) Latch: takes exclusive ownership of a page being written. Allows no other latches on the same page.

  • Keep (KP) Latch: similar to the purpose of an intent lock – to maintain a record of latch order as well as to make sure it stays in the buffer cache as another latch is being placed on it.

  • Shared (SH) Latch: applied when a read request to the page is granted.

  • Update (UP) Latch: similar to, but not as restrictive as an exclusive latch in that it allows reads against the page but prevents writes.

Furthermore there are two forms in which latches manifest themselves: basic latches and IO latches. The pure form of the latch occurs when the page is already in memory and just requires a new or additional latch applied. IO latches are applied when a page doesn’t exist in memory and needs to be acquired from disk and provisioned to the buffer cache.

 

SQL Server Waits

Ultimately latches and locks maintain consistency and order in the database and the memory objects associated with the SQL Server instance. They’re traffic cops. However, like traffic cops their application of order comes at a cost: performance. Locking and latching causes blocking which in turn slows the response time for completing transactions. A great number of factors contribute to overall performance in a database: internal factors such as schema architecture, indexing, stored procedure coding, query structure, and transaction isolation level all play a role. Externally sourced factors such as network latency, application code, hardware, and other factors come into play. Every time SQL Server has to wait for a resource it needs in order to satisfy a request it takes note of how long it had to wait, what it was waiting on in terms of the underlying resource, and what the object was it was waiting on for release of that resource. This information is referred to as SQL Server wait statistics and can be located in the sys.dm_os_wait_stats and sys.dm_os_waiting_tasks Dynamic Management Views as well as in the sys. dm_exec_session_wait_stats Dynamic Management View (new in SQL Server 2016). Each of these DMVs have a different take on the wait information collected:

  • dm_os_wait_stats: is the repository for cumulative wait information since the SQL Server service was restarted or the wait statistics were manually cleared. The waits are aggregated by wait type.

  • dm_os_waiting_tasks:provides information on what tasks are currently waiting on resources, again aggregated by wait type.

  • dm_exec_session_wait_stats: is the newest DMV for waits and it aggregates wait statistics in similar form to dm_os_wait_stats but also adds session_id as a level of aggregation so you can assess the wait types encountered by each active session on a SQL instance.link. For the sake of this discussion I want to focus on when the results of a query for wait statistics can have more insight provided with a subsequent query against dm_db_index_operational_stats so we’ll forgo a deep dive into the wait statistics querying for another article. Instead let’s use cumulative waits to present the case for the scope of this article. I cover this query in an article from 2015: Comprehensive Wait Statistics Query for SQL 2005 – 2016:

SET NOCOUNT ON;IF OBJECT_ID('tempdb..#dm_os_wait_stats','U') IS NOT NULL    DROP TABLE #dm_os_wait_stats;GOSELECT wait_type    , (wait_time_ms - signal_wait_time_ms) / 1000. AS wait_time_s    , waiting_tasks_count    , CASE waiting_tasks_count        WHEN 0 THEN 0        ELSE (wait_time_ms - signal_wait_time_ms) / waiting_tasks_count    END AS avg_wait_ms    , 100. * (wait_time_ms - signal_wait_time_ms) / SUM(wait_time_ms - signal_wait_time_ms) OVER ( ) AS pct    , ROW_NUMBER() OVER ( ORDER BY wait_time_ms DESC ) AS rnINTO #dm_os_wait_statsFROM sys.dm_os_wait_statsWHERE wait_type NOT IN (N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP',N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE',N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE',N'DBMIRROR_DBM_EVENT', N'DBMIRROR_EVENTS_QUEUE', N'DBMIRROR_WORKER_QUEUE',N'DBMIRRORING_CMD', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE',N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX',N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE', N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE',N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE', N'MEMORY_ALLOCATION_EXT', N'ONDEMAND_TASK_QUEUE',N'PREEMPTIVE_OS_LIBRARYOPS', N'PREEMPTIVE_OS_COMOPS', N'PREEMPTIVE_OS_CRYPTOPS',N'PREEMPTIVE_OS_PIPEOPS', N'PREEMPTIVE_OS_AUTHENTICATIONOPS',N'PREEMPTIVE_OS_GENERICOPS', N'PREEMPTIVE_OS_VERIFYTRUST',N'PREEMPTIVE_OS_FILEOPS', N'PREEMPTIVE_OS_DEVICEOPS',N'PWAIT_ALL_COMPONENTS_INITIALIZED', N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP',N'QDS_ASYNC_QUEUE',N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', N'REQUEST_FOR_DEADLOCK_SEARCH',N'RESOURCE_QUEUE', N'SERVER_IDLE_CHECK', N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP',N'SLEEP_DCOMSTARTUP', N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY',N'SLEEP_MASTERUPGRADED', N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK', N'SLEEP_TASK',N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT', N'SP_SERVER_DIAGNOSTICS_SLEEP',N'SQLTRACE_BUFFER_FLUSH', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'SQLTRACE_WAIT_ENTRIES',N'WAIT_FOR_RESULTS', N'WAITFOR', N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_HOST_WAIT',N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN',N'XE_DISPATCHER_WAIT', N'XE_LIVE_TARGET_TVF', N'XE_TIMER_EVENT',N'PREEMPTIVE_SP_SERVER_DIAGNOSTICS', N'PREEMPTIVE_HADR_LEASE_MECHANISM', N'TRACEWRITE',N'PREEMPTIVE_OS_WRITEFILEGATHER', N'PREEMPTIVE_OS_LOOKUPACCOUNTSID', N'CXPACKET');WITH Waits AS     (     SELECT wait_type, wait_time_s, waiting_tasks_count, avg_wait_ms, pct, rn    FROM #dm_os_wait_stats    )SELECT W1.wait_type    , CAST(W1.wait_time_s AS DECIMAL(12, 1)) AS wait_time_s    , W1.waiting_tasks_count     , CAST(W1.avg_wait_ms AS DECIMAL(12, 1)) AS avg_wait_ms    , CAST(W1.pct AS DECIMAL(12, 1)) AS pct    , CAST(SUM(W2.pct) AS DECIMAL(12, 1)) AS running_pctFROM Waits AS W1INNER JOIN Waits AS W2 ON W2.rn <= W1.rnGROUP BY W1.rn    , W1.wait_type    , W1.waiting_tasks_count     , W1.avg_wait_ms    , W1.wait_time_s    , W1.pctHAVING SUM(W2.pct) - W1.pct < 95 /* percentage threshold */ORDER BY W1.pct DESC;IF OBJECT_ID('tempdb..#dm_os_wait_stats','U') IS NOT NULLDROP TABLE #dm_os_wait_stats;GOSET NOCOUNT OFF;

When it comes to waits associated with locks and latches they come in three flavors: lock waits prefixed with “LCK_”, IO latch waits prefixed with “PAGEIOLATCH_”, and latch waits prefixed with “PAGELATCH_”. Latch and lock waits include the type of lock in the wait type’s name.

In the case of this instance of SQL Server we’re experiencing a high level or waits associated with latching and locking as can be observed in the highlighted results shown above. That only gives us part of the picture though. It sets us up to get a better diagnosis of our performance issues resulting from latch and lock time but doesn’t give us a source. That’s where sys.dm_db_index_operational_stats comes in.

 

Locking and Latching Awareness in the Operational Stats DMF

dm_db_index_operational_stats presents the following columns associated with locking and latching in Microsoft SQL Server that we’ll be considering:

  • row_lock_count

  • row_lock_wait_count

  • row_lock_wait_in_ms

  • page_lock_count

  • page_lock_wait_count

  • page_lock_wait_in_ms

  • index_lock_promotion_attempt_count

  • index_lock_promotion_count

  • page_latch_wait_count

  • page_latch_wait_in_ms

  • page_io_latch_wait_count

  • page_io_latch_wait_in_ms

Identifying indexes involved in high latch and lock waits follows a simple protocol:

  1. If you identify that lock or latch waits are one of the most intensive waits on your instance in terms of average wait time encountered move on to diagnosis of the database(s) involved the latch or lock activity then…

  2. Drill into the detail of the objects and indexes involved in the most-frequent locking then…

Let’s look at each step.

Step 1: Identify the Databases Involved in Latch and Lock Waits

The following query will isolate those databases, objects, and indexes with the most latching and locking activity. Since this example primarily is encountering latch waits we’ll use the IO latches version out of the three options below (depending upon highest wait type seen from the initial wait stats query):

SELECT TOP 10DB_NAME(database_id) AS database_name     , OBJECT_NAME(object_id, database_id) AS table_name     , index_id     , partition_number     , row_lock_count     , row_lock_wait_in_ms     , CASE row_lock_wait_countWHEN 0 THEN row_lock_wait_in_msELSE row_lock_wait_in_ms / row_lock_wait_countEND AS avg_row_lock_wait_in_ms     , page_lock_count , page_lock_wait_in_ms     , CASE page_lock_countWHEN 0 THEN page_lock_wait_in_msELSE page_lock_wait_in_ms / page_lock_countEND AS avg_page_lock_wait_in_ms     , page_latch_wait_count     , page_latch_wait_in_ms , CASE page_latch_wait_countWHEN 0 THEN page_latch_wait_in_msELSE page_latch_wait_in_ms / page_latch_wait_countEND AS avg_page_latch_wait_in_ms     , page_io_latch_wait_count     , page_io_latch_wait_in_ms , CASE page_io_latch_wait_countWHEN 0 THEN page_io_latch_wait_in_msELSE page_io_latch_wait_in_ms / page_io_latch_wait_countEND AS avg_page_io_latch_wait_in_msFROM sys.dm_db_index_operational_stats(NULL, NULL, NULL, NULL)WHERE row_lock_wait_in_ms > 0OR page_lock_wait_in_ms > 0OR page_latch_wait_in_ms > 0OR page_io_latch_wait_in_ms > 0AND database_id > 4ORDER BY page_io_latch_wait_in_ms DESC;/*ALTERNATELY SORT ON ONE OF THE FOLLOWING:page_latch_wait_in_ms DESC --when PAGELATCH_% is your highest wait(row_lock_wait_in_ms + page_lock_wait_in_ms) DESC --whe a lock wait type is your highest wait*/

Note that if you want to only drill into the specific columns associated with those waits in question you could alternately narrow the column list as such:

---LOCKINGSELECT TOP 3DB_NAME(database_id) AS database_name     , OBJECT_NAME(object_id, database_id) AS table_name     , index_id     , partition_number     , row_lock_count     , row_lock_wait_in_ms     , CASE row_lock_wait_countWHEN 0 THEN row_lock_wait_in_msELSE row_lock_wait_in_ms / row_lock_wait_countEND AS avg_row_lock_wait_in_ms     , page_lock_count , page_lock_wait_in_ms     , CASE page_lock_countWHEN 0 THEN page_lock_wait_in_msELSE page_lock_wait_in_ms / page_lock_countEND AS avg_page_lock_wait_in_msFROM sys.dm_db_index_operational_stats(NULL, NULL, NULL, NULL) ORDER BY (row_lock_wait_in_ms + page_lock_wait_in_ms) DESC;--PAGELATCHSELECT TOP 3DB_NAME(database_id) AS database_name     , OBJECT_NAME(object_id, database_id) AS table_name     , index_id     , partition_number     , page_latch_wait_count     , page_latch_wait_in_ms , CASE page_latch_wait_countWHEN 0 THEN page_latch_wait_in_msELSE page_latch_wait_in_ms / page_latch_wait_countEND AS avg_page_latch_wait_in_msFROM sys.dm_db_index_operational_stats(NULL, NULL, NULL, NULL) ORDER BY page_latch_wait_in_ms DESC;--PAGEIOLATCHSELECT TOP 3DB_NAME(database_id) AS database_name     , OBJECT_NAME(object_id, database_id) AS table_name     , index_id     , partition_number     , page_io_latch_wait_count     , page_io_latch_wait_in_ms , CASE page_io_latch_wait_countWHEN 0 THEN page_io_latch_wait_in_msELSE page_io_latch_wait_in_ms / page_io_latch_wait_countEND AS avg_page_io_latch_wait_in_msFROM sys.dm_db_index_operational_stats(NULL, NULL, NULL, NULL) ORDER BY page_io_latch_wait_in_ms DESC;

Using the last of these narrow queries above you’ll see that I’m focused on identification of the objects and indexes involved in the first set of columns followed by count metrics for all latch and lock columns. I want to make sure I’m limiting my results to just rows that have applicable latch and lock activity and then, since the instance is primarily reporting latch waits as the most frequent wait type encountered I sort in descending order on latch time.

That provides the following results:

You may ask why I’m performing an initial discovery just to narrow the results to a database. This is because in order to identify the underlying index(s) involved I’ll need to be able to join dm_db_index_operational_stats to sys.indexes. dm_db_index_operational_stats is server-scoped meaning that it’s results span the boundaries of the database – no matter what database you execute a query against this DMF from you’ll net the same results. The same can’t be said of queries against sys.indexes. This is a system view that is unique to each database. It only returns results for the indexes within the database it’s called from. In order to join dm_db_index_operational_stats to sys.indexes we need to do so fully-qualifying sys.indexes with the database name. Both index_id and object_id are the join columns between these objects and they’re not unique across all databases. This means we need to first identify the database so we can move on to step 2 where we get the index specifics because we’re now armed with the database name.

Step 2: Identify the Index Details for the Latching/Locking Issues

Depending on the type of latch or lock wait we’re trying to focus on we’d run the query below with one of the three options for sorting to get more information on the index(es) involved.  In keeping with our example let’s continue to look at latches as our focus – specifically IO latches understanding that we would branch out and use any of the other diagnostic queries pertaining to lock or latch waits depending upon what our most frequent wait type is when we analyze wait stats for performance tuning:

--FURTHER CLARITY OF OBJECT NAMESSELECT TOP 4DB_NAME(ixOS.database_id) AS database_name     , OBJECT_NAME(ixOS.object_id, ixOS.database_id) AS table_name     , I.name AS index_name , I.type_desc AS index_type     , ixOS.partition_number     , ixOS.page_io_latch_wait_count     , ixOS.page_io_latch_wait_in_ms , CASE ixOS.page_io_latch_wait_countWHEN 0 THEN ixOS.page_io_latch_wait_in_msELSE ixOS.page_io_latch_wait_in_ms / ixOS.page_io_latch_wait_countEND AS avg_page_io_latch_wait_in_msFROM sys.dm_db_index_operational_stats(NULL, NULL, NULL, NULL) AS ixOSINNER JOIN lifeboat.sys.indexes AS I ON I.index_id = ixOS.index_id AND I.object_id = ixOS.object_idORDER BY ixOS.page_io_latch_wait_in_ms DESC;/*ALTERNATELY SORT ON ONE OF THE FOLLOWING:page_latch_wait_in_ms DESC --when PAGELATCH_% is your highest wait(row_lock_wait_in_ms + page_lock_wait_in_ms) DESC --whe a lock wait type is your highest wait*/

At this point we now know that the lifeboat..Database_Files_History.PK_Database_Files_History_1 index is to blame for a major portion of our IO latch waits. From here it’s time to do targeted troubleshooting at the index level. By utilizing dm_db_index_operational_stats you went from knowing that IO latch waits were your most pressing concern using waits as a performance tuning tool to knowing what specific object(s) are contributing to this issue.

Conclusion

Identifying your top wait types is only half the battle. Identifying the objects that contribute to those waits then targeting tuning on those objects is the next, crucial step in the performance tuning process. Thanks to the insights offered in dm_db_index_operational_stats you can identify those contributing tables and indexes easily.

About the Author

Tim Ford

Owner, SQL Cruise

http://thesqlagentman.com/

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like