Ezhil Arasu
Ezhil Arasu QA Architect at Encora Inc

How I quickly solved Azure SQL TempDB File contention with DBmarlin

How I quickly solved Azure SQL TempDB File contention with DBmarlin

This is a guest blog by Ezhil Arasu, QA Architect at Encora Inc. describing his recent experience of tuning Azure SQL.

Here, I explain how to find TempDB File contention (SQL SERVER) with or without DBmarlin - a lightweight smart monitoring tool for databases.

  • Environment : SQL Server 2019
  • Application type : .Net
  • Tools : JMeter, Grafana, DBmarlin and SQLQueryStress

In a recent upgrade, we changed the infra with a new SQL server, replacing the old one with exactly the same configuration in terms of CPU cores, CPU logical processors, memory, and storage in the cloud. When we ran our load tests with the defined load, we were able to see more failures/ latency when we exceeded a certain limit of Vusers.

Multiple iterations of run confirmed the same and Azure infra monitoring didn’t hint at anything useful. CPU/Memory Utilisation of App Server looks underutilised and it hints at something to investigate at the database end.

The below metrics are from the SQL server, which show few read and write operations during the load test time. Also, the CPU utilisation is around 80%, but doesn’t exceed above 80% consistently.

content/DBmarlin/azure-cpu+disk-metrics.png

DBmarlin for SQL Server Wait Times

I started looking at the DB side, using DBmarlin to monitor for the same period, and discovered interesting findings. DBmarlin helped much in the case of SQL server wait time statistics, which tell you how long DB is spending on specific waits.

content/DBmarlin/dbmarlin-top-waits.png

You can see that for the specific duration, most of the time DB was waiting for PAGELATCH_UP followed by running.

Now that you know a little bit about SQL server wait time analysis, you can skip to the next heading.

For beginners I will explain: usually a SQL server will have running, runnable and sleep states. When an app server is trying to execute a specific query, a SQL server DB engine will assign a thread to serve for this purpose. If no threads are available in the SQL server thread pool, then it denotes a CPU contention in the SQL server and it will be in the runnable state.

The ideal case is that it will be assigned with a thread. Now, the thread will take the query and parse it and it will look for the data to be read from the disk /buffer. The SQL server engine will put this thread in a suspended state and look for the data from the disk/buffer. Once the data is read from DB pages, then it will notify the thread (which is in a suspended state) that data is ready for processing and the thread will be moved to a runnable queue.

If any threads are available, they will be immediately moved to the running queue, data processing occurs and the requested data will be provided to the application server and the thread will be freed.

A SQL server keeps track of where the thread is spending time (suspended, runnable, running) and the wait types in the case of a suspended queue. These metrics will be stored in the DMV (dynamic management views) and can be retrieved by simple queries against the DMV. This data will reside in the DMV until the next SQL server engine restart.

The sys.dm_os_wait_stats dynamic management view can be used to obtain information about the wait type details. This view holds cumulative statistics about the wait types until the database engine is restarted or the wait statistics cleared manually.

Sample code from sqlhacks.com is below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
SELECT TOP 10 wait_type
AS [Wait Type], 
                wait_time_ms/1000.0
AS [Total Wait Time (second)], 
                (wait_time_ms-signal_wait_time_ms)/1000.0
AS [Resource Wait Time (second)], 
                signal_wait_time_ms/1000.0
AS [Signal Wait Time (second)], 
                waiting_tasks_count
AS [Wait Count]
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN
(N'CLR_SEMAPHORE', 
    N'LAZYWRITER_SLEEP', 
    N'RESOURCE_QUEUE', 
    N'SQLTRACE_BUFFER_FLUSH', 
    N'SLEEP_TASK', 
    N'SLEEP_SYSTEMTASK', 
    N'WAITFOR', 
    N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', 
    N'CHECKPOINT_QUEUE', 
    N'REQUEST_FOR_DEADLOCK_SEARCH', 
    N'XE_TIMER_EVENT', 
    N'XE_DISPATCHER_JOIN', 
    N'LOGMGR_QUEUE', 
    N'FT_IFTS_SCHEDULER_IDLE_WAIT', 
    N'BROKER_TASK_STOP', 
    N'CLR_MANUAL_EVENT', 
    N'CLR_AUTO_EVENT', 
    N'DISPATCHER_QUEUE_SEMAPHORE', 
    N'TRACEWRITE', 
    N'XE_DISPATCHER_WAIT', 
    N'BROKER_TO_FLUSH', 
    N'BROKER_EVENTHANDLER', 
    N'FT_IFTSHC_MUTEX', 
    N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 
    N'DIRTY_PAGE_POLL', 
    N'SP_SERVER_DIAGNOSTICS_SLEEP')
ORDER BY wait_time_ms-signal_wait_time_ms DESC;

This will provide results like below:

1
2
3
4
5
6
7
8
9
10
11
Wait Type	                        TotalWaitTime (s)	ResourceWaitTime (s)        Signal Wait Time (s)    Wait Count
QDS_PERSIST_TASK_MAIN_LOOP_SLEEP	14700.292000	        14700.285000	            0.007000	            246
PAGEIOLATCH_SH	                        564.933000	        563.587000	            1.346000	            47088
WRITELOG	                        149.947000	        149.598000	            0.349000	            6445
PAGEIOLATCH_EX	                        49.911000	        49.896000	            0.015000	            4486
PWAIT_ALL_COMPONENTS_INITIALIZED	31.028000	        31.028000	            0.000000	            5
PREEMPTIVE_OS_DEVICEOPS	                27.685000	        27.685000	            0.000000	            8041
IO_COMPLETION	                        21.395000	        21.380000	            0.015000	            4020
MEMORY_ALLOCATION_EXT	                16.901000	        16.901000	            0.000000	            7509349
LCK_M_U	                                12.684000	        12.683000	            0.001000	            26
WAIT_XTP_HOST_WAIT	                11.446000	        11.446000	            0.000000	            6

The same data can be visualised with DBmarlin as a pie chart. Refer to the images above.

DBmarlin Identifies the Problem

Now I know that DB worker threads spent more time on the suspended queue with most of the wait time occurring in PAGELATCH_UP wait time. With the help of DBmarlin, I can find which query is taking a longer time in this wait type.

content/DBmarlin/sql-pagelatch_up.png

This is one of the queries in a list that is causing a PAGELATCH_UP wait type event.

The PAGELATCH_UP wait type accumulates while page latches are in update mode. Update mode is not just for UPDATE operations; it’s used when the SQL needs to read and then modify a page. Before modification, it will place an update on the data when ready to upgrade this to an exclusive latch. If this wait type is high, then the common problem of PFS contention may be present alongside the tempdb related contention.

The rule of thumb is one tempDB file for four CPU cores.

When we created more tempDB files according to our infrastructure then ran a similar load, this wait type disappeared and we also experienced good response times.

The Azure snapshot below shows much improved DB performance. Look at the DB reads/writes per second below which are 2-3x higher than previously while CPU is down by more than 30%.

content/DBmarlin/azure-cpu+disk-metrics2.png

DBmarlin shows that waits for PAGELATCH_UP have disappeared after changing the number of TempDB files according to the rule of thumb above.

content/DBmarlin/dbmarlin-top-waits2.png

DBmarlin Knowledge base

Something new which the DBmarlin team just added in their 1.9 release is their Knowledge base which would have helped me find the answers I needed even faster. The knowledge base is a free resource containing the descriptions of over 820 wait events across SQL Server and other DBmarlin supported databases. For example in my case I could have clicked through from the PAGELATCH_UP in the product and been taken straight to the page below.

In conclusion

I am happy that I found DBmarlin when I did. It helped me to get to the root cause of the problem quickly. I really like the nice clear graphical view which is so much easier to work with than running queries against the SQL Sever DMVs directly. I would definitely recommend DBmarlin for anyone doing performance testing and especially those without deep database performance expertise as this tool will make your job a lot easier.

dbmarlin-trial-offer

This was originally published on LinkedIn here. https://www.linkedin.com/pulse/performance-engineering-tempdb-file-contention-dbmarlin-ezhil-arasu