Best way to monitor high worker thread on SQL Server

It is common to see high worker count issue on the production server and it can have a range of negative impacts on SQL Server performance and stability, potentially leading to degraded user experience and even downtime. Some of the common reasons for high worker thread issues in SQL Server.

  • Blocking
  • Sudden workload spike
  • Resource contention

Problem Statement:

When it comes to monitoring worker thread count in SQL Server, most DBAs turn to the sys.dm_os_schedulers DMV, which is the most commonly used and reliable option. However, they often prefer to examine the sum of column current_workers_count from the DMV over a certain period, which may not accurately indicate how long the server was experiencing a high worker thread problem.

Monitor high worker thread count:

To simulate or monitor this, let’s capture worker thread count on SQL Server by executing below script. It will create the following objects inside the system database tempdb.

  • Table tbl_workerthreadcount_info is going to store the historical data.
  • SP sp_captureworkerthreadcount_info is going to capture required worker thread data.
  • Job Monitor_high_worker_thread is going to run every 10 seconds to capture the historical data.
USE tempdb GO IF OBJECT_ID('tempdb.dbo.tbl_workerthreadcount_info')>0 DROP TABLE tempdb.dbo.tbl_workerthreadcount_info IF OBJECT_ID('tempdb.dbo.sp_captureworkerthreadcount_info')>0 DROP PROCEDURE dbo.sp_captureworkerthreadcount_info GO --Create table CREATE TABLE [tempdb].[dbo].[tbl_workerthreadcount_info]( [capturedatetime] [datetime] NOT NULL, [total_current_workers_count] [int] NULL, [total_current_tasks_count] [int] NULL, [total_active_workers_count] [int] NULL ) ON [PRIMARY] GO --Create SP CREATE PROCEDURE [dbo].[sp_captureworkerthreadcount_info] AS BEGIN SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED INSERT INTO [tempdb].[dbo].[tbl_workerthreadcount_info] SELECT GETDATE() capturedatetime, sum(current_workers_count) total_current_workers_count , sum(current_tasks_count) total_current_tasks_count, sum(active_workers_count) total_active_workers_count FROM sys.dm_os_schedulers END GO --Create Job USE [msdb] GO IF EXISTS (SELECT job_id FROM msdb.dbo.sysjobs_view WHERE name = N'Monitor_high_worker_thread') EXEC msdb.dbo.sp_delete_job @job_name=N'Monitor_high_worker_thread' --, @delete_unused_schedule=1 GO /****** Object: Job [Monitor_high_worker_thread] Script Date: 31/3/2023 3:44:07 pm ******/ BEGIN TRANSACTION DECLARE @ReturnCode INT SELECT @ReturnCode = 0 /****** Object: JobCategory [[Uncategorized (Local)]] Script Date: 31/3/2023 3:44:07 pm ******/ IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1) BEGIN EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]' IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback END DECLARE @jobId BINARY(16) EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'Monitor_high_worker_thread', @enabled=1, @notify_level_eventlog=0, @notify_level_email=0, @notify_level_netsend=0, @notify_level_page=0, @delete_level=0, @description=N'No description available.', @category_name=N'[Uncategorized (Local)]', @owner_login_name=N'sa', @job_id = @jobId OUTPUT IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback /****** Object: Step [01_CaptureWorkerthreadCount] Script Date: 31/3/2023 3:44:07 pm ******/ EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'01_CaptureWorkerthreadCount', @step_id=1, @cmdexec_success_code=0, @on_success_action=3, @on_success_step_id=0, @on_fail_action=2, @on_fail_step_id=0, @retry_attempts=0, @retry_interval=0, @os_run_priority=0, @subsystem=N'TSQL', @command=N'EXEC [dbo].[sp_captureworkerthreadcount_info]', @database_name=N'tempdb', @flags=0 IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback /****** Object: Step [02_ArchivingHistoricalData] Script Date: 31/3/2023 3:44:07 pm ******/ EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'02_ArchivingHistoricalData', @step_id=2, @cmdexec_success_code=0, @on_success_action=1, @on_success_step_id=0, @on_fail_action=2, @on_fail_step_id=0, @retry_attempts=0, @retry_interval=0, @os_run_priority=0, @subsystem=N'TSQL', @command=N'DELETE FROM [tempdb].[dbo].[tbl_workerthreadcount_info] WHERE CaptureDateTime<=DATEADD(WEEK,-2,GETDATE()); ', @database_name=N'tempdb', @flags=0 IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1 IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'CaptureWorkerthreadCount_Schd', @enabled=1, @freq_type=4, @freq_interval=1, @freq_subday_type=2, @freq_subday_interval=10, @freq_relative_interval=0, @freq_recurrence_factor=0, @active_start_date=20230331, @active_end_date=99991231, @active_start_time=0, @active_end_time=235959, @schedule_uid=N'73228b08-af4a-4d24-a888-5cd62a195714' IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)' IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback COMMIT TRANSACTION GOTO EndSave QuitWithRollback: IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION EndSave: GO

 

Generate stress (workload)

Now, let’s download and restore the database WideWorldImporters backup and create the below stored procedure.

USE WideWorldImporters
GO
CREATE OR ALTER PROC [dbo].[sel_ByCityName_Ostrees] 
AS
BEGIN
SELECT TOP 1 [CityID]
      ,[CityName]
      ,[StateProvinceID]
      ,[Location]
      ,[LatestRecordedPopulation]
      ,[LastEditedBy]
FROM [WideWorldImporters].[Application].[Cities]
order by NEWID()
END

To generate the load on the SQL Server instance, let’s use OStress tool to generate 500 concurrent sessions and each session is going to run the SP 500 times.

ostress -S"DESKTOP-BADCLEC\INST2K19" -E -Q"[sel_ByCityName_Ostrees];" -n500 -r500 -q -dWideWorldImporters

Collected Data Analysis:

Let’s analyze the collected the high worker thread count data. 

  • At 16:56:36.665, the OStress tool started creating worker threads. We can see from the snapshot below that the total_current_workers_count and other values went high on the spot on the SQL Server which indicated high worker thread problem.
  • At 17:04:27.266, the workload was cancelled to avoid the high worker thread problem on the server.
  • After the workload spike, the total_current_workers_count went down from 1000+ to 200+ worker thread count almost 1 hour later, even though the workload was cancelled within 10 mins.

Take Away:

A workload spike or contention will increase the current_workers_count, but it doesn’t go down on the spot after the workload spike or contention gets settled down. So, it isn’t recommended to monitor total_current_workers_count instead, we should monitor total_active_workers_count to know how long the server was experiencing a high worker thread problem.

Workers are allocated to the scheduler when there is work to be done, and the number of workers increases when there is a sudden surge in workload. As the workload reduces, workers become idle and wait for new tasks or requests to be assigned to them. General observation – workers will wait for ~15 minutes in an idle state and start to ramp down later. This allows us to have hot workers for incoming requests.

  • Active worker = bound to a task and executing a request.
  • Current worker count = total workers associated with scheduler.
  • Current task count = total tasks bound to workers or in the scheduler queue waiting to be bound to a worker.

Tasks are not an indicator of workers.  Task can be queued to the scheduler as the scheduler has a limited worker level.

  • Worker pool for scheduler = Workers waiting for work + workers bound to tasks
  • Task pool for scheduler = Tasks bound to workers + tasks queued to the scheduler (waiting for worker)

In general, you take the system’s max worker setting and divide by the number of visible schedulers. This is the target for the worker pool associated with an individual scheduler.

A connection is separate from a worker. The worker is bound to a task when it is considered active. A connection that is active has a pending request.

  • New connect request – Binds to task which binds to worker and performs the login.
  • Login complete – Connection remains open but worker and task returned to scheduler pool and can be used by other requests.
  • Connection Pool (.net) – Has active connection but no active request, does not consume worker or task for SQL Server.
  • Connection pool sends a request – Bound to new worker and task for the lifetime of the request. When a query is completed, the worker and task return to the scheduler pool.

Happy learning!

Leave a comment

Your email address will not be published. Required fields are marked *