Availability Group DMVs Reporting Incorrect Values

In my opinion one of the key features of SQL Server 2016 is the rebuilt and optimized log redo mechanism for AlwaysOn Availability Groups. Check out the many new AG features here. Check out my posts here and here to learn about how Availability Groups move data.

Early last week I was conducting a load test using SQL Server 2016 and wanted to compare the performance of the log redo thread with that of SQL Server 2014. To establish a performance baseline for 2014, I constructed a load test using a heavy insert workload on the primary. To measure that workload I used the following script to pull database replication performance data from sys.dm_hadr_database_replica_states.

SELECT r.replica_server_name
      , DB_NAME(rs.database_id) AS [DatabaseName]
      , rs.log_send_queue_size
      , rs.log_send_rate
      , rs.redo_queue_size
      , rs.redo_rate
FROM   sys.dm_hadr_database_replica_states rs
        JOIN sys.availability_replicas r ON r.group_id = rs.group_id
             AND r.replica_id = rs.replica_id
 WHERE  DB_NAME(rs.database_id) = 'TestAG1'
 ORDER BY r.replica_server_name;

Oh NO! The DMV data is incorrect!!!

What I found was alarming: the DMV is reporting incorrect values for log_send_rate and redo_rate! This test was conducted on SQL Server 2014 SP1 build 12.00.4100. This was truly unfortunate because I found this issue on a Monday and I was giving a talk on Performance Monitoring Availability Groups THAT SATURDAY at SQL Saturday #484 – Chicago. So off I went to find a new data source for the presentation and completely rewrite all the demos. The first thing I did was check Connect, and yep, there it was. There’s also a mention of a fix in CU5 here. But I was already on a newer build than that, SP1. So I installed 2014 SP1 CU5…and yup…still bad data.

Getting the right data from the right place

So, how did I solve this little issue? Perfmon! There are two performance objects in perfmon, SQLServer:Availability Replica and SQLServer:Database Replica, whose counters hold the same data as the DMVs, so I’ll query those for the load test. I went about modifying the script to pull the data from perfmon so I could have accurate data for the talk that Saturday!!!
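To see which counters those two objects expose on your own instance, a query along these lines is a reasonable starting point. It is a sketch and not part of the original demo: the LIKE patterns are my assumption to cover named instances, where the object name prefix is MSSQL$<instance name> instead of SQLServer.

```sql
-- List every counter exposed under the two AG-related perfmon objects.
-- The name columns are fixed-width nchar, so RTRIM tidies the output.
SELECT  RTRIM([object_name])  AS [object_name]
      , RTRIM(counter_name)   AS counter_name
      , RTRIM(instance_name)  AS instance_name
      , cntr_value
      , cntr_type
FROM    sys.dm_os_performance_counters
WHERE   [object_name] LIKE '%:Availability Replica%'
        OR [object_name] LIKE '%:Database Replica%'
ORDER BY [object_name], counter_name, instance_name;
```

The cntr_type column is worth a look because it tells you how cntr_value has to be interpreted before you can trust the number.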

The technique is pretty straightforward. You can query perfmon using T-SQL from the sys.dm_os_performance_counters DMV. But to do so you need to know that there are different types of counters and those types report their data differently. The specific counters I want to query are called “delta counters” as described by Jason Strate here. The counters we’re querying, specifically ‘Redone Bytes/sec’ and ‘Log Bytes Received/sec’, are stored in the DMV as ever-increasing totals rather than ready-made per second rates. So I take a sample, wait one second, take another sample and difference the two values, and that difference is the actual value of what happened in that interval.
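The one second wait is a convenience, not a requirement. Here is a minimal sketch of the same sample-and-difference idea over a longer interval, dividing by the measured elapsed time. The five second delay, the variable names and the LIKE pattern are my own choices for illustration, and TestAG1 is the test database used in this post.

```sql
DECLARE @v1 bigint, @v2 bigint, @t1 datetime2, @t2 datetime2;

-- First sample: raw counter value plus the time it was read
SELECT @v1 = cntr_value, @t1 = SYSDATETIME()
FROM   sys.dm_os_performance_counters
WHERE  [object_name] LIKE '%:Database Replica%'
       AND instance_name = 'TestAG1'
       AND counter_name = 'Redone Bytes/sec';

WAITFOR DELAY '00:00:05';

-- Second sample
SELECT @v2 = cntr_value, @t2 = SYSDATETIME()
FROM   sys.dm_os_performance_counters
WHERE  [object_name] LIKE '%:Database Replica%'
       AND instance_name = 'TestAG1'
       AND counter_name = 'Redone Bytes/sec';

-- Bytes redone during the interval, normalized to a per second rate
SELECT (@v2 - @v1) * 1000.0 / NULLIF(DATEDIFF(MILLISECOND, @t1, @t2), 0) AS [redone_bytes_per_sec];
```

A longer interval smooths out bursts, at the cost of a slower query. Run it on the replica whose redo you want to measure, since the counters are local to each instance.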

So, we now have our own calculated log_send_rate and redo_rate. Now we need to work them back into the monitoring code that I already have. We also need to take into account where this data is being sampled from: the perfmon counters only describe the instance the query runs on, so there’s a CASE statement in there that calculates latency only for the local replica. Below is the code to pull the correct send and redo rates from perfmon, along with additional Availability Group performance data such as send queue size and redo queue size.

Calculating Availability Group Latency

With the correct values being reported, we can also calculate send and redo latency. In other words, how far behind the primary a replica is, measured in time. We do this by dividing the queue size by the rate for both the send queue and the redo queue. This is a pretty cool value, basically telling you how long it will take your replica to catch up to the primary at the current rate.
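To make the arithmetic concrete, here is a small worked example with made-up numbers. Both values are hypothetical and chosen only so the division is easy to follow: the queue size is in KB (as the DMV reports it) and the rate is in bytes per second (as derived from perfmon), so the rate is converted to KB before dividing.

```sql
-- Hypothetical values, for illustration only
DECLARE @redo_queue_size_kb bigint = 71680;    -- 70MB waiting to be redone, in KB
DECLARE @redo_rate_bytes    float  = 7340032;  -- 7MB/sec redo rate, in bytes/sec

-- 7340032 bytes/sec / 1024 = 7168 KB/sec; 71680 KB / 7168 KB/sec = 10 seconds
SELECT CONVERT(DECIMAL(10,2), @redo_queue_size_kb / (@redo_rate_bytes / 1024.0)) AS [redo_latency - sec];
```

With these numbers the replica is 10 seconds behind, assuming the redo rate holds steady and no new log arrives in the meantime.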

DECLARE @redo1 bigint, @redo2 bigint, @redo_rate float, @send1 bigint, @send2 bigint, @send_rate float

--First sample of each counter (on a named instance the object name starts with MSSQL$<instance name> rather than SQLServer)
SET @redo1 = (SELECT cntr_value FROM sys.dm_os_performance_counters
				WHERE [object_name] = 'SQLServer:Database Replica' and instance_name = 'TestAG1' and counter_name = 'Redone Bytes/sec')

SET @send1 = (SELECT cntr_value FROM sys.dm_os_performance_counters
				WHERE [object_name] = 'SQLServer:Database Replica' and instance_name = 'TestAG1' and counter_name = 'Log Bytes Received/sec')

--Wait one second so the difference between the two samples is a per second value
WAITFOR DELAY '00:00:01'

--Second sample of each counter
SET @redo2 = (SELECT cntr_value FROM sys.dm_os_performance_counters
				WHERE [object_name] = 'SQLServer:Database Replica' and instance_name = 'TestAG1' and counter_name = 'Redone Bytes/sec')

SET @send2 = (SELECT cntr_value FROM sys.dm_os_performance_counters
				WHERE [object_name] = 'SQLServer:Database Replica' and instance_name = 'TestAG1' and counter_name = 'Log Bytes Received/sec')

--Difference of the two samples is the rate in bytes/sec
SET @redo_rate = (SELECT @redo2 - @redo1)
SET @send_rate = (SELECT @send2 - @send1)

SELECT  r.replica_server_name
      , DB_NAME(rs.database_id) AS [DatabaseName]
      , rs.log_send_queue_size
      , rs.log_send_rate [log_send_rate - dmv]
	  , @send_rate / 1024.0 [log_send_rate KB - perfmon]
	  , CASE WHEN rs.is_local != 1 THEN NULL ELSE (CONVERT(DECIMAL(10,2), rs.log_send_queue_size / CASE WHEN @send_rate = 0 THEN 1 ELSE @send_rate / 1024.0 END)) END [send_latency - sec] --Limit to two decimals, queue is KB, convert @send_rate to KB
      , rs.redo_queue_size
      , rs.redo_rate [redo_rate - dmv]
	  , @redo_rate / 1024.0 [redo_rate KB - perfmon]
	  , CASE WHEN rs.is_local != 1 THEN NULL ELSE (CONVERT(DECIMAL(10,2), rs.redo_queue_size / CASE WHEN @redo_rate = 0 THEN 1 ELSE @redo_rate / 1024.0 END)) END [redo_latency - sec] --Limit to two decimals, queue is KB, convert @redo_rate to KB
FROM    sys.dm_hadr_database_replica_states rs
        JOIN sys.availability_replicas r ON r.group_id = rs.group_id
                                            AND r.replica_id = rs.replica_id
WHERE   DB_NAME(rs.database_id) = 'TestAG1'
ORDER BY r.replica_server_name
GO

A quick demo

With a small workload running, generating about 7MB/sec of transaction log as reported by the counter SQLServer:Databases – Log Bytes Flushed/sec, here’s what we see returned from the query defined above on a system running SQL 2014 SP1 CU5.

From the result set you can see that the DMV is reporting about **140MB/sec** for log_send_rate to both replicas, but the perfmon counter is reporting around 7MB/sec for that same value. I confirmed the usage by also looking at actual network utilization.

The important message here is to validate your data. In conducting those load tests on 2014, I knew something wasn’t right with the data when the send rate was 112Gb/sec! The code above shows both the DMV value and the value derived from perfmon.

Good luck, and happy monitoring!

Want to chat about SQL Server or Availability Groups? Feel free to drop me a line at [aen@centinosystems.com][9]

[9]: mailto:aen@centinosystems.com?subject=AGs%20Reporting%20Incorrect%20Values
