Using Extended Events to Visualize Availability Group Replication Internals

SQL Server 2014 Service Pack 2 was recently released by Microsoft, and there are a ton of great new features and enhancements in this release. This isn’t just a collection of bug fixes…there’s some serious value in this Service Pack. Check out the full list here. One of the key things added in this Service Pack is an enhancement of the Extended Events for AlwaysOn Availability Group replication.

Why are the new Availability Group Extended Events interesting?

If you’ve used Availability Groups in production systems with high transaction volumes, you know that replication latency can impact your availability. If you want to brush up on that, check out our blog posts on AG Replication Latency, Monitoring for replication latency, and issues with the DMVs when monitoring. These new Extended Events add insight at nearly every point inside your Availability Group’s replication. More importantly, they also include duration, so we can use them to pinpoint latency inside our Availability Group replication.

Here’s the list of the new Extended Events and their descriptions:

hadr_log_block_group_commit – Log block group commit processing duration
log_block_pushed_to_logpool – Log block start push to log pool
log_flush_start – Asynchronous log write start
hadr_log_block_compression – Log block compression processing duration
hadr_capture_log_block – Primary has captured a log block
hadr_capture_filestream_wait
ucs_connection_send_msg – UCS transport connection sent message
hadr_log_block_send_complete – After a log block message has been sent. This event is only used for fail points
log_flush_complete – Reports that synchronous send is complete
hadr_receive_harden_lsn_message – When a new hardened LSN from the secondary
hadr_db_commit_mgr_harden – Transaction commit harden result from Hadron Commit management
hadr_transport_receive_log_block_message – Fires when we’re receiving new log block message
hadr_log_block_decompression – Log block decompression processing duration
hadr_apply_log_block – Secondary is going to append a log block to the log
hadr_send_harden_lsn_message – Crafting a message to send containing a new hardened LSN on a secondary. Test only
hadr_lsn_send_complete – After an LSN message has been sent.

The source for this list and its descriptions is Amit Banerjee’s HADR Virtual Chapter Talk and KB3173156.

Using Extended Events to visualize Availability Group Replication Internals

Using these Extended Events, we’re able to get some insight into the replication internals of our Availability Groups. As I described above, we can use them to measure the duration of each event and sift out performance issues in our replication. But we can also use them to get a better understanding of what goes into Availability Group replication. Let’s see how.

Here are the steps I used to generate the chart below:

Add an Event Session with all of the new Events above to the primary and secondary replicas of a two-replica synchronous Availability Group
Then with the sessions running, insert only one row on the primary replica. This will cause the AG to replicate the data change event and trigger the new Extended Events on both the primary and secondary replicas
Stop the Event Sessions
Open the Event file in SSMS and View Target Data on both the primary and secondary replicas
Order the Event data by timestamp
Merge the event data together based on the timestamp (I did this in a spreadsheet)
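
The ordering and merging steps above can also be sketched in Python. This is only a minimal illustration, not how the chart was built (I used a spreadsheet). It assumes each session’s events have already been exported as (timestamp, event name) pairs in the format shown, and that the clocks on the two replicas are closely synchronized. The event names come from the list above, but the timestamps are invented for the example.

```python
from datetime import datetime
from heapq import merge

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # assumed export format, adjust to match yours

def tag(events, replica):
    """Parse timestamps, label each event with its replica, and order by timestamp."""
    return sorted(
        (datetime.strptime(ts, TIMESTAMP_FORMAT), replica, name) for ts, name in events
    )

def merge_timelines(primary_events, secondary_events):
    """Interleave two per-replica event lists into a single timeline."""
    return list(merge(tag(primary_events, "primary"), tag(secondary_events, "secondary")))

# Hypothetical sample data: the event names are real, the timestamps are invented.
primary = [
    ("2016-07-20 10:00:00.000100", "hadr_capture_log_block"),
    ("2016-07-20 10:00:00.000250", "ucs_connection_send_msg"),
    ("2016-07-20 10:00:00.001900", "hadr_receive_harden_lsn_message"),
]
secondary = [
    ("2016-07-20 10:00:00.000700", "hadr_transport_receive_log_block_message"),
    ("2016-07-20 10:00:00.001400", "hadr_send_harden_lsn_message"),
]

timeline = merge_timelines(primary, secondary)
for stamp, replica, name in timeline:
    print(stamp.time(), replica.ljust(9), name)
```

If the replica clocks drift, the interleaving between primary and secondary events can be misleading, so treat the cross-replica ordering as approximate.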

With this list, we have the entire sequence of Events for that single insert as it replicates from the primary to the secondary, seen from the perspective of both replicas. I took that information and made the following graphic for you, visualizing what it takes to replicate data in a synchronous Availability Group.

## AG Synchronous Replication Internals

Figure 1: Availability Group Replication Extended Events

From Figure 1, let’s walk through the events…

On the primary

The first event that fires is hadr_log_block_group_commit starting things off. It’s my understanding that this initializes the needed code and program state for AG replication
Then up next is log_block_pushed_to_logpool. This fires when a log block is copied to the log pool. The log pool is a special memory location used to store log blocks that may need to be read again. Since it’s in memory, it keeps the readers from doing a physical IO on the transaction log
Then log_flush_start fires when log blocks are flushed to disk, which is normal write-ahead logging
Once the log block has been copied into the log pool and flushed to disk, it’s read from the log pool and compressed, then the hadr_log_block_compression event fires
After it’s compressed, the AG “captures” the log block for replication to the secondary with hadr_capture_log_block
The primary then builds and sends the message to the secondary, and ucs_connection_send_msg fires
Once the message is sent the hadr_log_block_send_complete event fires
We mark the synchronous send complete with log_flush_complete

On the secondary

On the other side of the fence, basically the inverse occurs. We mark the receipt of the message with hadr_transport_receive_log_block_message
The log block is decompressed and hadr_log_block_decompression fires
Then the log block is appended to the local log buffer and hadr_apply_log_block fires
Then the log block is copied to the local log pool and log_block_pushed_to_logpool fires
Next, the asynchronous log write happens and log_flush_start marks the start of that
And once the asynchronous log write finishes, log_flush_complete fires
Once the log block is hardened locally, we need to craft a message to send to the primary indicating that the LSN is hardened, and hadr_send_harden_lsn_message fires
ucs_connection_send_msg then fires when the message is sent to the primary
hadr_lsn_send_complete fires, marking the end of sending the message

Back on the primary

hadr_receive_harden_lsn_message fires, marking that the secondary has acknowledged receipt of the sent LSN
And to wrap it all up hadr_db_commit_mgr_harden marks the end of the log block replication for this sequence
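
To see where time goes in a sequence like this one, one option is to measure the gap between each pair of adjacent events on a merged timeline. The Python sketch below is a hedged illustration of that idea. It assumes the events are available as (timestamp, event name) pairs, and the microsecond offsets are invented for the example rather than measured.

```python
from datetime import datetime, timedelta

def adjacent_gaps(timeline):
    """Return (earlier_event, later_event, elapsed_microseconds) for each adjacent pair."""
    ordered = sorted(timeline)
    return [
        (earlier[1], later[1], (later[0] - earlier[0]) // timedelta(microseconds=1))
        for earlier, later in zip(ordered, ordered[1:])
    ]

start = datetime(2016, 7, 20, 10, 0, 0)  # hypothetical capture time
# (timestamp, event name); the offsets in microseconds are invented for illustration.
timeline = [
    (start + timedelta(microseconds=offset), name)
    for offset, name in [
        (0, "hadr_capture_log_block"),
        (150, "ucs_connection_send_msg"),
        (600, "hadr_transport_receive_log_block_message"),
        (1300, "hadr_send_harden_lsn_message"),
        (1800, "hadr_receive_harden_lsn_message"),
    ]
]

gaps = adjacent_gaps(timeline)
slowest = max(gaps, key=lambda gap: gap[2])
print(slowest)
```

With this made-up data, the largest gap sits between receiving the log block and sending the hardened LSN back, which is the kind of hot spot you would then investigate on the secondary.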

A few important notes…

Many of these Extended Events fire at different points and multiple times during replication under different conditions. There’s a field called mode, which is an integer value, that marks various conditions for the events. I have yet to decode what each value of mode is for all of the events but will update this blog as I get some insight.
It’s very important to note that this example highlights an insert into an Availability Group with a primary and only one secondary configured in synchronous Availability Mode. The code path and the subsequent Extended Events that fire can change based on Availability Mode (sync/async), replication health, and many other conditions that can impact replication.
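
One way to start making sense of the mode field is to tally how often each event fires with each mode value during a capture. The Python sketch below assumes the event data has been exported as (event name, mode) pairs. The mode values shown are invented placeholders, not decoded meanings.

```python
from collections import Counter

def tally_modes(events):
    """Count how many times each (event name, mode) pair appears."""
    return Counter((name, mode) for name, mode in events)

# Hypothetical sample: the mode values are invented placeholders.
events = [
    ("hadr_log_block_send_complete", 1),
    ("hadr_log_block_send_complete", 1),
    ("hadr_log_block_send_complete", 2),
    ("log_flush_start", 1),
]

counts = tally_modes(events)
for (name, mode), count in sorted(counts.items()):
    print(name, mode, count)
```

Comparing these tallies across captures taken under different conditions (sync versus async, for example) may help narrow down what each mode value represents.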

I hope that I’ve shown you some interesting concepts that you can use to help troubleshoot your Availability Group replication. This post highlights how you can use Extended Events to get some insight into the internals of Availability Group replication.

If you have any questions about this or your Availability Group configuration or performance please feel free to email me at aen@centinosystems.com

Please follow me on Twitter @nocentino to keep up with new blog posts

References

Great references on the internals of the log pool – https://sqlactions.com/2014/03/31/how-it-works-logpool/

HADR Virtual Chapter – http://hadrvc.sqlpass.org/Home.aspx?EventID=5304