Availability Groups

Walking Through an Unplanned Failover: SQL Server Availability Groups on Kubernetes

In my planned failover walkthrough, I showed what happens when you deliberately move the primary role to another replica. That’s the easy case. Now I want to show what happens when the primary pod just disappears unexpectedly, like during a node failure or a container crash. No graceful shutdown, no demotion, just gone.

I ran two test scenarios, each cycling the primary role across all three pods by force-deleting the current primary three times in a row. First, a 5GB TPC-C database idle. Then, that same 5GB database under sustained HammerDB TPC-C load. Six force-deletes total, six successful automatic failovers. I’ll walk through the error log from the promoted replica, the operator’s detection and recovery behavior, and the full timing data.

Walking Through a Planned Failover: SQL Server Always On Availability Groups on Kubernetes

When building the sql-on-k8s-operator, I wanted to make sure it could handle both planned and unplanned failovers. The easy case is a planned failover, where you deliberately move the primary role to another replica. The harder case is an unplanned failover, where the primary pod just disappears. The operator needs to handle both.

I recently ran a full planned failover rotation on a three-replica SQL Server Availability Group managed by sql-on-k8s-operator, and I want to show you exactly what happens inside SQL Server and the operator during each hop. If you’ve been following my Introducing the SQL Server on Kubernetes Operator post, this is the logical next step: what does the error log actually look like during a planned failover, what does the operator do in response, and how long does the whole thing take?

Introducing the SQL Server on Kubernetes Operator

Are you considering replatforming your SQL Server workload due to recent vendor changes, but still need high availability and disaster recovery? You’re not alone. One of the challenges with running SQL Server on Kubernetes is that there’s no Kubernetes operator available. That means no automated lifecycle management, no automatic failover, and no standard way to bootstrap an Always On Availability Group on Kubernetes.

I’m excited to share it today as an open-source project: sql-on-k8s-operator. Let’s go.

Seeding an Availability Group Replica from Snapshot

Background

If you’ve been using Availability Groups, you’re familiar with the replica seeding (sometimes called initializing, preparing or data sychronization) process. Seeding is a size of data operation, copying data from a primary replica to one or more secondary replicas. This is required before joining a database to an Availability Group. You can seed a replica with backup and restore or automatic seeding, each with its own challenges. Regardless of which method you use, the seeding operation can take a long amount of time. The time it takes to seed a replica is based on the size of the database, the speed of the network, and storage. If you have multiple replicas seeding all of them is N times the fun!

Speaking at SQL Saturday Pensacola!

I’m proud to announce that I will be speaking at SQL Saturday Pensacola on June 3rd 2017! Check out the amazing schedule!

If you don’t know what SQLSaturday is, it’s a whole day of free SQL Server training available to you at no cost!

If you haven’t been to a SQLSaturday, what are you waiting for! Sign up now!

My presentation is **“Designing High Availability Database Systems using AlwaysOn Availability Groups” **

Why Did Your Availability Group Creation Fail?

Availability Groups are a fantastic way to provide high availability and disaster recovery for your databases, but it isn’t exactly the easiest thing in the world to pull off correctly. To do it right there’s a lot of planning and effort that goes into your Availability Group topology. The funny thing about AGs is as hard as they are to plan…they’re pretty easy to implement…but sometimes things can go wrong. In this post I’m going to show you how to look into things when creating your AGs fails.

Monitoring SLAs with SQL Monitor Reporting

Proactive Reporting for SQL Server

If you’re a return reader of this blog you know I write often about monitoring and performance of Availability Groups. I’m a very big proponent of using monitoring techniques to ensure you’re meeting your service level agreements in terms of recovery time objective and recovery point objective. In my in person training sessions on “Performance Monitoring AlwaysOn Availability Groups”, I emphasize the need for knowing what your system’s baseline for healthy replication and knowing when your system deviates from that baseline. From a monitoring perspective, there are really two key concepts here I want to dig into…reactive monitoring and proactive monitoring.

Understanding Network Latency and Impact on Availability Group Replication

When designing Availability Group systems one of the first pieces of information I ask clients for is how much transaction log their databases generate. *Roughly*, this is going to account for how much data needs to move between their Availability Group Replicas. With that number we can start working towards the infrastructure requirements for their Availability Group system. I do this because I want to ensure the network has a sufficient amount of bandwidth to move the transaction log generated between all the replicas. Basically are the pipes big enough to handle the generated workload. But bandwidth is only part of the story, we also need to ensure latency is low. Why, well we’re going to explore that together in this post!

Using Extended Events to Visualize Availability Group Replication Internals

SQL 2014 Service Pack 2 was recently released by Microsoft and there is a ton of great new features and enhancements in this release.This isn’t just a collection of bug fixes…there’s some serious value in this Service Pack. Check out the full list here. One of the key things added in this Service Pack is an enhancement of the Extended Events for AlwaysOn Availability Group replication.

Why are the new Availability Group Extended Event interesting?

If you’ve used Availability Groups in production systems with high transaction volumes you know that replication latency can impact your availability. If you want to brush up on that check out our blog posts on AG Replication Latency, Monitoring for replication latency, and issues with the DMVs when monitoring. These new extended events add insight at nearly every point inside your Availability Group’s replication. More importantly they also include duration. So using these Extended Events we can pinpoint latency inside our Availability Group replication.

Availability Group DMVs Reporting Incorrect Values

In my opinion one of the key features of SQL Server 2016 is the rebuilt and optimized log redo mechanism for AlwaysOn Availability Groups. Check out the many new AG features here. Check out my posts here and here to learn about how Availability Groups move data.

Early last week I was conducting a load test using SQL Server 2016 and wanted to compare the performance of the log redo thread with that of SQL Server 2014. To establish baseline the performance of 2014, I constructed a load test using a heavy insert workload on the primary. To measure that workload I used the following script to pull database replication performance data from sys.dm_hadr_database_replica_states