Walking Through a Planned Failover: SQL Server Availability Groups on Kubernetes

When building the sql-on-k8s-operator, I wanted to make sure it could handle both planned and unplanned failovers. The easy case is a planned failover, where you deliberately move the primary role to another replica. The harder case is an unplanned failover, where the primary pod just disappears. The operator needs to handle both.

I recently ran a full planned failover rotation on a three-replica SQL Server Availability Group managed by sql-on-k8s-operator, and I want to show you exactly what happens inside SQL Server and the operator during each hop. If you’ve been following my Introducing the SQL Server on Kubernetes Operator post, this is the logical next step: what does the error log actually look like during a planned failover, what does the operator do in response, and how long does the whole thing take?

I ran the same three-hop rotation twice: once with an idle 5GB database to establish a baseline, and once under a sustained TPC-C workload with HammerDB. In this post, I’ll walk through the SQL Server error log entries, the operator’s reconcile behavior, and the timing data for both runs. In the next blog post, I’ll show what happens during an unplanned failover. Let’s go.

What We’re Testing

A planned failover is the normal maintenance operation for any DBA running an Availability Group. You pick a healthy, synchronized secondary, issue ALTER AVAILABILITY GROUP [AG1] FAILOVER on it, and the primary role moves. No data loss, no drama. I want to show that this works the same way when an operator is managing the AG on Kubernetes, and what the error log and operator logs look like at every stage.

The test executes a full rotation: mssql-ag-0 to mssql-ag-1, then mssql-ag-1 to mssql-ag-2, then mssql-ag-2 back to mssql-ag-0. Three hops, returning the primary to its original pod. Before every hop in this test we wait until all three replicas are confirmed HEALTHY and SYNCHRONIZED.
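That readiness check maps to a straightforward DMV query. Here's a sketch of what "all replicas HEALTHY and SYNCHRONIZED" looks like in T-SQL, run on the current primary (the exact query the test harness uses is an assumption):

```sql
-- Confirm every replica is healthy and every AG database is synchronized
-- before issuing the next hop (run on the current primary)
SELECT ar.replica_server_name,
       ars.synchronization_health_desc,   -- expect HEALTHY
       drs.synchronization_state_desc     -- expect SYNCHRONIZED
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ars.replica_id = ar.replica_id
JOIN sys.dm_hadr_database_replica_states AS drs
  ON drs.replica_id = ars.replica_id;
```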

The Test Environment

The Availability Group is deployed with the SQLServerAvailabilityGroup custom resource using all three replicas in SynchronousCommit mode with Automatic failover, running on AKS with managed-csi-premium storage:

apiVersion: sql.mssql.microsoft.com/v1alpha1
kind: SQLServerAvailabilityGroup
metadata:
  name: mssql-ag
spec:
  agName: "AG1"
  image: mcr.microsoft.com/mssql/server:2025-CU3-ubuntu-22.04
  edition: Developer
  clusterType: EXTERNAL
  automaticFailover:
    enabled: true
    failoverThresholdSeconds: 30
    healthThreshold: system
  replicas:
    - name: primary
      availabilityMode: SynchronousCommit
      failoverMode: Automatic
    - name: secondary-1
      availabilityMode: SynchronousCommit
      failoverMode: Automatic
      readableSecondary: true
    - name: secondary-2
      availabilityMode: SynchronousCommit
      failoverMode: Automatic
      readableSecondary: true
  storage:
    dataVolumeSize: "20Gi"
    storageClassName: managed-csi-premium
    reclaimPolicy: Delete
  listener:
    name: mssql-ag-listener
    port: 1433
    serviceType: LoadBalancer
  readOnlyListener:
    name: mssql-ag-listener-ro
    port: 1433
    serviceType: LoadBalancer

Three synchronous-commit replicas, all of which are eligible for automatic failover. The operator manages the full lifecycle: bootstrap, certificate exchange, endpoint verification, and listener routing via pod labels.

How the Failover Command Works

When you issue a planned failover, the command runs on the target secondary, not the current primary. The T-SQL looks like this:

EXEC sp_set_session_context @key = N'external_cluster', @value = N'yes';
ALTER AVAILABILITY GROUP [AG1] FAILOVER;

The sp_set_session_context call is required because the AG is configured with CLUSTER_TYPE = EXTERNAL. That tells SQL Server the external cluster manager, in this case the operator, is authorizing the role change. Without it, SQL Server rejects the command. This is the same mechanism Microsoft’s mssql-server-ha resource agent uses with Pacemaker.

After the failover completes, the operator’s reconcile loop detects the new primary via sys.dm_hadr_availability_replica_states, updates pod labels (sql.mssql.microsoft.com/ag-role=primary), and the listener service selector automatically routes traffic to the new primary pod. No manual intervention required.
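You can run the same check the operator relies on yourself. A sketch of a role query against that DMV (the operator's exact query may differ):

```sql
-- Which replica currently holds the PRIMARY role?
SELECT ar.replica_server_name,
       ars.role_desc,              -- PRIMARY or SECONDARY
       ars.connected_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ars.replica_id = ar.replica_id
WHERE ars.role_desc = N'PRIMARY';
```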

Planned Failover: A 5GB Database with No Load

First, the baseline. A 5GB tpcc database with no active transactions, all three replicas confirmed HEALTHY and SYNCHRONIZED before each hop. Here we're measuring the time from the moment the ALTER AVAILABILITY GROUP FAILOVER command is issued until all three replicas report HEALTHY and SYNCHRONIZED again.

Hop  Direction                   Time (s)
1    mssql-ag-0 -> mssql-ag-1    41
2    mssql-ag-1 -> mssql-ag-2    71
3    mssql-ag-2 -> mssql-ag-0    68

The Error Log During Promotion

Let’s look at what SQL Server writes to the error log when a secondary becomes the new primary. Here’s mssql-ag-2 during hop 2, the moment the FAILOVER command runs on it:

01:36:16  The state of the local availability replica in availability group 'AG1'
          has changed from 'SECONDARY_NORMAL' to 'RESOLVING_PENDING_FAILOVER'.
          The state changed because of a user initiated failover.
01:36:16  The local replica of availability group 'AG1' is preparing
          to transition to the primary role.
01:36:16  The state ... changed from 'RESOLVING_PENDING_FAILOVER' to 'RESOLVING_NORMAL'.
01:36:16  The state ... changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'.
01:36:16  The state ... changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'.
01:36:16  The availability group database "tpcc" is changing roles
          from "SECONDARY" to "RESOLVING" ... from "RESOLVING" to "PRIMARY"

All of that happens at 01:36:16, the same second. The state machine walks through SECONDARY_NORMAL to RESOLVING_PENDING_FAILOVER to RESOLVING_NORMAL to PRIMARY_PENDING to PRIMARY_NORMAL, and the database transitions from SECONDARY through RESOLVING to PRIMARY. The role promotion itself is nearly instantaneous.

The Error Log During Demotion

Here’s the other side. When mssql-ag-2 gives up the primary role during hop 3, the error log shows:

01:37:33  The local replica of availability group 'AG1' is preparing
          to transition to the resolving role.
01:37:33  The state ... changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'.
          The replica is going offline because the availability group is failing over
          to another SQL Server instance.
01:37:33  The availability group database "tpcc" is changing roles
          from "PRIMARY" to "RESOLVING" ... from "RESOLVING" to "SECONDARY"
01:37:33  The state ... changed from 'RESOLVING_NORMAL' to 'SECONDARY_NORMAL'.

Again, all in the same second. PRIMARY_NORMAL to RESOLVING_NORMAL to SECONDARY_NORMAL, and the database goes PRIMARY to RESOLVING to SECONDARY. The demotion is just as fast as the promotion.

What the Operator Does

In a planned failover, the operator doesn’t issue the failover command; that’s your job. What it does is detect the role change and update Kubernetes to match. Here’s the operator log for hop 2:

{"ts":"01:37:11","msg":"Reconciling SQLServerAvailabilityGroup"}
{"ts":"01:37:13","msg":"Updated pod AG role label","pod":"mssql-ag-2","role":"primary"}
{"ts":"01:37:13","msg":"Updated pod AG role label","pod":"mssql-ag-1","role":"readable-secondary"}

The reconcile loop runs, queries sys.dm_hadr_availability_replica_states, sees that mssql-ag-2 is now the primary, and patches the pod labels. The listener Service has a selector targeting ag-role=primary, so the moment the label changes, Kubernetes routes traffic to the new primary. The detection-and-relabel cycle takes about two seconds.
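For reference, the listener Service the operator manages looks roughly like this. The selector label key comes from the pod labels shown above, but the full Service spec here is an assumption, not copied from the operator:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mssql-ag-listener
spec:
  type: LoadBalancer
  selector:
    sql.mssql.microsoft.com/ag-role: primary   # the patched pod label selects the new primary
  ports:
    - name: tds
      port: 1433
      targetPort: 1433
```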

Between hops, the operator also re-seats replicas that are temporarily NOT SYNCHRONIZING:

{"ts":"01:37:20","msg":"Detected NOT SYNCHRONIZING secondary; re-seating with SET(ROLE=SECONDARY)","pod":"mssql-ag-0"}
{"ts":"01:37:20","msg":"Re-seated NOT SYNCHRONIZING replica","pod":"mssql-ag-0"}

A re-seat is when the operator issues ALTER AVAILABILITY GROUP [AG1] SET (ROLE = SECONDARY) on a secondary that has fallen out of synchronization. With CLUSTER_TYPE = EXTERNAL, SQL Server requires the external cluster manager to explicitly set the secondary role; the replica won’t self-assign it. Re-issuing SET (ROLE = SECONDARY) forces the replica to re-establish its database mirroring session with the new primary. It’s part of the normal convergence after a role change.
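Spelled out, the re-seat looks like this. The session-context call here is my assumption, mirroring the failover command, since both are governance operations under CLUSTER_TYPE = EXTERNAL:

```sql
-- Run on the NOT SYNCHRONIZING secondary to re-establish its
-- mirroring session with the new primary
EXEC sp_set_session_context @key = N'external_cluster', @value = N'yes';
ALTER AVAILABILITY GROUP [AG1] SET (ROLE = SECONDARY);
```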

Planned Failover: Under TPC-C Load

In this test, we’re doing the same three-hop rotation, but with a sustained HammerDB TPC-C workload (50 warehouses) running against the primary through the listener service. The HammerDB process reconnects after each failover since the connection drops when the primary moves. During the test I’m also running log backups every 30 seconds, both to keep the log reuse wait at 0 and to surface any gaps in the log sequence during replication.
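The 30-second log backup loop is a plain BACKUP LOG against the primary; a minimal sketch (the backup path and naming are illustrative, not the harness's actual script):

```sql
-- Log backup of tpcc, issued every 30 seconds by the test harness
BACKUP LOG [tpcc]
TO DISK = N'/var/opt/mssql/backup/tpcc.trn'
WITH CHECKSUM;
```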

Hop  Direction                   Time (s)
1    mssql-ag-0 -> mssql-ag-1    127
2    mssql-ag-1 -> mssql-ag-2    87
3    mssql-ag-2 -> mssql-ag-0    49

Every hop passed, but the times are longer and a little more variable, especially the first hop at 127 seconds. Under load, the redo queue on the secondaries has more work to flush before they can report SYNCHRONIZED. It’s important to call out that the role transition itself is still nearly instantaneous; the extra time is all convergence. Let’s look closer.

The Role Transition Is Still Instant

Even under a TPC-C workload, the error log tells the same story. Here’s mssql-ag-2 during hop 2, promoting to primary:

02:35:25  The state of the local availability replica in availability group 'AG1'
          has changed from 'SECONDARY_NORMAL' to 'RESOLVING_PENDING_FAILOVER'.
          The state changed because of a user initiated failover.
02:35:25  The local replica of availability group 'AG1' is preparing
          to transition to the primary role.
02:35:25  The state ... changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'.
02:35:25  The availability group database "tpcc" is changing roles
          from "RESOLVING" to "PRIMARY"

Same sub-second promotion. The role transition itself isn’t what takes longer under load. It’s the convergence afterward: the secondaries need to drain their redo queues and re-establish SYNCHRONIZED state before the AG reports healthy.

The Operator Handles a Stuck Old Primary

Under load, the operator ran into something interesting during hop 1. The old primary (mssql-ag-0) didn’t cleanly release the PRIMARY role right away:

{"ts":"02:33:44","msg":"Detected NOT SYNCHRONIZING secondary; re-seating with SET(ROLE=SECONDARY)","pod":"mssql-ag-0"}
{"ts":"02:33:44","msg":"Msg 41104; pod still reports PRIMARY locally — issuing ALTER AG OFFLINE to reset state","pod":"mssql-ag-0"}

The operator saw that mssql-ag-0 was supposed to be a secondary but was still reporting PRIMARY locally, SQL Server error 41104. That’s a split-brain scenario: the local replica thinks it’s primary while the AG as a whole disagrees. So the operator issued ALTER AVAILABILITY GROUP [AG1] OFFLINE on the former primary to clear its stale primary state, then re-seated it so it could take the SECONDARY role. This is one of those edge cases the operator handles automatically that you’d otherwise have to catch and fix by hand.
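Written out as T-SQL, the recovery sequence follows the log messages above; treat this as a reconstruction rather than the operator's literal code:

```sql
-- On the stuck former primary (error 41104): clear the stale PRIMARY state
EXEC sp_set_session_context @key = N'external_cluster', @value = N'yes';
ALTER AVAILABILITY GROUP [AG1] OFFLINE;

-- On the next reconcile: re-seat the replica as a secondary
ALTER AVAILABILITY GROUP [AG1] SET (ROLE = SECONDARY);
```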

After the OFFLINE reset, the next reconcile picked up cleanly:

{"ts":"02:33:52","msg":"Updated pod AG role label","pod":"mssql-ag-1","role":"primary"}
{"ts":"02:33:52","msg":"Updated pod AG role label","pod":"mssql-ag-0","role":"secondary"}

Comparing No-Load vs. Under-Load

Hop  Direction       No-Load (s)  Under TPC-C (s)
1    ag-0 -> ag-1    41           127
2    ag-1 -> ag-2    71           87
3    ag-2 -> ag-0    68           49

The first hop under load takes 3x longer than idle. That’s the redo queue effect: with active transactions generating log records, the secondaries need to flush more data before reporting SYNCHRONIZED. The second and third hops converge, and the third hop under load is actually faster, likely because the redo queues had caught up during the previous wait.

The key insight is that the role transition itself is always sub-second, both under load and at idle. What varies is the convergence time, how long it takes all three replicas to get back to SYNCHRONIZED. That’s driven by transaction log activity, not Kubernetes overhead.

Wrapping Up

Planned failover on Kubernetes works exactly how you’d expect it to. The same ALTER AVAILABILITY GROUP FAILOVER T-SQL (well, you have to call sp_set_session_context first), the same state transitions in the error log, the same redo queue behavior under load. The operator handles these scenarios properly by detecting the new primary via DMV queries, relabeling pods so the listener routes traffic correctly, and re-seating replicas that fall behind during the transition. Under load, it even catches edge cases like a stuck old primary and resolves them automatically. The error logs confirm that the role change itself is sub-second, and the timing data shows convergence is driven by workload activity, not the container platform. Clone the repo, run the tests against your cluster, and let me know how it works in your environment.