Walking Through an Unplanned Failover: SQL Server Availability Groups on Kubernetes

In my planned failover walkthrough, I showed what happens when you deliberately move the primary role to another replica. That’s the easy case. Now I want to show what happens when the primary pod just disappears unexpectedly, like during a node failure or a container crash. No graceful shutdown, no demotion, just gone.

I ran two test scenarios, each cycling the primary role across all three pods by force-deleting the current primary three times in a row. First, against an idle 5GB TPC-C database. Then, against that same 5GB database under sustained HammerDB TPC-C load. Six force-deletes total, six successful automatic failovers. I’ll walk through the error log from the promoted replica, the operator’s detection and recovery behavior, and the full timing data.

What We’re Testing

With CLUSTER_TYPE = EXTERNAL, SQL Server won’t auto-promote on its own; it relies on the external cluster manager, in this case the sql-on-k8s-operator, to detect the failure and issue the failover command. Each test kills the current primary with kubectl delete pod --force --grace-period=0, which terminates the pod immediately with a SIGKILL, then waits for the operator to promote a new primary and bring all three replicas back to HEALTHY and SYNCHRONIZED. No manual intervention, no T-SQL issued by the test script.
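You can confirm the cluster type yourself before testing. A minimal sketch, assuming sa authentication via an SA_PASSWORD environment variable, a container named mssql, and the mssql-tools18 sqlcmd path (all of which may differ in your deployment):

```shell
#!/usr/bin/env bash
# Query the AG's cluster type directly from the catalog view.
# Expects EXTERNAL for an operator-managed AG.
ag_cluster_type() {
  local pod=$1
  kubectl exec "$pod" -c mssql -- /opt/mssql-tools18/bin/sqlcmd \
    -S localhost -U sa -P "$SA_PASSWORD" -C -h -1 -W \
    -Q "SET NOCOUNT ON; SELECT cluster_type_desc FROM sys.availability_groups WHERE name = 'AG1'"
}
```

Run it as `ag_cluster_type mssql-ag-0`; anything other than EXTERNAL means SQL Server is managing failover itself and the operator’s promotion logic never comes into play.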

The Test Environment

Same setup as the planned failover post. Three synchronous-commit replicas with automatic failover enabled, running on AKS with managed-csi-premium storage:

apiVersion: sql.mssql.microsoft.com/v1alpha1
kind: SQLServerAvailabilityGroup
metadata:
  name: mssql-ag
spec:
  agName: "AG1"
  image: mcr.microsoft.com/mssql/server:2025-CU3-ubuntu-22.04
  edition: Developer
  clusterType: EXTERNAL
  automaticFailover:
    enabled: true
    failoverThresholdSeconds: 30
    healthThreshold: system
  replicas:
    - name: primary
      availabilityMode: SynchronousCommit
      failoverMode: Automatic
    - name: secondary-1
      availabilityMode: SynchronousCommit
      failoverMode: Automatic
      readableSecondary: true
    - name: secondary-2
      availabilityMode: SynchronousCommit
      failoverMode: Automatic
      readableSecondary: true
  storage:
    dataVolumeSize: "20Gi"
    storageClassName: managed-csi-premium
    reclaimPolicy: Delete
  listener:
    name: mssql-ag-listener
    port: 1433
    serviceType: LoadBalancer
  readOnlyListener:
    name: mssql-ag-listener-ro
    port: 1433
    serviceType: LoadBalancer

How the Kill Works

The test script gets the current primary from the CR status and force-deletes it:

PRIMARY=$(kubectl get sqlag mssql-ag -o jsonpath='{.status.primaryReplica}')
kubectl delete pod "$PRIMARY" --grace-period=0 --force

--grace-period=0 --force bypasses the normal termination sequence: no SIGTERM, no preStop hook, no chance for SQL Server to demote gracefully. The container is killed immediately and the StatefulSet controller creates a replacement within seconds.
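After the kill, the test script has to wait for the operator to finish its work. A sketch of that wait loop, polling the same .status.primaryReplica field used above (the field path is taken from this post’s kill snippet; the timeout handling is my own addition):

```shell
#!/usr/bin/env bash
# Poll the SQLServerAvailabilityGroup CR until a different pod is reported
# as primary, or give up after a timeout (default 300s).
wait_for_failover() {
  local old_primary=$1 timeout=${2:-300} start now new_primary
  start=$(date +%s)
  while true; do
    new_primary=$(kubectl get sqlag mssql-ag -o jsonpath='{.status.primaryReplica}')
    if [ -n "$new_primary" ] && [ "$new_primary" != "$old_primary" ]; then
      echo "new primary: $new_primary after $(( $(date +%s) - start ))s"
      return 0
    fi
    now=$(date +%s)
    if (( now - start > timeout )); then
      echo "timed out waiting for failover" >&2
      return 1
    fi
    sleep 2
  done
}
```

A full test harness would then keep polling until every replica reports SYNCHRONIZED, which is what the Full Sync timings below measure.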

How the Operator Detects and Responds

The operator has two paths for handling an unplanned primary loss, and which one fires depends on how fast the pod comes back.

Path 1: NotReady threshold. If the primary pod is still NotReady when the operator reconciles, it starts a countdown timer (failoverThresholdSeconds). If the pod stays NotReady past the threshold, the operator selects the best synchronized secondary and issues ALTER AVAILABILITY GROUP [AG1] FAILOVER on it.

Path 2: Headless AG detection. If the StatefulSet recreates the pod fast enough that it’s already “Ready” by the next reconcile, the operator checks the pod’s actual SQL Server role via sys.dm_hadr_availability_replica_states. If the recorded primary is serving as SECONDARY or RESOLVING (because the restarted SQL Server instance comes back without the primary role), the operator scans all pods for the real primary. If another pod has assumed the role, it corrects its records. If no pod is primary, it issues an immediate failover.
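The Path 2 role scan is easy to reproduce by hand with the same DMV the operator queries. A sketch, assuming sa authentication via SA_PASSWORD, a container named mssql, and the mssql-tools18 sqlcmd path:

```shell
#!/usr/bin/env bash
# Ask each pod's local SQL Server instance what role it believes it holds.
# A healthy AG shows exactly one PRIMARY; after a kill you may briefly see
# RESOLVING or no primary at all.
scan_ag_roles() {
  local pod role
  for pod in mssql-ag-0 mssql-ag-1 mssql-ag-2; do
    role=$(kubectl exec "$pod" -c mssql -- /opt/mssql-tools18/bin/sqlcmd \
      -S localhost -U sa -P "$SA_PASSWORD" -C -h -1 -W \
      -Q "SET NOCOUNT ON; SELECT role_desc FROM sys.dm_hadr_availability_replica_states WHERE is_local = 1")
    echo "$pod: $role"
  done
}
```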

Both paths use the same ALTER AVAILABILITY GROUP FAILOVER command with sp_set_session_context authorization. The operator never uses FORCE_FAILOVER_ALLOW_DATA_LOSS; if the target secondary isn’t synchronized (error 41142), it retries on the next reconcile rather than risking data loss.

Across the six kills in this test, Path 2 drove every actual promotion. The SQL Server state transition from SECONDARY_NORMAL to PRIMARY_NORMAL completed in tens of milliseconds on the newly elected primary. The recovery times reported below (63-99 seconds) measure time to full SYNCHRONIZED across all three replicas; the bulk of that window goes to re-seating stuck replicas and the endpoint restart escalation.

Re-seating is the operator’s term for issuing ALTER AVAILABILITY GROUP SET (ROLE = SECONDARY) on a replica that’s stuck in NOT SYNCHRONIZING. It forces the replica to drop its current session and re-join the AG as a secondary, which usually kicks automatic seeding back into gear without restarting the pod.
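If you ever want to try the same re-seat by hand while debugging, a sketch (same assumptions as before: sa auth via SA_PASSWORD, container named mssql, mssql-tools18 sqlcmd path; the T-SQL is the command quoted above):

```shell
#!/usr/bin/env bash
# Issue the operator's re-seat command against a stuck secondary.
# Must be run ON the stuck secondary, not the primary.
reseat_replica() {
  local pod=$1
  kubectl exec "$pod" -c mssql -- /opt/mssql-tools18/bin/sqlcmd \
    -S localhost -U sa -P "$SA_PASSWORD" -C \
    -Q "ALTER AVAILABILITY GROUP [AG1] SET (ROLE = SECONDARY)"
}
```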

Path 1 isn’t purely theoretical, though. This is the path that will be used when a node failure takes out the primary pod. The StatefulSet can’t recreate the pod until the node is back online, so the primary will be NotReady for however long the node is down. In one of the six kills (TPCC-5G Kill 2, covered in detail below), the operator did log "Primary pod NotReady; starting failover threshold timer" for the killed pod. The timer was cleared 7 seconds later when the replacement pod became Ready. Path 2 had already issued the failover on a different pod in an earlier reconcile. I’ll cover a scenario where Path 1 actually drives the promotion in an upcoming post.

Unplanned Failover: TPCC-5G (No Load)

First, the baseline: a three-kill rotation against a 5GB TPC-C database with no active workload.

Kill   Pod Killed   New Primary   Full Sync (s)   Result
1      mssql-ag-0   mssql-ag-1    89              PASS
2      mssql-ag-1   mssql-ag-0    80              PASS
3      mssql-ag-0   mssql-ag-1    77              PASS

Recovery times are remarkably consistent here (77-89s), clustering tightly around 80 seconds. The 5GB database didn’t materially affect the recovery window because automatic seeding only needs to reconcile changes since the kill, not re-seed from scratch.
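You can watch the synchronization state converge during a run with the standard HADR DMVs. A sketch, querying from whichever pod is currently primary (same sa/SA_PASSWORD, container, and sqlcmd-path assumptions as the earlier snippets):

```shell
#!/usr/bin/env bash
# Print each replica's synchronization state as seen from the primary.
# During recovery you'll see NOT SYNCHRONIZING flip to SYNCHRONIZING and
# finally SYNCHRONIZED as seeding catches up.
ag_sync_states() {
  local primary=$1
  kubectl exec "$primary" -c mssql -- /opt/mssql-tools18/bin/sqlcmd \
    -S localhost -U sa -P "$SA_PASSWORD" -C -h -1 -W -s"|" -Q \
    "SET NOCOUNT ON;
     SELECT ar.replica_server_name, drs.synchronization_state_desc
     FROM sys.dm_hadr_database_replica_states drs
     JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id"
}
```

Running this in a watch loop alongside the kill is how you see the Full Sync numbers in these tables accumulate in real time.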

Unplanned Failover: TPCC-5G Under TPC-C Load

Same 5GB database, but now with a sustained HammerDB TPC-C workload (50 warehouses, 8 virtual users) running against the primary through the listener throughout the test. HammerDB is restarted after each failover to resume load on the new primary.

Kill   Pod Killed   New Primary   Full Sync (s)   Result
1      mssql-ag-0   mssql-ag-1    80              PASS
2      mssql-ag-1   mssql-ag-0    99              PASS
3      mssql-ag-0   mssql-ag-1    63              PASS

Under sustained load, the recovery window ranges from 63 to 99 seconds.

Anatomy of a Failover: TPCC-5G Under Load

To show what the escalation actually looks like end-to-end, here’s the deep dive on the TPCC-5G-under-load Kill 2: mssql-ag-1 (the primary) was force-deleted at 17:05:50 UTC, mssql-ag-0 was promoted, and all three replicas were back to SYNCHRONIZED 99 seconds later. This run is the most instructive because it exercises every stage the operator has to walk through (re-seat, HADR endpoint restart, and bilateral endpoint restart), and it does so on two different secondaries.

The Error Log from the Newly Elected Primary

On mssql-ag-0, the full state machine transition runs in 40 milliseconds:

17:05:51.490  The state ... changed from 'SECONDARY_NORMAL' to 'RESOLVING_PENDING_FAILOVER'.
              The state changed because of a user initiated failover.

17:05:51.500  The local replica of availability group 'AG1' is preparing
              to transition to the primary role.

17:05:51.500  The state ... changed from 'RESOLVING_PENDING_FAILOVER' to 'RESOLVING_NORMAL'.
17:05:51.510  The state ... changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'.
17:05:51.510  The availability group database "tpcc" is changing roles
              from "SECONDARY" to "RESOLVING"
17:05:51.520  The state ... changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'.
17:05:51.530  The availability group database "tpcc" is changing roles
              from "RESOLVING" to "PRIMARY"

About 10 seconds later, the new primary hits connection timeouts to both other replicas as the killed pod restarts and the surviving secondary briefly drops its session:

17:06:01.540  A connection timeout has occurred while attempting to establish a connection
              to availability replica 'mssql-ag-2'
17:06:01.600  A connection timeout has occurred on a previously established connection
              to availability replica 'mssql-ag-1'

What the Operator Does

Within 1-2 seconds of the kill, the operator’s Path 2 detection runs the failover against mssql-ag-0. The SQL-side state transitions you saw above complete by 17:05:51.530, and the operator then finishes updating the pod role labels as the reconcile proceeds:

{"ts":"17:06:00","msg":"Updated pod AG role label","pod":"mssql-ag-1","role":"primary"}
{"ts":"17:06:00","msg":"Primary pod NotReady; starting failover threshold timer",
 "pod":"mssql-ag-1","threshold":30}
{"ts":"17:06:01","msg":"Updated pod AG role label","pod":"mssql-ag-0","role":"primary"}
{"ts":"17:06:01","msg":"Updated pod AG role label","pod":"mssql-ag-1","role":"readable-secondary"}
{"ts":"17:06:07","msg":"Primary pod recovered; clearing failover timer","pod":"mssql-ag-0"}

This is the Path 1 / Path 2 interaction I mentioned earlier. The NotReady threshold timer starts at 17:06:00 for the killed mssql-ag-1, but by the next reconcile the operator has already confirmed mssql-ag-0 is the new primary, flipped the labels, and cleared the timer at 17:06:07. Path 1 never issued a FAILOVER; Path 2 already had.

The listener service now routes to mssql-ag-0.

The Recovery Phase

Now the long tail. Both secondaries show up as NOT SYNCHRONIZING: mssql-ag-1 because it was just force-restarted, and mssql-ag-2 because its replication session to the old primary got torn down. The operator re-seats each one with ALTER AVAILABILITY GROUP SET (ROLE = SECONDARY):

{"ts":"17:06:08","msg":"Detected NOT SYNCHRONIZING secondary; re-seating with SET(ROLE=SECONDARY)", "pod":"mssql-ag-1"}
{"ts":"17:06:09","msg":"Re-seated NOT SYNCHRONIZING replica","pod":"mssql-ag-1"}
{"ts":"17:06:09","msg":"Detected NOT SYNCHRONIZING secondary; re-seating with SET(ROLE=SECONDARY)", "pod":"mssql-ag-2"}
{"ts":"17:06:09","msg":"Re-seated NOT SYNCHRONIZING replica","pod":"mssql-ag-2"}

The re-seats succeed but neither replica recovers. After enough stuck time, the operator escalates. First it restarts the HADR endpoint on the stuck secondary; when that still doesn’t clear the state, it escalates to a bilateral endpoint restart, cycling the endpoint on both the stuck secondary and the primary to clear transport-level state on both sides:

{"ts":"17:06:35","msg":"Secondary persistently NOT SYNCHRONIZING despite successful re-seats; restarting HADR endpoint",
 "pod":"mssql-ag-1","stuckFor":26}

{"ts":"17:06:43","msg":"Secondary persistently NOT SYNCHRONIZING; escalating to bilateral endpoint restart",
 "pod":"mssql-ag-2","stuckFor":33}
{"ts":"17:06:50","msg":"Restarting HADR endpoint on primary to clear bilateral transport state",
 "primary":"mssql-ag-0"}

{"ts":"17:07:07","msg":"Secondary persistently NOT SYNCHRONIZING; escalating to bilateral endpoint restart",
 "pod":"mssql-ag-1","stuckFor":58}
{"ts":"17:07:14","msg":"Restarting HADR endpoint on primary to clear bilateral transport state",
 "primary":"mssql-ag-0"}

That’s the full escalation process: re-seat, HADR endpoint restart, and then bilateral endpoint restart. Each step is triggered by a stuckFor threshold, and the operator only climbs as far as it needs to. After the second bilateral restart, both secondaries reconnect, complete seeding, and the AG reaches full SYNCHRONIZED by 17:07:29 — 99 seconds after the kill.
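For completeness, here’s what the endpoint-restart step looks like if you ever need to run it by hand against a stuck replica. This sketch looks the endpoint name up from sys.database_mirroring_endpoints rather than assuming one; the pod exec pattern, container name, sqlcmd path, and sa auth are the same assumptions as the earlier snippets:

```shell
#!/usr/bin/env bash
# Stop and restart the HADR (database mirroring) endpoint on one pod,
# discovering the endpoint's name from the catalog first.
restart_hadr_endpoint() {
  local pod=$1 name
  name=$(kubectl exec "$pod" -c mssql -- /opt/mssql-tools18/bin/sqlcmd \
    -S localhost -U sa -P "$SA_PASSWORD" -C -h -1 -W \
    -Q "SET NOCOUNT ON; SELECT name FROM sys.database_mirroring_endpoints")
  kubectl exec "$pod" -c mssql -- /opt/mssql-tools18/bin/sqlcmd \
    -S localhost -U sa -P "$SA_PASSWORD" -C \
    -Q "ALTER ENDPOINT [$name] STATE = STOPPED; ALTER ENDPOINT [$name] STATE = STARTED"
}
```

For the bilateral variant, you’d run the same function against both the stuck secondary and the primary, which is exactly the order the operator logs above show.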

Wrapping Up

In this post, the primary pod disappears, the operator’s headless AG detection promotes a new primary on the SQL side in tens of milliseconds, and the reconcile loop flips the pod labels within a few seconds. The recovery escalation (re-seating, HADR endpoint restart, bilateral endpoint restart) runs automatically until all replicas reach SYNCHRONIZED. Six kills across two scenarios, all passed, with recovery ranging from 63 to 99 seconds. The failoverThresholdSeconds: 30 timer started exactly once (and was cleared 7 seconds later by Path 2); it’s there as insurance for network partitions and multi-pod failures where the surviving replicas can’t report the headless state.

Clone the repo, run the tests against your cluster, and let me know how it works in your environment.