Conversation

@synhershko (Collaborator)

When nodes are drained during K8S or OS upgrades, pods get deleted and recreated, but the StatefulSet revision doesn't change. The existing logic incorrectly triggered rolling restarts in these scenarios, causing all pods with lower ordinal numbers to be terminated simultaneously.

This change adds revision comparison checks to prevent rolling restarts when StatefulSet current and update revisions match, indicating no actual spec changes occurred. Rolling restarts now only happen when there are legitimate configuration updates requiring pod recreation.

Changes:

  • Add revision match checks in RollingRestartReconciler.Reconcile()
  • Enhance WorkingPodForRollingRestart() to skip when revisions match
  • Add debug logging to indicate when rolling restarts are skipped

Fixes issues during node upgrades and other node draining scenarios where pods are recreated without spec changes. Fixes #312.
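A minimal sketch of the guard described above (the helper name, package layout, and call site are illustrative and not the PR's actual code; the revision fields come from the StatefulSet status in k8s.io/api/apps/v1):

package reconcilers

import (
	appsv1 "k8s.io/api/apps/v1"
)

// needsRollingRestart is a hypothetical helper: a rolling restart is only
// warranted when the StatefulSet's update revision differs from its current
// revision, i.e. a real spec change is pending rollout. Pods recreated by a
// node drain keep the same revision, so no restart is triggered for them.
func needsRollingRestart(sts *appsv1.StatefulSet) bool {
	if sts.Status.UpdateRevision != "" && sts.Status.CurrentRevision == sts.Status.UpdateRevision {
		return false
	}
	return true
}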

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Comment on lines 487 to 489
if sts.Status.UpdateRevision != "" && sts.Status.CurrentRevision == sts.Status.UpdateRevision {
	return "", errors.New("no rolling restart needed - StatefulSet revisions match")
}
Contributor

This error is bubbled up to the reconcile func, which means it triggers a reconcile requeue.
I think we don't need to requeue here.
Currently this func has no return case with a nil error and an empty pod name, so the callers of this func only check whether err == nil and never look at the working pod name.

I'd rather check the working pod name in a separate condition. Then we can return a nil error with an empty working pod name for this no-revision-change case.
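A rough sketch of that shape, assuming a controller-runtime style reconciler; every name below is illustrative rather than the operator's actual code:

package reconcilers

import (
	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// workingPodForRollingRestart returns ("", nil) when the revisions match, so
// the no-op case is not reported as an error.
func workingPodForRollingRestart(sts *appsv1.StatefulSet) (string, error) {
	if sts.Status.UpdateRevision != "" && sts.Status.CurrentRevision == sts.Status.UpdateRevision {
		return "", nil // no spec change pending, nothing to restart
	}
	// ... real pod-selection logic would go here ...
	return "example-pod-0", nil
}

// The caller treats an empty pod name as "nothing to restart" and returns
// without requeueing; only real errors trigger a requeue.
func reconcileRollingRestart(sts *appsv1.StatefulSet) (reconcile.Result, error) {
	podName, err := workingPodForRollingRestart(sts)
	if err != nil {
		return reconcile.Result{}, err
	}
	if podName == "" {
		return reconcile.Result{}, nil // revisions match: skip quietly
	}
	// ... delete/restart podName here ...
	return reconcile.Result{}, nil
}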
@synhershko what do you think?

@synhershko (Collaborator, Author)

Sounds good to me - so what is your proposed behavior for node draining?

Contributor

I made a commit to avoid the requeue by returning nil for non-actionable states.

Btw, while making the code change and digging deeper into the RollingRestartReconciler, I started to doubt whether this change is really needed.

As you can see in the current code snippets below, there are already condition checks that avoid a rollout in such a draining case: the restart only starts once all replicas are ready and running, and a pod is only selected for restart when its revision differs from the StatefulSet's update revision.

// Only restart pods if not all pods are updated and the sts is healthy with no pods terminating
if sts.Status.ReadyReplicas == ptr.Deref(sts.Spec.Replicas, 1) {
	if numReadyPods, err := helpers.CountRunningPodsForNodePool(r.client, r.instance, &nodePool); err == nil && numReadyPods == int(ptr.Deref(sts.Spec.Replicas, 1)) {
		lg.Info(fmt.Sprintf("Starting rolling restart of the StatefulSet %s", sts.Name))
		return r.restartStatefulSetPod(&sts)
	}
}

if podRevision != sts.Status.UpdateRevision {
	return &pod, nil
}

I deployed a 3-agent k3s cluster in my local lab and tried to reproduce #312, but I couldn't reproduce it:

  • deploy an OS cluster with a 3-replica nodepool with master and data roles
  • drain the k8s node where the -2 OS replica runs
  • the -2 replica gets rescheduled on another node and the other replicas keep running without a restart

I can see that the original issue was reported against version 2.2.1, which is quite old.
From the above tests and code check, I'd say this issue has already been resolved.

@synhershko what do you think? Do you still experience this issue in your clusters? If so, which version are you using?

@synhershko (Collaborator, Author)

It didn't happen to me, and I wasn't able to reproduce it on our test harness either, but the issue seems legit. Maybe it's some sort of race condition.

@josedev-union force-pushed the fix/prevent-unnecessary-rolling-restarts-on-node-drain branch from 4f95b71 to 7397951 on October 23, 2025, 15:20
Successfully merging this pull request may close these issues: Operator terminates nodes via restart after k8s node removed via spot request termination.
