
Conversation

@josedev-union
Contributor

Description

Quorum-safe rolling restarts across nodepools

  • Implement global candidate selection for restart beyond a single StatefulSet (STS) scope (a selection sketch follows this list)
  • Enforce one deletion per reconcile (maxUnavailable=1)
  • Guarantee that only one master restarts at a time, with cluster-wide quorum checks
  • Keep the role-aware path as a fallback when no global candidate is selected
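
For context, a minimal Go sketch of the first two points, global candidate selection with a one-deletion-per-reconcile cap. All types and helper names here are invented for illustration and are not the operator's actual API:

```go
package main

import "fmt"

// Pod is an invented stand-in for a pod pending restart; the real
// operator works with corev1.Pod and StatefulSet revisions.
type Pod struct {
	Name     string
	NodePool string
	IsMaster bool
	Outdated bool // pod spec hash differs from the desired revision
}

// selectRestartCandidate scans pods across all nodepools (not just one
// STS) and returns at most one pod to delete this reconcile, which is
// what enforces maxUnavailable=1 globally. Non-masters are preferred so
// master churn is kept to the unavoidable minimum.
func selectRestartCandidate(pods []Pod) (Pod, bool) {
	var master *Pod
	for i := range pods {
		p := pods[i]
		if !p.Outdated {
			continue
		}
		if !p.IsMaster {
			return p, true // restart non-masters first
		}
		if master == nil {
			master = &pods[i]
		}
	}
	if master != nil {
		return *master, true // one master at a time; quorum is gated separately
	}
	return Pod{}, false
}

func main() {
	pods := []Pod{
		{Name: "masters-0", NodePool: "masters", IsMaster: true, Outdated: true},
		{Name: "data-1", NodePool: "data", Outdated: true},
	}
	if c, ok := selectRestartCandidate(pods); ok {
		fmt.Printf("delete %s this reconcile, then requeue\n", c.Name)
	}
}
```

Returning at most one pod per reconcile (and requeueing) is what makes maxUnavailable=1 hold across nodepools, not just within a single STS.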

Issues Resolved

#650
#738

Check List

  • Commits are signed per the DCO using --signoff
  • Unit tests added for the new/changed functionality and all unit tests are successful
  • Customer-visible features documented
  • No linter warnings (make lint)

If CRDs are changed:

  • CRD YAMLs updated (make manifests) and also copied into the helm chart
  • Changes to CRDs documented

Please refer to the PR guidelines before submitting this pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on the Developer Certificate of Origin and signing off your commits, please check here.

@josedev-union
Contributor Author

@synhershko @prudhvigodithi
The story is more complex than initially thought: the rolling restart is interfered with by the clusterReconciler, whose nodepool recovery logic causes STS recreation.
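
As a rough illustration of the coordination this implies, with every name invented (the operator's real status type and flag will differ), one option is for nodepool recovery to yield while a restart is in flight:

```go
package main

import "fmt"

// ClusterStatus is an invented stand-in for the operator's cluster
// status; the real type and field names will differ.
type ClusterStatus struct {
	RollingRestartInProgress bool
}

// shouldRecoverNodePool sketches one way to keep the cluster reconciler's
// nodepool recovery (STS recreation) from racing a rolling restart:
// recovery simply yields while a restart is in flight.
func shouldRecoverNodePool(s ClusterStatus) bool {
	return !s.RollingRestartInProgress
}

func main() {
	fmt.Println(shouldRecoverNodePool(ClusterStatus{RollingRestartInProgress: true}))  // false: yield to the restart
	fmt.Println(shouldRecoverNodePool(ClusterStatus{RollingRestartInProgress: false})) // true: recovery may proceed
}
```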

This PR is WIP, but I'd welcome your feedback.

- Implement global candidate selection for restart beyond a single StatefulSet (STS) scope
- Enforce one deletion per reconcile (maxUnavailable=1)
- Guarantee that only one master restarts at a time, with cluster-wide quorum checks (a quorum-gate sketch follows this list)
- Keep the role-aware path as a fallback when no global candidate is selected
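
To illustrate the quorum gate in the third point: the majority arithmetic below is standard, but the function and its inputs are invented, and a real check would consult OpenSearch cluster-manager state rather than pod readiness alone:

```go
package main

import "fmt"

// canRestartMaster gates deletion of a single master pod: the remaining
// ready masters must still form a strict majority (quorum) of the
// configured master count, measured cluster-wide rather than per nodepool.
func canRestartMaster(totalMasters, readyMasters int) bool {
	quorum := totalMasters/2 + 1
	return readyMasters-1 >= quorum
}

func main() {
	fmt.Println(canRestartMaster(3, 3)) // true: 2 masters remain, quorum is 2
	fmt.Println(canRestartMaster(3, 2)) // false: 1 master would remain, below quorum of 2
}
```

With quorum = floor(n/2) + 1, a 3-master cluster can lose exactly one master; combined with the single-candidate rule, this keeps elections safe during the restart.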

Signed-off-by: josedev-union <[email protected]>
@josedev-union josedev-union marked this pull request as ready for review October 23, 2025 14:42
Collaborator

@synhershko left a comment


The design doc needs updating.

Also, I'm missing additional safety changes; for example, the following at least need to be discussed:

  1. Do we want to run prechecks before the rolling restart starts, and what should they be? E.g., do we allow non-green clusters to restart? Do we require replicas for all indices? (A cluster-health precheck sketch follows this list.)
  2. Do we want to run any checks between node or group restarts? E.g., do we want the cluster stabilized and green before proceeding to the next node or group?
  3. What are the failure scenarios in which we stop? What options will we have to roll back or recover from failed upgrades?
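
On the cluster-health question, a minimal Go sketch against the standard OpenSearch /_cluster/health endpoint; the endpoint and its status field are real OpenSearch API, while the base URL, auth, TLS, and operator client wiring are omitted assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// clusterIsGreen queries the OpenSearch health endpoint and reports
// whether the cluster status is "green". In the operator this would go
// through its authenticated cluster client rather than plain http.Get.
func clusterIsGreen(baseURL string) (bool, error) {
	resp, err := http.Get(baseURL + "/_cluster/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var health struct {
		Status string `json:"status"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, err
	}
	return health.Status == "green", nil
}

func main() {
	green, err := clusterIsGreen("http://localhost:9200") // placeholder URL
	if err != nil {
		fmt.Println("precheck failed:", err)
		return
	}
	fmt.Println("safe to start rolling restart:", green)
}
```

The same check could run between node or group restarts (point 2), waiting for green before the next deletion.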

@@ -0,0 +1,248 @@
# Rolling Restart Improvements for Multi-AZ Master Nodes
Collaborator


This file describes the issues with the old implementation and suggestions for the new design. Since it's going to be committed to docs/designs, it really needs to describe the new design without referencing the old way or any existing issues.
