feat: quorum-safe rolling restarts across nodepools #1141
Open
josedev-union wants to merge 1 commit into opensearch-project:main from data-ops-pulse:safe-rolling-update
+674 −61

# Rolling Restart Improvements for Multi-AZ Master Nodes

## Problem Statement

The OpenSearch Kubernetes operator had a critical flaw: during configuration changes it restarted all node pools simultaneously. This violates OpenSearch cluster quorum requirements and causes cluster outages.
### Issue Details

- **Issue [#650](https://github.com/opensearch-project/opensearch-k8s-operator/issues/650)**: Master nodes restart simultaneously across availability zones
- **Issue [#738](https://github.com/opensearch-project/opensearch-k8s-operator/issues/738)**: All data node pools restart at the same time
- **Root cause**: The operator treated each node pool independently during rolling restarts, without considering cluster-wide role distribution, quorum requirements, or proper sequencing
- **Impact**: Production cluster outages when nodes are spread across availability zones; cluster status turns red during updates
### Example Problematic Configuration

```yaml
nodePools:
  - component: master-a
    replicas: 1
    roles: ["cluster_manager"]
  - component: master-b
    replicas: 1
    roles: ["cluster_manager"]
  - component: master-c
    replicas: 1
    roles: ["cluster_manager"]
  - component: data-b
    replicas: 2
    roles: ["data"]
  - component: data-c
    replicas: 2
    roles: ["data"]
```
## Solution Overview

Implemented a **global candidate rolling restart strategy** that:

1. **Collects candidates across all StatefulSets** - Builds a global list of pods needing updates across all node types
2. **Applies intelligent candidate selection** - Prioritizes data nodes over master nodes, then sorts by StatefulSet name and highest ordinal
3. **Enforces master quorum preservation** - Restarts a master only when enough masters would remain ready to hold quorum (e.g. 2 of 3)
4. **Restarts one pod at a time** - Deletes at most one pod per reconciliation loop to maintain precise control
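
To show how these four steps compose, here is a condensed sketch of a single reconcile pass. The reconciler type, helper names, and signatures are assumptions based on the description above, not the PR's exact code; the helpers are sketched in more detail in the sections below.

```go
import (
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical single-pass orchestrator; helper shapes are illustrative.
func (r *RollingRestartReconciler) globalCandidateRollingRestart() (ctrl.Result, error) {
    // 1. Collect pods with pending updates across every StatefulSet.
    candidates := r.collectCandidates()

    // 2. Order them: data before masters, then StatefulSet name, then ordinal.
    sortCandidates(candidates)

    // 3. Check cluster-wide master quorum before touching any master.
    totalMasters, readyMasters := r.countMasters()
    for _, c := range candidates {
        if c.isMaster && readyMasters <= (totalMasters+1)/2 {
            continue // deleting this master could break quorum
        }
        // 4. Restart at most one pod per reconciliation loop.
        return r.restartSpecificPod(c)
    }
    // Nothing eligible this pass; re-evaluate on the next reconcile.
    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```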
## Implementation Details

### Global Candidate Collection

The new `globalCandidateRollingRestart()` function:

1. **Iterates through all node pools** to find StatefulSets with pending updates
2. **Identifies pods needing updates** by comparing `UpdateRevision` with pod labels
3. **Builds a global candidate list** across all StatefulSets and availability zones
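
A minimal sketch of that collection step follows, assuming a hypothetical `restartCandidate` type and an injected role check; the PR's actual types and helpers may differ.

```go
import (
    "strconv"
    "strings"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// restartCandidate is a hypothetical holder for one pod pending an update.
type restartCandidate struct {
    sts      *appsv1.StatefulSet
    pod      *corev1.Pod
    ordinal  int  // trailing index from the pod name, e.g. "master-a-0" -> 0
    isMaster bool // whether the owning pool is master-eligible
}

// collectCandidates builds the global candidate list across all StatefulSets.
func collectCandidates(stsList []*appsv1.StatefulSet, pods []*corev1.Pod,
    isMaster func(*appsv1.StatefulSet) bool) []restartCandidate {
    var out []restartCandidate
    for _, sts := range stsList {
        for _, pod := range pods {
            // A pod is pending when its controller-revision-hash label does
            // not match the StatefulSet's target UpdateRevision.
            if metav1.IsControlledBy(pod, sts) &&
                pod.Labels[appsv1.ControllerRevisionHashLabelKey] != sts.Status.UpdateRevision {
                out = append(out, restartCandidate{
                    sts:      sts,
                    pod:      pod,
                    ordinal:  ordinalOf(pod.Name),
                    isMaster: isMaster(sts),
                })
            }
        }
    }
    return out
}

// ordinalOf parses the StatefulSet ordinal from a pod name like "data-b-1".
func ordinalOf(name string) int {
    n, _ := strconv.Atoi(name[strings.LastIndex(name, "-")+1:])
    return n
}
```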
### Intelligent Candidate Selection

Candidates are sorted so that the restart order is deterministic and minimally disruptive:
```go
// Order candidates with a single comparator:
// 1. Data nodes before master nodes, to minimize cluster impact.
// 2. By StatefulSet name, for deterministic ordering across AZs.
// 3. Within a StatefulSet, highest ordinal first (drain from the top).
sort.Slice(candidates, func(i, j int) bool {
    if candidates[i].isMaster != candidates[j].isMaster {
        return !candidates[i].isMaster
    }
    if candidates[i].sts.Name != candidates[j].sts.Name {
        return candidates[i].sts.Name < candidates[j].sts.Name
    }
    return candidates[i].ordinal > candidates[j].ordinal
})
```

Keeping all three criteria in one comparator yields the complete order from a single `sort.Slice` call.
### Master Quorum Preservation

Before restarting any master node:
```go
// Calculate cluster-wide master quorum
totalMasters, readyMasters := r.countMasters()

// requiredMasters is a majority of totalMasters; a master is restarted
// only when strictly more than that majority are currently ready.
requiredMasters := (totalMasters + 1) / 2
if readyMasters <= requiredMasters {
    // Skip master restart to preserve quorum
    continue
}
```

For a three-master cluster, `requiredMasters` is 2, so a master restart proceeds only when all three masters are ready, leaving two (a majority) while one restarts.
### One-Pod-at-a-Time Restart

The operator now deletes only one pod per reconciliation loop:
```go
// Walk the sorted candidates and restart only the first eligible one
for _, candidate := range sortedCandidates {
    if r.isCandidateEligible(candidate) {
        return r.restartSpecificPod(candidate)
    }
}
```
### Key Functions

#### `globalCandidateRollingRestart()`
Main orchestrator function that:
- Collects all pods with pending updates across all StatefulSets and node types
- Applies intelligent candidate selection and sorting
- Enforces master quorum preservation
- Restarts one pod at a time per reconciliation loop
#### `restartSpecificPod()`
Handles the actual pod restart for a specific candidate:
- Performs the same prechecks as the original restart logic
- Deletes the specific pod to trigger a StatefulSet rolling update
- Returns an appropriate reconciliation result
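
A minimal sketch of that flow, assuming a controller-runtime client on the reconciler and a hypothetical `preRestartChecks` helper; names are illustrative, not the PR's exact code.

```go
import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

func (r *RollingRestartReconciler) restartSpecificPod(ctx context.Context,
    c restartCandidate) (ctrl.Result, error) {
    // Run the same prechecks as the original restart logic (cluster health,
    // pending shard relocations, etc.) before touching the pod.
    if err := r.preRestartChecks(ctx, c.sts); err != nil {
        return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
    }
    // Deleting the pod lets the StatefulSet controller recreate it from the
    // updated revision; this is what drives the rolling update.
    if err := r.client.Delete(ctx, c.pod); err != nil {
        return ctrl.Result{}, err
    }
    // Requeue so the next reconcile re-checks cluster state before
    // selecting another candidate.
    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```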
#### `countMasters()`
Calculates cluster-wide master node quorum:
- Counts total master nodes across all master-eligible node pools
- Counts ready master nodes across all master-eligible node pools
- Used for quorum preservation decisions
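
A sketch of the counting, assuming a hypothetical helper that lists the StatefulSets backing master-eligible pools:

```go
import appsv1 "k8s.io/api/apps/v1"

func (r *RollingRestartReconciler) countMasters() (total, ready int32) {
    for _, sts := range r.masterEligibleStatefulSets() {
        if sts.Spec.Replicas != nil {
            total += *sts.Spec.Replicas // desired master pods in this pool
        }
        ready += sts.Status.ReadyReplicas // masters currently ready
    }
    return
}
```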
#### `groupNodePoolsByRole()`
Groups node pools by role for analysis and logging:
- `dataOnly`: Node pools with only the data role
- `dataAndMaster`: Node pools with both data and master roles
- `masterOnly`: Node pools with only the master role
- `other`: Node pools with other roles (ingest, etc.)
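
As an illustration of the grouping, a minimal sketch assuming a simplified `NodePool` type; the operator's real type and the PR's `hasDataRole`/`hasManagerRole` helpers may differ in detail.

```go
// NodePool is a simplified stand-in for the operator's node pool spec.
type NodePool struct {
    Component string
    Roles     []string
}

func hasRole(p NodePool, roles ...string) bool {
    for _, r := range p.Roles {
        for _, want := range roles {
            if r == want {
                return true
            }
        }
    }
    return false
}

// poolGroups mirrors the four buckets described above.
type poolGroups struct {
    dataOnly, dataAndMaster, masterOnly, other []NodePool
}

func groupNodePoolsByRole(pools []NodePool) poolGroups {
    var g poolGroups
    for _, p := range pools {
        data := hasRole(p, "data")
        // "master" is the legacy alias for "cluster_manager" in OpenSearch.
        master := hasRole(p, "master", "cluster_manager")
        switch {
        case data && master:
            g.dataAndMaster = append(g.dataAndMaster, p)
        case data:
            g.dataOnly = append(g.dataOnly, p)
        case master:
            g.masterOnly = append(g.masterOnly, p)
        default:
            g.other = append(g.other, p) // ingest, coordinating, etc.
        }
    }
    return g
}
```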
## Benefits

### 1. **Comprehensive Cluster Stability**
- Prevents simultaneous restart of all master nodes
- Ensures data nodes restart before master nodes for optimal cluster health
- Maintains OpenSearch cluster quorum requirements
- Eliminates production outages during configuration changes

### 2. **Multi-AZ and Multi-Node-Type Support**
- Properly handles all node types (master, data, coordinating, ingest) across multiple AZs
- Works with node provisioners like Karpenter that create separate node pools per AZ
- Ensures consistent restart behavior regardless of node distribution

### 3. **Predictable and Controlled Behavior**
- Clear restart order: data nodes → coordinating nodes → master nodes
- One pod restart at a time for precise control
- Maintains cluster health and availability during updates

### 4. **Backward Compatibility**
- No changes to existing API or configuration
- Works with existing cluster configurations
## Testing

### Unit Tests
- `TestGroupNodePoolsByRole()` - Validates the role-based grouping logic used for analysis and logging
- `TestHasManagerRole()` / `TestHasDataRole()` - Validate the role-detection helpers used throughout the implementation
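
For illustration, a sketch of what such a unit test could look like against the simplified grouping sketch above; the PR's actual test exercises the operator's real types.

```go
import "testing"

func TestGroupNodePoolsByRole(t *testing.T) {
    pools := []NodePool{
        {Component: "master-a", Roles: []string{"cluster_manager"}},
        {Component: "data-b", Roles: []string{"data"}},
        {Component: "mixed", Roles: []string{"data", "cluster_manager"}},
        {Component: "ingest", Roles: []string{"ingest"}},
    }
    g := groupNodePoolsByRole(pools)
    if len(g.masterOnly) != 1 || len(g.dataOnly) != 1 ||
        len(g.dataAndMaster) != 1 || len(g.other) != 1 {
        t.Fatalf("unexpected grouping: %+v", g)
    }
}
```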
### Integration Testing
The implementation includes comprehensive test scenarios in `test-scenario-rolling-restart.md`:

1. **Intelligent Candidate Selection** - Verifies data nodes restart before masters
2. **Master Quorum Protection** - Ensures a master restart is blocked when too few masters are ready to preserve quorum
3. **Multi-AZ Distribution** - Tests rolling restart across multiple availability zones
4. **One-Pod-at-a-Time** - Confirms only one pod restarts per reconciliation loop
5. **All Node Types** - Validates proper restart order for data, coordinating, and master nodes

### Test Cluster Configuration
A multi-AZ test cluster is provided in `test-multi-az-cluster.yaml` with:
- 3 master node pools across different AZs
- 3 data node pools across different AZs
- 1 coordinating node pool
- Proper node selectors and tolerations for AZ distribution
## Migration Guide

### For Existing Clusters
No changes required. The new logic is applied automatically to existing clusters.

### For New Clusters
Continue using the same configuration format. The operator will automatically apply role-aware rolling restarts.

### Configuration Best Practices

1. **Master Node Distribution**
   ```yaml
   # Recommended: distribute masters across AZs
   nodePools:
     - component: master-az1
       replicas: 1
       roles: ["cluster_manager"]
     - component: master-az2
       replicas: 1
       roles: ["cluster_manager"]
     - component: master-az3
       replicas: 1
       roles: ["cluster_manager"]
   ```

2. **Data Node Configuration**
   ```yaml
   # Data nodes can be split into separate pools
   nodePools:
     - component: data-hot
       replicas: 3
       roles: ["data", "data_hot"]
     - component: data-warm
       replicas: 2
       roles: ["data", "data_warm"]
   ```
## Monitoring and Observability

### Events
The operator now emits more detailed events during rolling restarts:
- `"Starting rolling restart"` - when a restart begins
- `"Starting rolling restart of master node pool X"` - master-specific restarts
- `"Skipping restart of master node pool X: insufficient quorum"` - quorum preservation
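
For example, with a standard `record.EventRecorder` from client-go (the `recorder` field, `cluster` object, and `pool` variable here are assumptions), the quorum-skip event could be emitted like this, where `corev1` is `k8s.io/api/core/v1`:

```go
// Surface the quorum decision on the cluster resource so it
// shows up in `kubectl describe`.
r.recorder.Eventf(cluster, corev1.EventTypeWarning, "RollingRestartSkipped",
    "Skipping restart of master node pool %s: insufficient quorum", pool.Component)
```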
### Logs
Enhanced logging provides visibility into:
- Role-based grouping decisions
- Quorum calculations
- Restart priority decisions
- Cluster health checks
## Future Enhancements

### Potential Improvements
1. **Configurable Restart Policies** - Allow users to customize restart behavior
2. **Health Check Integration** - Use the OpenSearch health API for more sophisticated decisions
3. **Rollback Capabilities** - Automatic rollback if a restart causes issues
4. **Metrics Integration** - Expose restart metrics for monitoring

### Configuration Options
Future versions could support:
```yaml
spec:
  rollingRestart:
    policy: "role-aware"        # or "legacy"
    masterQuorumThreshold: 0.5  # custom quorum threshold
    maxConcurrentRestarts: 1    # limit concurrent restarts
```
## Conclusion

This implementation resolves the critical issue of simultaneous master node restarts while maintaining backward compatibility and improving overall cluster stability. The role-aware approach keeps OpenSearch clusters available during configuration changes, especially in multi-availability-zone deployments.
This file describes the issues with the old implementation and suggestions for the new design. Since it's going to be written to docs/designs, it really needs to describe the new design without referencing the old way or any existing issues.