Add proposal for temporary preservation of machines #1031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open

thiyyakat wants to merge 11 commits into gardener:master from thiyyakat:proposal/failed-machine-preserve
+161 −0
Commits (11):
- ed48970 Add proposal for preservation of failed machines (thiyyakat)
- 50e6fd1 Add limitations (thiyyakat)
- 454422c Address review comments (thiyyakat)
- 961692a Change mermaid layout from elk to default for github support (thiyyakat)
- fc10934 Improve clarity (thiyyakat)
- 309527f Change proposal as per discussions (thiyyakat)
- 078c710 Fix limitations (thiyyakat)
- aa2ae8f Add state diagrams (thiyyakat)
- 9462118 Rename file and proposal (thiyyakat)
- 849a99d Update proposal to reflect changes decided in meeting (thiyyakat)
- 227b3cd Modify proposal to support use case for `preserve=when-failed` (thiyyakat)
# Preservation of Machines

<!-- TOC -->

- [Preservation of Machines](#preservation-of-machines)
  - [Objective](#objective)
  - [Proposal](#proposal)
  - [State Diagrams](#state-diagrams)
  - [Use Cases](#use-cases)

<!-- /TOC -->

## Objective

Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase and, after the configured `machineHealthTimeout`, to the `Failed` phase.
`Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SREs/operators/support from conducting an analysis on the VM and makes finding the root cause of the failure more difficult.

Moreover, in cases where a node seems healthy but all the workloads on it are facing issues, operators need to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler (CA) scaling down the node.

This document proposes enhancing MCM such that:
* VMs of machines are retained temporarily for analysis
* There is a configurable limit to the number of machines that can be preserved automatically on failure (auto-preservation)
* There is a configurable limit to the duration for which machines are preserved
* Users can specify which healthy machines they would like to preserve in case of failure, or for diagnosis in their current state (preventing scale-down by the CA)
* Users can request MCM to release a preserved machine even before the timeout expires, so that MCM can transition the machine to either the `Running` or the `Terminating` phase, as the case may be

Related Issue: https://github.com/gardener/machine-controller-manager/issues/1008
## Proposal

To achieve the objectives mentioned above, the following are proposed:
1. Enhance the `machineControllerManager` configuration in the Shoot spec to specify the maximum number of machines to be auto-preserved,
and the duration for which these machines will be preserved (see the worker-pool sketch after this list).
```
machineControllerManager:
  autoPreserveFailedMax: 0
  machinePreserveTimeout: 72h
```
* This configuration will be set per worker pool.
* Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `autoPreserveFailedMax` will be distributed across the N MachineDeployments.
* `autoPreserveFailedMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
* Example: if `autoPreserveFailedMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
2. MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM.
3. Allow a user/operator to request preservation of a specific machine/node using the annotations `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
4. When the annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
    - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout$
    - The machine's phase is changed to `Running:Preserved`.
    - After the timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` annotations are deleted, the machine phase is changed back to `Running`, and the CA may delete the node. `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`.
5. When the annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
    - The machine is drained of pods, except for DaemonSet pods.
    - The machine's phase is changed to `Failed:Preserved`.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout$
    - After the timeout, the `node.machine.sapcloud.io/preserve=when-failed` annotation is deleted and the phase is changed to `Terminating`.
6. When an un-annotated machine goes to the `Failed` phase and `autoPreserveFailedMax` is not breached:
    - Pods (other than DaemonSet pods) are drained.
    - The machine's phase is changed to `Failed:Preserved`.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout$
    - After the timeout, the phase is changed to `Terminating`.
    - Machines in the `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMax`.
7. If a machine is currently in `Failed:Preserved` and its VM/node is found to be healthy before the timeout, the machine will be moved back to `Running`.
8. A user/operator can request MCM to stop preserving a machine/node in the `Running:Preserved` or `Failed:Preserved` phase using the annotation `node.machine.sapcloud.io/preserve=false`.
    * MCM will move a machine thus annotated either to the `Running` phase or to `Terminating`, depending on the phase of the machine before it was preserved.
9. Machines of a MachineDeployment in the `Preserved` sub-phase will also be counted towards the replica count and towards the enforcement of the maximum number of machines allowed for the MachineDeployment.
10. MCM will be modified to perform the drain in the `Failed` phase rather than in `Terminating`.
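Putting the snippet from (1) into a fuller worker-pool context, a Shoot spec excerpt could look roughly as follows. This is only a sketch: the surrounding fields mirror Gardener's existing worker-pool and `machineControllerManager` settings, while `autoPreserveFailedMax` and `machinePreserveTimeout` are the fields proposed here, and their exact placement is an assumption rather than a finalized API.

```yaml
# Sketch of a Shoot worker pool with the proposed preservation settings (illustrative values).
spec:
  provider:
    workers:
      - name: worker-a
        minimum: 2
        maximum: 6
        zones:
          - europe-west1-b
          - europe-west1-c
        machineControllerManager:
          machineHealthTimeout: 10m     # existing setting: Unknown -> Failed after this timeout
          autoPreserveFailedMax: 2      # proposed: at most 2 machines auto-preserved on failure
          machinePreserveTimeout: 72h   # proposed: preserved machines are released after 72h
```

With two zones, and therefore two MachineDeployments, an `autoPreserveFailedMax` of 2 would allow at most one auto-preserved machine per zone.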
## State Diagrams:

1. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=now`:
```mermaid
stateDiagram-v2
    direction TB
    state "Running" as R
    state "Running:Preserved" as RP
    [*] --> R
    R --> RP: annotated with preserve=now
    RP --> R: annotated with preserve=false or timeout occurs
```
2. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=when-failed`:
```mermaid
stateDiagram-v2
    state "Running" as R
    state "Running + Requested" as RR
    state "Failed (node drained)" as F
    state "Failed:Preserved" as P
    state "Terminating" as T
    [*] --> R
    R --> RR: annotated with preserve=when-failed
    RR --> F: on failure
    F --> P
    P --> T: on timeout or preserve=false
    P --> R: if node Healthy before timeout
    T --> [*]
```
3. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation):
```mermaid
stateDiagram-v2
    direction TB
    state "Running" as R
    state "Failed (node drained)" as F
    state "Failed:Preserved" as FP
    state "Terminating" as T
    [*] --> R
    R --> F: on failure
    F --> FP: if autoPreserveFailedMax not breached
    F --> T: if autoPreserveFailedMax breached
    FP --> T: on timeout or preserve=false
    FP --> R: if node Healthy before timeout
    T --> [*]
```
## Use Cases:

### Use Case 1: Preservation Request for Analysing a Running Machine
**Scenario:** The workload on a machine is failing. The operator wishes to diagnose the machine.
#### Steps:
1. Operator annotates the node with `node.machine.sapcloud.io/preserve=now` (see the example below)
2. MCM preserves the machine and prevents the CA from scaling it down
3. Operator analyzes the VM
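For illustration, the node's metadata after steps 1 and 2 might look as follows; the node name is hypothetical, and the annotations are the ones named in this proposal.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: shoot--proj--cluster-worker-a-z1-abc12                     # hypothetical node name
  annotations:
    node.machine.sapcloud.io/preserve: "now"                       # step 1: added by the operator
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"   # step 2: added by MCM to block CA scale-down
```

MCM additionally moves the backing machine to `Running:Preserved` and sets `machine.CurrentStatus.PreserveExpiryTime`.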
### Use Case 2: Proactive Preservation Request
**Scenario:** The operator suspects a machine might fail and wants to ensure it is preserved for analysis.
#### Steps:
1. Operator annotates the node with `node.machine.sapcloud.io/preserve=when-failed`
2. The machine fails later
3. MCM preserves the machine
4. Operator analyzes the VM
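A sketch of the preserved machine's status after step 3. How the `Preserved` sub-phase and `PreserveExpiryTime` are serialized is not fixed by this proposal; the combined phase string and field name below are assumptions for illustration only.

```yaml
status:
  currentStatus:
    phase: "Failed:Preserved"                    # assumed encoding of phase + Preserved sub-phase
    preserveExpiryTime: "2025-01-10T09:00:00Z"   # set by MCM to currentTime + machinePreserveTimeout
```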
### Use Case 3: Auto-Preservation
**Scenario:** A machine fails unexpectedly, with no prior annotation.
#### Steps:
1. The machine transitions to the `Failed` phase
2. The machine is drained
3. If `autoPreserveFailedMax` is not breached, the machine is moved to the `Failed:Preserved` phase by MCM
4. After `machinePreserveTimeout`, the machine is terminated by MCM

### Use Case 4: Early Release
**Scenario:** The operator has completed the analysis and no longer requires the machine to be preserved.
#### Steps:
1. The machine is in the `Running:Preserved` or `Failed:Preserved` phase
2. Operator adds `node.machine.sapcloud.io/preserve=false` to the node (see the sketch below)
3. MCM transitions the machine to `Running` or `Terminating` (for `Running:Preserved` or `Failed:Preserved`, respectively), even though `machinePreserveTimeout` has not expired
4. If the machine was in `Failed:Preserved`, capacity becomes available for auto-preservation
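A minimal sketch of step 2, assuming the operator edits the node object directly; the annotation key and value are taken from this proposal.

```yaml
metadata:
  annotations:
    node.machine.sapcloud.io/preserve: "false"   # request early release; MCM then moves the machine to Running or Terminating
```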
## Points to Note

1. During rolling updates, MCM will NOT honor machine preservation. A machine will be replaced with a healthy one if it moves to the `Failed` phase.
2. The hibernation policy will override machine preservation.
3. If the Machine and Node annotation values differ for a particular annotation key, the Node annotation value will override the Machine annotation value.
4. If `autoPreserveFailedMax` is reduced in the Shoot spec, older machines are moved to the `Terminating` phase before newer ones.
5. In case of a scale-down of a MachineDeployment's replica count, `Preserved` machines will be the last to be scaled down. The replica count will always be honoured.
6. If the value of the annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM.
7. On an increase/decrease of the timeout, the new value will only apply to machines that enter the `Preserved` sub-phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
8. [Modify the CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once the feature is developed, to recommend `node.machine.sapcloud.io/preserve=now` instead of the currently suggested `cluster-autoscaler.kubernetes.io/scale-down-disabled=true`. This would:
    - harmonise the machine flow
    - shield users from CA internals
    - make the mechanism generic and no longer CA-specific
    - allow a timeout to be specified
TLDR: I would not restrict this feature to failing nodes.
E.g., it can happen that we detect problems with essentially all pods on a node, but the node does not report any condition failures (i.e. the node/machine will not be in the `Failed` state).
From an SRE perspective, we want to be as available as possible. Thus, in these kinds of cases, we would cordon the node, drain the pods (except DaemonSets), and start investigating. Furthermore, since expertise is spread around the globe, we sometimes need to keep a node in a cordoned state for 24-28 hours in order to investigate the root cause with the right area's expert. However, if a node is cordoned with no workload on it, it has a very high chance of being scheduled for scale-down by the CA first.
Thus, this feature should also work for non-failing nodes in order to cover all cases.
We will extend the feature to preserve non-failing nodes as well. Will update the proposal.
I feel we should not generalize to all non-failing nodes, as it will introduce more complexity to the state management. We should only enable this for cases where the node is cordoned and drained. Also, I'm guessing we need some changes in the CA's handling of unneeded machines as well, which I don't see mentioned in this proposal.