Commit 849a99d

Update proposal to reflect changes decided in meeting
1 parent 9462118 commit 849a99d

File tree

1 file changed

+57
-78
lines changed

docs/proposals/machine-preservation.md

Lines changed: 57 additions & 78 deletions
@@ -15,140 +15,119 @@
1515
Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout`, to the `Failed` phase.
1616
`Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SREs, operators, and support from analysing the VM and makes finding the root cause of a failure more difficult.
1717

18-
Moreover, in cases where a node seems healthy but all the workload on it are facing issues, there is a need for operators to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler scaling down the node.
18+
Moreover, in cases where a node seems healthy but all the workloads on it are facing issues, operators need to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler (CA) scaling down the node.
1919

2020
This document proposes enhancing MCM, such that:
2121
* VMs of machines are retained temporarily for analysis
22-
* There is a configurable limit to the number of machines that can be preserved
23-
* There is a configurable limit to the duration for which such machines are preserved
24-
* Users can specify which healthy machines they would like to preserve in case of failure
22+
* There is a configurable limit to the number of machines that can be preserved automatically on failure (auto-preservation)
23+
* There is a configurable limit to the duration for which machines are preserved
24+
* Users can specify which healthy machines they would like to preserve in case of failure, or for diagnosis in their current state (preventing scale-down by CA)
2525
* Users can request MCM to release a preserved machine, even before the timeout expires, so that MCM can transition the machine to either `Running` or `Terminating` phase, as the case may be.
2626

27+
Related Issue: https://github.com/gardener/machine-controller-manager/issues/1008
28+
2729
## Proposal
2830

2931
In order to achieve the objectives mentioned, the following are proposed:
30-
1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be preserved,
32+
1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be auto-preserved,
3133
and the time duration for which these machines will be preserved.
3234
```
3335
machineControllerManager:
34-
machinePreserveMax: 1
36+
autoPreserveFailedMax: 0
3537
machinePreserveTimeout: 72h
3638
```
3739
* This configuration will be set per worker pool.
3840
* Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `machinePreserveMax` will be distributed across N machine deployments.
3941
* `machinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
4042
* Example: if `machinePreserveMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
41-
2. MCM will be modified to include a new phase `Preserved` to indicate that the machine has been preserved by MCM.
42-
3. Allow user/operator to request for preservation of a specific machine/node with the use of annotations : `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
43-
4. When annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
43+
2. MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM.
44+
3. Allow a user/operator to request preservation of a specific machine/node using the annotation `node.machine.sapcloud.io/preserve=true`.
45+
4. When annotation `node.machine.sapcloud.io/preserve=true` is added to a `Running` machine, the following will take place:
4446
- `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
4547
- `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$
46-
- The machine stage is changed to `Preserved`
47-
- After timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, the machine phase is changed to `Running` and the CA may delete the node. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`.
48-
- Number of machines explicitly annotated will count towards enforcing `machinePreserveMax`. On breach, the annotation will be rejected.
49-
5. When annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
50-
- The machine phase is changed to `Preserved`.
51-
- Pods (other than daemonset pods) are drained.
52-
- `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$
53-
- After timeout, the `node.machine.sapcloud.io/preserve=when-failed` is deleted. The phase is changed to `Terminating`.
54-
- Number of machines explicitly annotated will count towards enforcing `machinePreserveMax`. On breach, the annotation will be rejected.
55-
6. When an un-annotated machine goes to `Failed` phase and the $count(machinesAnnotatedForPreservation)+count(AutoPreservedMachines)<machinePreserveMax$
56-
- The machine's phase is changed to `Preserved`.
48+
- The machine's phase is changed to `Running:Preserved`
49+
- After timeout, the `node.machine.sapcloud.io/preserve=true` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, the machine phase is changed to `Running` and the CA may delete the node. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`.
50+
5. When an un-annotated machine goes to `Failed` phase and `autoPreserveFailedMax` is not breached:
5751
- Pods (other than DaemonSet pods) are drained.
52+
- The machine's phase is changed to `Failed:Preserved`.
5853
- `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$
5954
- After timeout, the phase is changed to `Terminating`.
60-
- Number of machines in `Preserved` phase count towards enforcing `machinePreserveMax`.
61-
- In the rest of the doc, the preservation of such un-annotated failed machines is referred to as **"auto-preservation"**.
62-
7. If a `Failed` machine is currently in `Preserved` and after timeout its VM/node is found to be Healthy, the machine will be moved to `Running`.
63-
8. A user/operator can request MCM to stop preserving a machine/node in `Preserved` stage using the annotation: `node.machine.sapcloud.io/preserve=false`.
64-
* For a machine thus annotated, MCM will move it either to `Running` phase or `Terminating` depending on the phase of the machine before it was moved to `Preserved`.
65-
9. Machines of a MachineDeployment in `Preserved` stage will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment.
66-
10. At any point in time $count(machinesAnnotatedForPreservation)+count(PreservedMachines)<=machinePreserveMax$.
55+
- Number of machines in `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMax`.
56+
6. If a failed machine is currently in `Failed:Preserved` and after timeout its VM/node is found to be Healthy, the machine will be moved to `Running`.
57+
7. A user/operator can request MCM to stop preserving a machine/node in `Running:Preserved` or `Failed:Preserved` phase using the annotation: `node.machine.sapcloud.io/preserve=false`.
58+
* MCM will move a machine thus annotated either to `Running` phase or `Terminating` depending on the phase of the machine before it was preserved.
59+
8. Machines of a MachineDeployment in `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment.
60+
9. MCM will be modified to perform drain in `Failed` phase rather than `Terminating`.
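
The limit and expiry arithmetic in items 1, 4 and 5 can be sketched in Go as follows. This is only an illustration: the function names (`perZoneLimit`, `preserveExpiry`, `canAutoPreserve`) are hypothetical and not part of MCM. The even split by integer division is an assumption, since the proposal does not say how a remainder is handled, and "not breached" is read here as "current count is strictly below the limit".

```go
package main

import (
	"fmt"
	"time"
)

// perZoneLimit splits a worker-pool limit across its zones (MachineDeployments).
// Assumption: integer division; any remainder is simply dropped.
func perZoneLimit(poolMax, zones int) int {
	if zones <= 0 {
		return 0
	}
	return poolMax / zones
}

// preserveExpiry mirrors PreserveExpiryTime = currentTime + machinePreserveTimeout.
func preserveExpiry(now time.Time, timeout time.Duration) time.Time {
	return now.Add(timeout)
}

// canAutoPreserve reports whether one more Failed machine may be auto-preserved.
// Assumption: the limit is "not breached" while the current count is below it.
func canAutoPreserve(currentlyPreserved, limit int) bool {
	return currentlyPreserved < limit
}

func main() {
	now := time.Now()
	fmt.Println("per-zone limit:", perZoneLimit(2, 2))
	fmt.Println("expires at:", preserveExpiry(now, 72*time.Hour).Format(time.RFC3339))
	fmt.Println("auto-preserve allowed:", canAutoPreserve(0, 1))
}
```

Under this reading, a limit of `0` (the value shown in the configuration example) would mean that no failed machine is auto-preserved.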
6761

6862
## State Diagrams:
6963

70-
1. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=now`:
64+
1. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=true`:
7165
```mermaid
7266
stateDiagram-v2
7367
direction TB
7468
state "Running" as R
75-
state "Preserved" as P
69+
state "Running:Preserved" as RP
7670
[*]-->R
77-
R --> P: annotated with value=now && max not breached
78-
P --> R: annotated with value=false or timeout occurs
71+
R --> RP: annotated with preserve=true
72+
RP --> R: annotated with preserve=false or timeout occurs
7973
```
8074

81-
2. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=when-failed`:
82-
```mermaid
83-
stateDiagram-v2
84-
state "Running" as R
85-
state "Running + Requested" as RR
86-
state "Failed" as F
87-
state "Preserved
88-
(node drained)" as P
89-
state "Terminating" as T
90-
[*]-->R
91-
R --> RR: annotated with value=when-failed && max not breached
92-
RR --> F: on failure
93-
F --> P
94-
P --> T: on timeout or value=false
95-
P --> R: if node Healthy before timeout
96-
T --> [*]
97-
```
98-
99-
3. State Diagram for when an un-annotated `Running` machine fails:
75+
2. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation):
10076
```mermaid
10177
stateDiagram-v2
10278
direction TB
10379
state "Running" as R
104-
state "Failed" as F
105-
state "Preserved" as P
80+
state "Failed
81+
(node drained)" as F
82+
state "Failed:Preserved" as FP
10683
state "Terminating" as T
10784
[*] --> R
10885
R-->F: on failure
109-
F --> P: if max not breached
110-
F --> T: if max breached
111-
P --> T: on timeout or value=false
112-
P --> R : if node Healthy before timeout
86+
F --> FP: if autoPreserveFailedMax not breached
87+
F --> T: if autoPreserveFailedMax breached
88+
FP --> T: on timeout or value=false
89+
FP --> R : if node Healthy before timeout
11390
T --> [*]
11491
```
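
The two diagrams above can also be read as a small transition function. The Go sketch below is a hedged illustration only: the `Phase` and `Event` types, the constant names, and the functions `next` and `resolveFailed` are hypothetical and do not reflect MCM's actual types. Transitions that the diagrams do not show are reported as undefined rather than guessed.

```go
package main

import "fmt"

type Phase string

const (
	Running          Phase = "Running"
	RunningPreserved Phase = "Running:Preserved"
	Failed           Phase = "Failed"
	FailedPreserved  Phase = "Failed:Preserved"
	Terminating      Phase = "Terminating"
)

type Event string

const (
	AnnotatedTrue  Event = "preserve=true"
	AnnotatedFalse Event = "preserve=false"
	Timeout        Event = "timeout"
	Failure        Event = "failure"
	NodeHealthy    Event = "node healthy"
)

// next returns the phase reached from p on event e.
// The boolean is false when the diagrams define no such transition.
func next(p Phase, e Event) (Phase, bool) {
	switch p {
	case Running:
		switch e {
		case AnnotatedTrue:
			return RunningPreserved, true
		case Failure:
			return Failed, true
		}
	case RunningPreserved:
		if e == AnnotatedFalse || e == Timeout {
			return Running, true
		}
	case FailedPreserved:
		switch e {
		case AnnotatedFalse, Timeout:
			return Terminating, true
		case NodeHealthy:
			return Running, true
		}
	}
	return p, false
}

// resolveFailed picks the successor of Failed (after drain) based on
// whether autoPreserveFailedMax is breached.
func resolveFailed(maxBreached bool) Phase {
	if maxBreached {
		return Terminating
	}
	return FailedPreserved
}

func main() {
	p, _ := next(Running, Failure)
	fmt.Println(p, "->", resolveFailed(false))
	p, ok := next(FailedPreserved, NodeHealthy)
	fmt.Println(p, ok)
}
```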
11592

116-
11793
## Use Cases:
11894

11995
### Use Case 1: Proactive Preservation Request
12096
**Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis.
12197
#### Steps:
122-
1. Operator annotates node with `node.machine.sapcloud.io/preserve=when-failed`, provided `machinePreserveMax` is not violated
123-
2. Machine fails later
124-
3. MCM preserves the machine
125-
4. Operator analyzes the failed VM
98+
1. Operator annotates node with `node.machine.sapcloud.io/preserve=true`
99+
2. MCM preserves the machine, and prevents CA from scaling it down
100+
3. Operator analyzes the VM
126101

127-
### Use Case 2: Automatic Preservation
102+
### Use Case 2: Auto-Preservation
128103
**Scenario:** Machine fails unexpectedly, no prior annotation.
129104
#### Steps:
130105
1. Machine transitions to `Failed` phase
131-
2. If `machinePreserveMax` is not breached, machine moved to `Preserved` phase by MCM
132-
3. After `machinePreserveTimeout`, machine is terminated by MCM
133-
134-
### Use Case 3: Preservation Request for Analysing Running Machine
135-
**Scenario:** Workload on machine failing. Operator wishes to diagnose.
136-
#### Steps:
137-
1. Operator annotates node with `node.machine.sapcloud.io/preserve=now`, provided `machinePreserveMax` is not violated
138-
2. MCM preserves machine and prevents CA from scaling it down
139-
3. Operator analyzes the machine
106+
2. Machine is drained
107+
3. If `autoPreserveFailedMax` is not breached, machine moved to `Failed:Preserved` phase by MCM
108+
4. After `machinePreserveTimeout`, machine is terminated by MCM
140109

141-
### Use Case 4: Early Release
110+
### Use Case 3: Early Release
142111
**Scenario:** The operator has completed their analysis and no longer requires the machine to be preserved.
143112
#### Steps:
144-
1. Machine is in `Preserved` phase
113+
1. Machine is in `Running:Preserved` or `Failed:Preserved` phase
145114
2. Operator adds: `node.machine.sapcloud.io/preserve=false` to node.
146-
3. MCM transitions machine to `Running` or `Terminating`, depending on which phase it was in before moving to `Preserved`, even though `machinePreserveTimeout` has not expired
147-
4. Capacity becomes available for preserving future annotated machines or for auto-preservation of `Failed` machines.
115+
3. MCM transitions machine to `Running` or `Terminating`, for `Running:Preserved` or `Failed:Preserved` respectively, even though `machinePreserveTimeout` has not expired
116+
4. If machine was in `Failed:Preserved`, capacity becomes available for auto-preservation.
148117

149118

150-
## Limitations
119+
## Points to Note
151120

152121
1. During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to Failed phase.
153-
2. Since gardener worker pool can correspond to 1..N MachineDeployments depending on number of zones, we will need to distribute the `machinePreserveMax` across N machine deployments.
154-
So, even if there are no failed machines preserved in other zones, the max per zone would still be enforced. Hence, the value of `machinePreserveMax` should be chosen appropriately.
122+
2. Hibernation policy would override machine preservation.
123+
3. If Machine and Node annotation values differ for a particular annotation key (including `node.machine.sapcloud.io/preserve=true`), the Node annotation value will override the Machine annotation value.
124+
4. If `autoPreserveFailedMax` is reduced in the Shoot Spec, older machines are moved to `Terminating` phase before newer ones.
125+
5. If a MachineDeployment's replica count is scaled down, `Preserved` machines will be the last to be scaled down. The replica count will always be honoured.
126+
6. If the value for annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM.
127+
7. If the timeout is increased or decreased, the new value applies only to machines that enter the `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
128+
- can specify timeout
129+
8. [Modify CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once feature is developed to use `node.machine.sapcloud.io/preserve=true` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` currently suggested. This would:
130+
- harmonise machine flow
131+
- shield from CA's internals
132+
- make it generic and no longer CA specific
133+
- allow a timeout to be specified
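
Points 3 and 4 could be sketched in Go roughly as below. This is an illustration under stated assumptions, not MCM code: `effectivePreserve`, `toRelease`, and the `preserved` struct are hypothetical names. The sketch assumes that the Machine annotation is used when the Node carries none, and that "older" is judged by when preservation began (stored here as a Unix timestamp); the proposal does not spell out either detail.

```go
package main

import (
	"fmt"
	"sort"
)

const preserveKey = "node.machine.sapcloud.io/preserve"

// effectivePreserve returns the preserve value to act on.
// The Node annotation wins when both are set; falling back to the
// Machine annotation when the Node has none is an assumption.
func effectivePreserve(machineAnn, nodeAnn map[string]string) (string, bool) {
	if v, ok := nodeAnn[preserveKey]; ok {
		return v, true
	}
	v, ok := machineAnn[preserveKey]
	return v, ok
}

// preserved is a hypothetical record of an auto-preserved machine.
type preserved struct {
	Name      string
	SinceUnix int64 // assumption: age is measured from the start of preservation
}

// toRelease names the machines to move to Terminating when the limit
// drops to newMax, oldest first.
func toRelease(ms []preserved, newMax int) []string {
	if newMax < 0 {
		newMax = 0
	}
	excess := len(ms) - newMax
	if excess <= 0 {
		return nil
	}
	sorted := append([]preserved(nil), ms...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].SinceUnix < sorted[j].SinceUnix
	})
	names := make([]string, 0, excess)
	for _, m := range sorted[:excess] {
		names = append(names, m.Name)
	}
	return names
}

func main() {
	v, ok := effectivePreserve(
		map[string]string{preserveKey: "true"},
		map[string]string{preserveKey: "false"},
	)
	fmt.Println(v, ok)
	fmt.Println(toRelease([]preserved{{"a", 30}, {"b", 10}, {"c", 20}}, 1))
}
```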
