|
15 | 15 | Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout`, to the `Failed` phase.
|
16 | 16 | `Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SRE/operators/support from analysing the VM and makes finding the root cause of the failure more difficult.
|
17 | 17 |
|
18 |
| -Moreover, in cases where a node seems healthy but all the workload on it are facing issues, there is a need for operators to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler scaling down the node. |
| 18 | +Moreover, in cases where a node seems healthy but all the workloads on it are facing issues, operators need to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler (CA) scaling down the node. |
19 | 19 |
|
20 | 20 | This document proposes enhancing MCM, such that:
|
21 | 21 | * VMs of machines are retained temporarily for analysis
|
22 |
| -* There is a configurable limit to the number of machines that can be preserved |
23 |
| -* There is a configurable limit to the duration for which such machines are preserved |
24 |
| -* Users can specify which healthy machines they would like to preserve in case of failure |
| 22 | +* There is a configurable limit to the number of machines that can be preserved automatically on failure (auto-preservation) |
| 23 | +* There is a configurable limit to the duration for which machines are preserved |
| 24 | +* Users can specify which healthy machines they would like to preserve in case of failure, or for diagnosis in their current state (preventing scale-down by CA) |
25 | 25 | * Users can request MCM to release a preserved machine, even before the timeout expires, so that MCM can transition the machine to either `Running` or `Terminating` phase, as the case may be.
|
26 | 26 |
|
| 27 | +Related Issue: https://github.com/gardener/machine-controller-manager/issues/1008 |
| 28 | + |
27 | 29 | ## Proposal
|
28 | 30 |
|
29 | 31 | In order to achieve the objectives mentioned, the following are proposed:
|
30 |
| -1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be preserved, |
| 32 | +1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be auto-preserved, |
31 | 33 | and the time duration for which these machines will be preserved.
|
32 | 34 | ```
|
33 | 35 | machineControllerManager:
|
34 |
| - machinePreserveMax: 1 |
| 36 | + autoPreserveFailedMax: 0 |
35 | 37 | machinePreserveTimeout: 72h
|
36 | 38 | ```
|
37 | 39 | * This configuration will be set per worker pool.
|
38 | 40 | * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `machinePreserveMax` will be distributed across N machine deployments.
|
39 | 41 | * `machinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
|
40 | 42 | * Example: if `machinePreserveMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
|
41 |
| -2. MCM will be modified to include a new phase `Preserved` to indicate that the machine has been preserved by MCM. |
42 |
| -3. Allow user/operator to request for preservation of a specific machine/node with the use of annotations : `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`. |
43 |
| -4. When annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place: |
| 43 | +2. MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM. |
| 44 | +3. Allow a user/operator to request preservation of a specific machine/node using the annotation `node.machine.sapcloud.io/preserve=true`. |
| 45 | +4. When annotation `node.machine.sapcloud.io/preserve=true` is added to a `Running` machine, the following will take place: |
44 | 46 | - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
|
45 | 47 | - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$
|
46 |
| - - The machine stage is changed to `Preserved` |
47 |
| - - After timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, the machine phase is changed to `Running` and the CA may delete the node. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. |
48 |
| - - Number of machines explicitly annotated will count towards enforcing `machinePreserveMax`. On breach, the annotation will be rejected. |
49 |
| -5. When annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place: |
50 |
| - - The machine phase is changed to `Preserved`. |
51 |
| - - Pods (other than daemonset pods) are drained. |
52 |
| - - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$ |
53 |
| - - After timeout, the `node.machine.sapcloud.io/preserve=when-failed` is deleted. The phase is changed to `Terminating`. |
54 |
| - - Number of machines explicitly annotated will count towards enforcing `machinePreserveMax`. On breach, the annotation will be rejected. |
55 |
| -6. When an un-annotated machine goes to `Failed` phase and the $count(machinesAnnotatedForPreservation)+count(AutoPreservedMachines)<machinePreserveMax$ |
56 |
| - - The machine's phase is changed to `Preserved`. |
| 48 | + - The machine's phase is changed to `Running:Preserved` |
| 49 | + - After timeout, the `node.machine.sapcloud.io/preserve=true` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, the machine phase is changed to `Running` and the CA may delete the node. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. |
| 50 | +5. When an un-annotated machine goes to `Failed` phase and `autoPreserveFailedMax` is not breached: |
57 | 51 | - Pods (other than DaemonSet pods) are drained.
|
| 52 | + - The machine's phase is changed to `Failed:Preserved`. |
58 | 53 | - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$
|
59 | 54 | - After timeout, the phase is changed to `Terminating`.
|
60 |
| - - Number of machines in `Preserved` phase count towards enforcing `machinePreserveMax`. |
61 |
| - - In the rest of the doc, the preservation of such un-annotated failed machines is referred to as **"auto-preservation"**. |
62 |
| -7. If a `Failed` machine is currently in `Preserved` and after timeout its VM/node is found to be Healthy, the machine will be moved to `Running`. |
63 |
| -8. A user/operator can request MCM to stop preserving a machine/node in `Preserved` stage using the annotation: `node.machine.sapcloud.io/preserve=false`. |
64 |
| - * For a machine thus annotated, MCM will move it either to `Running` phase or `Terminating` depending on the phase of the machine before it was moved to `Preserved`. |
65 |
| -9. Machines of a MachineDeployment in `Preserved` stage will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment. |
66 |
| -10. At any point in time $count(machinesAnnotatedForPreservation)+count(PreservedMachines)<=machinePreserveMax$. |
| 55 | + - The number of machines in `Failed:Preserved` phase counts towards enforcing `autoPreserveFailedMax`. |
| 56 | +6. If a failed machine is currently in `Failed:Preserved` and after timeout its VM/node is found to be Healthy, the machine will be moved to `Running`. |
| 57 | +7. A user/operator can request MCM to stop preserving a machine/node in `Running:Preserved` or `Failed:Preserved` phase using the annotation: `node.machine.sapcloud.io/preserve=false`. |
| 58 | + * MCM will move a machine thus annotated either to `Running` phase or `Terminating` depending on the phase of the machine before it was preserved. |
| 59 | +8. Machines of a MachineDeployment in `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment. |
| 60 | +9. MCM will be modified to perform drain in `Failed` phase rather than `Terminating`. |
67 | 61 |
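
Steps 4 and 5 above can be read as a small decision procedure. The following Python sketch is illustrative only: MCM itself is written in Go, and the `Machine` class, its fields and both function names are hypothetical, not taken from the MCM codebase. It assumes that "not breached" means the number of machines already in `Failed:Preserved` is strictly below `autoPreserveFailedMax`, and that the drain has already happened by the time the failure is handled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Machine:
    # Hypothetical stand-in for the Machine object; field names are illustrative.
    name: str
    phase: str  # e.g. "Running", "Failed", "Running:Preserved", "Failed:Preserved"
    preserve_expiry_time: Optional[datetime] = None


def preserve_running(machine: Machine, now: datetime, preserve_timeout: timedelta) -> Machine:
    """Step 4: a Running machine has been annotated with preserve=true."""
    if machine.phase != "Running":
        return machine  # this sketch only covers the Running case
    machine.phase = "Running:Preserved"
    machine.preserve_expiry_time = now + preserve_timeout
    return machine


def handle_failure(
    machine: Machine,
    now: datetime,
    preserve_timeout: timedelta,
    preserved_failed_count: int,
    auto_preserve_failed_max: int,
) -> Machine:
    """Step 5: an un-annotated machine has reached Failed (drain assumed done)."""
    # Assumption: the limit is "not breached" while the count is strictly below it.
    if preserved_failed_count < auto_preserve_failed_max:
        machine.phase = "Failed:Preserved"
        machine.preserve_expiry_time = now + preserve_timeout
    else:
        machine.phase = "Terminating"
    return machine
```

Under this reading, a value of `0` for `autoPreserveFailedMax` disables auto-preservation entirely, because the count can never be below the limit.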
|
68 | 62 | ## State Diagrams:
|
69 | 63 |
|
70 |
| -1. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=now`: |
| 64 | +1. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=true`: |
71 | 65 | ```mermaid
|
72 | 66 | stateDiagram-v2
|
73 | 67 | direction TB
|
74 | 68 | state "Running" as R
|
75 |
| - state "Preserved" as P |
| 69 | + state "Running:Preserved" as RP |
76 | 70 | [*]-->R
|
77 |
| - R --> P: annotated with value=now && max not breached |
78 |
| - P --> R: annotated with value=false or timeout occurs |
| 71 | + R --> RP: annotated with preserve=true |
| 72 | + RP --> R: annotated with preserve=false or timeout occurs |
79 | 73 | ```
|
80 | 74 |
|
81 |
| -2. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=when-failed`: |
82 |
| -```mermaid |
83 |
| -stateDiagram-v2 |
84 |
| - state "Running" as R |
85 |
| - state "Running + Requested" as RR |
86 |
| - state "Failed" as F |
87 |
| - state "Preserved |
88 |
| - (node drained)" as P |
89 |
| - state "Terminating" as T |
90 |
| - [*]-->R |
91 |
| - R --> RR: annotated with value=when-failed && max not breached |
92 |
| - RR --> F: on failure |
93 |
| - F --> P |
94 |
| - P --> T: on timeout or value=false |
95 |
| - P --> R: if node Healthy before timeout |
96 |
| - T --> [*] |
97 |
| -``` |
98 |
| - |
99 |
| -3. State Diagram for when an un-annotated `Running` machine fails: |
| 75 | +2. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation): |
100 | 76 | ```mermaid
|
101 | 77 | stateDiagram-v2
|
101 | 77 | direction TB
|
103 | 79 | state "Running" as R
|
104 |
| - state "Failed" as F |
105 |
| - state "Preserved" as P |
| 80 | + state "Failed |
| 81 | + (node drained)" as F |
| 82 | + state "Failed:Preserved" as FP |
106 | 83 | state "Terminating" as T
|
107 | 84 | [*] --> R
|
108 | 85 | R-->F: on failure
|
109 |
| - F --> P: if max not breached |
110 |
| - F --> T: if max breached |
111 |
| - P --> T: on timeout or value=false |
112 |
| - P --> R : if node Healthy before timeout |
| 86 | + F --> FP: if autoPreserveFailedMax not breached |
| 87 | + F --> T: if autoPreserveFailedMax breached |
| 88 | + FP --> T: on timeout or preserve=false |
| 89 | + FP --> R : if node Healthy before timeout |
113 | 90 | T --> [*]
|
114 | 91 | ```
|
115 | 92 |
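
The two diagrams above can also be written as a single lookup table, which may be convenient when checking the transitions in tests. This is a minimal Python sketch, not MCM code: the event labels and the `next_phase` helper are hypothetical, and leaving the phase unchanged for unlisted pairs is an assumption of the sketch.

```python
# (current phase, event) -> next phase, transcribed from the state diagrams.
# Event labels are illustrative names chosen for this sketch.
TRANSITIONS = {
    ("Running", "annotated preserve=true"): "Running:Preserved",
    ("Running:Preserved", "annotated preserve=false"): "Running",
    ("Running:Preserved", "timeout"): "Running",
    ("Running", "failure"): "Failed",
    ("Failed", "limit not breached"): "Failed:Preserved",
    ("Failed", "limit breached"): "Terminating",
    ("Failed:Preserved", "timeout"): "Terminating",
    ("Failed:Preserved", "annotated preserve=false"): "Terminating",
    ("Failed:Preserved", "node healthy"): "Running",
}


def next_phase(phase: str, event: str) -> str:
    """Return the phase after `event`; unlisted pairs keep the current phase."""
    return TRANSITIONS.get((phase, event), phase)
```

For example, `next_phase("Failed:Preserved", "node healthy")` yields `"Running"`, matching the recovery edge in the auto-preservation diagram.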
|
116 |
| - |
117 | 93 | ## Use Cases:
|
118 | 94 |
|
119 | 95 | ### Use Case 1: Proactive Preservation Request
|
120 | 96 | **Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis.
|
121 | 97 | #### Steps:
|
122 |
| -1. Operator annotates node with `node.machine.sapcloud.io/preserve=when-failed`, provided `machinePreserveMax` is not violated |
123 |
| -2. Machine fails later |
124 |
| -3. MCM preserves the machine |
125 |
| -4. Operator analyzes the failed VM |
| 98 | +1. Operator annotates node with `node.machine.sapcloud.io/preserve=true` |
| 99 | +2. MCM preserves the machine, and prevents CA from scaling it down |
| 100 | +3. Operator analyzes the VM |
126 | 101 |
|
127 |
| -### Use Case 2: Automatic Preservation |
| 102 | +### Use Case 2: Auto-Preservation |
128 | 103 | **Scenario:** Machine fails unexpectedly, no prior annotation.
|
129 | 104 | #### Steps:
|
130 | 105 | 1. Machine transitions to `Failed` phase
|
131 |
| -2. If `machinePreserveMax` is not breached, machine moved to `Preserved` phase by MCM |
132 |
| -3. After `machinePreserveTimeout`, machine is terminated by MCM |
133 |
| - |
134 |
| -### Use Case 3: Preservation Request for Analysing Running Machine |
135 |
| -**Scenario:** Workload on machine failing. Operator wishes to diagnose. |
136 |
| -#### Steps: |
137 |
| -1. Operator annotates node with `node.machine.sapcloud.io/preserve=now`, provided `machinePreserveMax` is not violated |
138 |
| -2. MCM preserves machine and prevents CA from scaling it down |
139 |
| -3. Operator analyzes the machine |
| 106 | +2. Machine is drained |
| 107 | +3. If `autoPreserveFailedMax` is not breached, the machine is moved to `Failed:Preserved` phase by MCM |
| 108 | +4. After `machinePreserveTimeout`, machine is terminated by MCM |
140 | 109 |
|
141 |
| -### Use Case 4: Early Release |
| 110 | +### Use Case 3: Early Release |
142 | 111 | **Scenario:** Operator has completed the analysis and no longer requires the machine to be preserved
|
143 | 112 | #### Steps:
|
144 |
| -1. Machine is in `Preserved` phase |
| 113 | +1. Machine is in `Running:Preserved` or `Failed:Preserved` phase |
145 | 114 | 2. Operator adds: `node.machine.sapcloud.io/preserve=false` to node.
|
146 |
| -3. MCM transitions machine to `Running` or `Terminating`, depending on which phase it was in before moving to `Preserved`, even though `machinePreserveTimeout` has not expired |
147 |
| -4. Capacity becomes available for preserving future annotated machines or for auto-preservation of `Failed` machines. |
| 115 | +3. MCM transitions machine to `Running` or `Terminating`, for `Running:Preserved` or `Failed:Preserved` respectively, even though `machinePreserveTimeout` has not expired |
| 116 | +4. If machine was in `Failed:Preserved`, capacity becomes available for auto-preservation. |
148 | 117 |
|
149 | 118 |
|
150 |
| -## Limitations |
| 119 | +## Points to Note |
151 | 120 |
|
152 | 121 | 1. During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to Failed phase.
|
153 |
| -2. Since gardener worker pool can correspond to 1..N MachineDeployments depending on number of zones, we will need to distribute the `machinePreserveMax` across N machine deployments. |
154 |
| -So, even if there are no failed machines preserved in other zones, the max per zone would still be enforced. Hence, the value of `machinePreserveMax` should be chosen appropriately. |
| 122 | +2. Hibernation policy would override machine preservation. |
| 123 | +3. If Machine and Node annotation values differ for a particular annotation key (including `node.machine.sapcloud.io/preserve=true`), the Node annotation value will override the Machine annotation value. |
| 124 | +4. If `autoPreserveFailedMax` is reduced in the Shoot Spec, older machines are moved to `Terminating` phase before newer ones. |
| 125 | +5. When a MachineDeployment's replica count is scaled down, `Preserved` machines will be the last to be removed. The replica count will always be honoured. |
| 126 | +6. If the value for annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM. |
| 127 | +7. If `machinePreserveTimeout` is increased or decreased, the new value applies only to machines that enter the `Preserved` sub-phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines. |
| 128 | + - can specify timeout |
| 129 | +8. [Modify CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once feature is developed to use `node.machine.sapcloud.io/preserve=true` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` currently suggested. This would: |
| 130 | + - harmonise machine flow |
| 131 | + - shield from CA's internals |
| 132 | + - make it generic and no longer CA specific |
| 133 | + - allow a timeout to be specified |
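
Points 3 and 4 above describe two small rules that a sketch may make more concrete. The Python below is illustrative only and both function names are hypothetical. It makes two assumptions that the text does not settle: when the Node carries no value for the key, the Machine's value is used, and "older" is interpreted as the earliest preservation timestamp.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

PRESERVE_KEY = "node.machine.sapcloud.io/preserve"


def effective_preserve_value(
    machine_annotations: Dict[str, str], node_annotations: Dict[str, str]
) -> Optional[str]:
    """Point 3: the Node's annotation value overrides the Machine's."""
    if PRESERVE_KEY in node_annotations:
        return node_annotations[PRESERVE_KEY]
    # Assumption: fall back to the Machine's value if the Node has none.
    return machine_annotations.get(PRESERVE_KEY)


def machines_to_terminate(
    preserved: List[Tuple[str, datetime]], new_max: int
) -> List[str]:
    """Point 4: after the limit is reduced, pick the oldest excess machines.

    `preserved` holds (machine name, preserved-since timestamp) pairs.
    """
    excess = max(0, len(preserved) - new_max)
    oldest_first = sorted(preserved, key=lambda item: item[1])
    return [name for name, _ in oldest_first[:excess]]
```

With three preserved machines and a new limit of 1, the two with the earliest timestamps would be selected for `Terminating`.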