Commit 9fbdb30

Add proposal for preservation of failed machines
1 parent 32bd76c commit 9fbdb30

1 file changed: 103 additions, 0 deletions
@@ -0,0 +1,103 @@
# Preservation of Failed Machines

<!-- TOC -->

- [Preservation of Failed Machines](#preservation-of-failed-machines)
    - [Objective](#objective)
    - [Solution Design](#solution-design)
    - [State Machine](#state-machine)
    - [Use Cases](#use-cases)

<!-- /TOC -->

## Objective

Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase and, once the configured `machineHealthTimeout` has elapsed, to the `Failed` phase.
`Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SREs, operators, and support staff from conducting an analysis of the VM and makes it harder to find the root cause of the failure.

This document proposes enhancing MCM such that:
* VMs of `Failed` machines are retained temporarily for analysis
* There is a configurable limit on the number of `Failed` machines that can be preserved
* There is a configurable limit on the duration for which such machines are preserved
* Users can specify which healthy machines they would like to preserve in case of failure
* Users can request MCM to delete a preserved `Failed` machine even before the timeout expires

## Solution Design

To achieve the objectives above, the following changes are proposed:

1. Enhance the `machineControllerManager` configuration in the `ShootSpec` to specify the maximum number of failed machines to be preserved
   and the duration for which these machines will be preserved.
   ```
   machineControllerManager:
     failedMachinePreserveMax: 2
     failedMachinePreserveTimeout: 3h
   ```
   * Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `failedMachinePreserveMax` will be distributed across the N MachineDeployments.
   * `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
2. Allow a user/operator to explicitly request preservation of a machine, should it move to the `Failed` phase, by using the annotation `node.machine.sapcloud.io/preserve-when-failed=true`.
   When such an annotated machine transitions from `Unknown` to `Failed`, and `failedMachinePreserveMax` has not been reached, it is prevented from moving to the `Terminating` phase until `failedMachinePreserveTimeout` expires.
   * A user/operator can request MCM to stop preserving a preserved `Failed` machine by adding or modifying the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`.
   * For a machine thus annotated, MCM will move it to the `Terminating` phase even if `failedMachinePreserveTimeout` has not expired.
3. If an un-annotated machine moves to the `Failed` phase and `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine.
4. MCM will be modified to introduce a new stage in the `Failed` phase, `machineutils.PreserveFailed`. A failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`.
   * In this new stage, pods can be evicted and scheduled on other healthy machines, and the user/operator can wait for the corresponding VM to potentially recover. If the machine moves to the `Running` phase on recovery, new pods can be scheduled on it. It is yet to be determined whether this feature will be required.
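
Item 1 above says that `failedMachinePreserveMax` is distributed across the N MachineDeployments of a worker pool, but it does not say how. The sketch below assumes one possible strategy: an even split, with any remainder going to the first deployments in list order. The function name `distribute_preserve_max` and the deployment names are hypothetical and are not part of MCM or Gardener.

```python
def distribute_preserve_max(preserve_max: int, deployments: list[str]) -> dict[str, int]:
    """Split a pool-level preservation cap across MachineDeployments.

    Assumption: an even split where the remainder goes to the first
    deployments in list order. The proposal does not fix the strategy.
    """
    if preserve_max < 0:
        raise ValueError("preserve_max must be non-negative")
    if not deployments:
        raise ValueError("at least one MachineDeployment is required")
    base, remainder = divmod(preserve_max, len(deployments))
    return {
        name: base + (1 if index < remainder else 0)
        for index, name in enumerate(deployments)
    }


if __name__ == "__main__":
    # A cap of 2 spread over three zones leaves one zone without a slot.
    print(distribute_preserve_max(2, ["pool-z1", "pool-z2", "pool-z3"]))
```

Under this assumed strategy, a cap smaller than the number of zones leaves some MachineDeployments with no preservation slot, which is one reason the value has to be chosen with the zone count in mind.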
## State Machine

The behaviour described above can be summarised using the state machine below:

```
(Running Machine)
├── [User adds `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running + Requested)
└── [Machine fails + capacity available] → (PreserveFailed)

(Running + Requested)
├── [Machine fails + capacity available] → (PreserveFailed)
├── [Machine fails + no capacity] → Failed → Terminating
└── [User removes `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running)

(PreserveFailed)
├── [User adds `node.machine.sapcloud.io/preserve-when-failed=false`] → Terminating
└── [failedMachinePreserveTimeout expires] → Terminating
```
In the above state machine, the phase `Running` also includes machines that are still being created and have not yet encountered any errors.
The transition from `PreserveFailed` back to `Running` is not shown because it has not yet been decided whether it is in scope for the current iteration of this feature.
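
The diagram can also be read as a lookup table from (state, event) pairs to the next state. The following is a minimal sketch of that reading and not MCM code: the `State` and `Event` enums and the `next_state` function are hypothetical names, and only the transitions drawn in the diagram are encoded. Any other combination, such as an un-annotated machine failing with no capacity, raises an error because the diagram does not draw it.

```python
from enum import Enum


class State(Enum):
    RUNNING = "Running"
    RUNNING_REQUESTED = "Running + Requested"
    PRESERVE_FAILED = "PreserveFailed"
    TERMINATING = "Terminating"


class Event(Enum):
    ANNOTATION_SET_TRUE = "preserve-when-failed=true added"
    ANNOTATION_REMOVED = "preserve-when-failed=true removed"
    ANNOTATION_SET_FALSE = "preserve-when-failed=false added"
    FAILED_WITH_CAPACITY = "machine fails, capacity available"
    FAILED_NO_CAPACITY = "machine fails, no capacity"
    TIMEOUT_EXPIRED = "failedMachinePreserveTimeout expires"


# Only the edges drawn in the diagram; the intermediate Failed phase on the
# no-capacity path is folded into the Terminating result.
TRANSITIONS = {
    (State.RUNNING, Event.ANNOTATION_SET_TRUE): State.RUNNING_REQUESTED,
    (State.RUNNING, Event.FAILED_WITH_CAPACITY): State.PRESERVE_FAILED,
    (State.RUNNING_REQUESTED, Event.FAILED_WITH_CAPACITY): State.PRESERVE_FAILED,
    (State.RUNNING_REQUESTED, Event.FAILED_NO_CAPACITY): State.TERMINATING,
    (State.RUNNING_REQUESTED, Event.ANNOTATION_REMOVED): State.RUNNING,
    (State.PRESERVE_FAILED, Event.ANNOTATION_SET_FALSE): State.TERMINATING,
    (State.PRESERVE_FAILED, Event.TIMEOUT_EXPIRED): State.TERMINATING,
}


def next_state(state: State, event: Event) -> State:
    """Return the next state, or raise if the diagram draws no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"no transition drawn for {event.name} in state {state.value}"
        ) from None
```

Keeping the edges in a single table makes it easy to add the `PreserveFailed` to `Running` recovery edge later, should it come into scope.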

## Use Cases

### Use Case 1: Proactive Preservation Request
**Scenario:** An operator suspects a machine might fail and wants to ensure it is preserved for analysis.
#### Steps:
1. Operator annotates the node with `node.machine.sapcloud.io/preserve-when-failed=true`
2. Machine fails later
3. MCM preserves the machine (if capacity allows)
4. Operator analyzes the failed VM
5. Operator releases the failed machine by setting `node.machine.sapcloud.io/preserve-when-failed=false` on the node object

### Use Case 2: Automatic Preservation
**Scenario:** A machine fails unexpectedly, with no prior annotation.
#### Steps:
1. Machine transitions to the `Failed` phase
2. MCM checks preservation capacity
3. If capacity is available, MCM moves the machine to the `PreserveFailed` stage
4. After the timeout expires, MCM terminates the machine
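
Step 4 depends on deciding when `failedMachinePreserveTimeout` has expired. The sketch below shows one way to make that check, under two assumptions the proposal does not state: the timeout is measured from the moment the machine entered `Failed`, and values such as `3h` use only hour, minute, and second units. `parse_timeout` and `preservation_expired` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: only h/m/s units are supported, as in the `3h` example.
_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_timeout(value: str) -> timedelta:
    """Parse strings such as '3h' or '1h30m' into a timedelta."""
    parts = re.findall(r"(\d+)([hms])", value)
    # Reject input with anything left over, e.g. '3x' or '1.5h'.
    if not parts or "".join(amount + unit for amount, unit in parts) != value:
        raise ValueError(f"unsupported duration: {value!r}")
    total = timedelta()
    for amount, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(amount)})
    return total


def preservation_expired(failed_at: datetime, timeout: str, now: datetime) -> bool:
    """True once the preservation window, counted from failed_at, has elapsed."""
    return now - failed_at >= parse_timeout(timeout)


if __name__ == "__main__":
    failed_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    check_time = failed_at + timedelta(hours=2)
    print(preservation_expired(failed_at, "3h", check_time))
```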

### Use Case 3: Capacity Management
**Scenario:** Multiple machines fail when preservation capacity is full.
#### Steps:
1. Machines M1 and M2 are already preserved (capacity = 2)
2. Machine M3 fails with the annotation `node.machine.sapcloud.io/preserve-when-failed=true` set
3. MCM cannot preserve M3 due to capacity limits
4. MCM moves M3 from `Failed` to `Terminating`, after which it is deleted

### Use Case 4: Early Release
**Scenario:** Operator has completed the analysis and no longer requires the machine to be preserved.

#### Steps:
1. Machine M1 is in the `PreserveFailed` stage
2. Operator sets `node.machine.sapcloud.io/preserve-when-failed=false` on the node
3. MCM transitions M1 to `Terminating`
4. Capacity becomes available for preserving future `Failed` machines
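
The capacity behaviour in Use Cases 3 and 4 can be illustrated with a small in-memory model. This is a sketch only: `PreservationPool` is a hypothetical class, and it tracks a single capacity value rather than per-MachineDeployment shares.

```python
class PreservationPool:
    """Tracks which failed machines currently occupy preservation slots."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.preserved: set[str] = set()

    def on_failed(self, machine: str) -> str:
        """Return where a newly failed machine goes next."""
        if len(self.preserved) < self.capacity:
            self.preserved.add(machine)
            return "PreserveFailed"
        # No free slot: the machine proceeds from Failed to Terminating.
        return "Terminating"

    def release(self, machine: str) -> str:
        """Stop preserving a machine, e.g. after preserve-when-failed=false."""
        self.preserved.discard(machine)
        return "Terminating"


if __name__ == "__main__":
    pool = PreservationPool(capacity=2)
    for name in ("M1", "M2", "M3"):
        print(name, "->", pool.on_failed(name))  # M3 finds no free slot
    print("M1 ->", pool.release("M1"))           # early release frees a slot
    print("M4 ->", pool.on_failed("M4"))         # a later failure can use it
```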
