|
| 1 | +# Preservation of Failed Machines |
| 2 | + |
| 3 | +<!-- TOC --> |
| 4 | + |
| 5 | +- [Preservation of Failed Machines](#preservation-of-failed-machines) |
| 6 | + - [Objective](#objective) |
| 7 | + - [Solution Design](#solution-design) |
| 8 | + - [State Machine](#state-machine) |
| 9 | + - [Use Cases](#use-cases) |
| 10 | + |
| 11 | + |
| 12 | +<!-- /TOC --> |
| 13 | + |
| 14 | +## Objective |
| 15 | + |
| 16 | +Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase and, once the configured `machineHealthTimeout` elapses, to the `Failed` phase. |
| 17 | +`Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SREs, operators, and support staff from conducting an analysis of the VM and makes it harder to find the root cause of the failure. |
| 18 | + |
| 19 | +This document proposes enhancing MCM so that: |
| 20 | +* VMs of `Failed` machines are retained temporarily for analysis |
| 21 | +* There is a configurable limit to the number of `Failed` machines that can be preserved |
| 22 | +* There is a configurable limit to the duration for which such machines are preserved |
| 23 | +* Users can specify which healthy machines they would like to preserve in case of failure |
| 24 | +* Users can request MCM to delete a preserved `Failed` machine, even before the timeout expires |
| 25 | + |
| 26 | +## Solution Design |
| 27 | + |
| 28 | +To achieve these objectives, the following changes are proposed: |
| 29 | +1. Enhance the `machineControllerManager` configuration in the `ShootSpec` to specify the maximum number of failed machines to be preserved |
| 30 | +and the duration for which these machines will be preserved. |
| 31 | + ``` |
| 32 | + machineControllerManager: |
| 33 | + failedMachinePreserveMax: 2 |
| 34 | + failedMachinePreserveTimeout: 3h |
| 35 | + ``` |
| 36 | + * Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `failedMachinePreserveMax` will be distributed across the N MachineDeployments. |
| 37 | + * `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. |
| 38 | +2. Allow a user/operator to explicitly request preservation of a machine in case it moves to the `Failed` phase, using the annotation `node.machine.sapcloud.io/preserve-when-failed=true`. |
| 39 | +When such an annotated machine transitions from `Unknown` to `Failed`, it is prevented from moving to the `Terminating` phase until `failedMachinePreserveTimeout` expires. |
| 40 | + * A user/operator can request MCM to stop preserving a preserved `Failed` machine by setting the annotation `node.machine.sapcloud.io/preserve-when-failed=false`. |
| 41 | + * For a machine thus annotated, MCM will move it to the `Terminating` phase even if `failedMachinePreserveTimeout` has not expired. |
| 42 | +3. If an un-annotated machine moves to the `Failed` phase and `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. |
| 43 | +4. MCM will be modified to introduce a new stage in the `Failed` phase, `machineutils.PreserveFailed`; a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. |
| 44 | + * In this new stage, pods can be evicted and scheduled on other healthy machines, and the user/operator can wait for the corresponding VM to potentially recover. If the machine moves to the `Running` phase on recovery, new pods can be scheduled on it. Whether this recovery path will be required is yet to be determined. |
| 45 | + |
| 46 | + |
| 47 | +## State Machine |
| 48 | + |
| 49 | +The behaviour described above can be summarised using the state machine below: |
| 50 | + |
| 51 | +``` |
| 52 | +(Running Machine) |
| 53 | +├── [User adds `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running + Requested) |
| 54 | +└── [Machine fails + capacity available] → (PreserveFailed) |
| 55 | + |
| 56 | +(Running + Requested) |
| 57 | +├── [Machine fails + capacity available] → (PreserveFailed) |
| 58 | +├── [Machine fails + no capacity] → Failed → Terminating |
| 59 | +└── [User removes `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running) |
| 60 | + |
| 61 | +(PreserveFailed) |
| 62 | +├── [User adds `node.machine.sapcloud.io/preserve-when-failed=false`] → Terminating |
| 63 | +└── [failedMachinePreserveTimeout expires] → Terminating |
| 64 | + |
| 65 | +``` |
| 66 | +In the above state machine, the `Running` phase also includes machines that are still being created and have not yet encountered any errors. |
| 67 | +The transition from `PreserveFailed` to `Running` is not shown because it has not yet been determined whether it is in scope for the current iteration of this feature. |
| 68 | + |
| 69 | +## Use Cases |
| 70 | + |
| 71 | +### Use Case 1: Proactive Preservation Request |
| 72 | +**Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis. |
| 73 | +#### Steps: |
| 74 | +1. Operator annotates the node with `node.machine.sapcloud.io/preserve-when-failed=true` |
| 75 | +2. Machine fails later |
| 76 | +3. MCM preserves the machine (if capacity allows) |
| 77 | +4. Operator analyzes the failed VM |
| 78 | +5. Operator releases the failed machine by setting `node.machine.sapcloud.io/preserve-when-failed=false` on the node object |
| 79 | + |
| 80 | +### Use Case 2: Automatic Preservation |
| 81 | +**Scenario:** Machine fails unexpectedly, no prior annotation. |
| 82 | +#### Steps: |
| 83 | +1. Machine transitions to the `Failed` phase |
| 84 | +2. MCM checks preservation capacity |
| 85 | +3. If capacity is available, MCM moves the machine to the `PreserveFailed` stage |
| 86 | +4. After the timeout expires, MCM terminates the machine |
| 87 | + |
| 88 | +### Use Case 3: Capacity Management |
| 89 | +**Scenario:** Multiple machines fail when preservation capacity is full. |
| 90 | +#### Steps: |
| 91 | +1. Machines M1 and M2 are already preserved (capacity = 2) |
| 92 | +2. Machine M3 fails with annotation `node.machine.sapcloud.io/preserve-when-failed=true` set |
| 93 | +3. MCM cannot preserve M3 due to capacity limits |
| 94 | +4. MCM moves M3 from `Failed` to `Terminating`, after which it is deleted |
| 95 | + |
| 96 | +### Use Case 4: Early Release |
| 97 | +**Scenario:** The operator has completed the analysis and no longer requires the machine to be preserved. |
| 98 | + |
| 99 | +#### Steps: |
| 100 | +1. Machine M1 is in the `PreserveFailed` stage |
| 101 | +2. Operator sets `node.machine.sapcloud.io/preserve-when-failed=false` on the node |
| 102 | +3. MCM transitions M1 to `Terminating` |
| 103 | +4. Capacity becomes available for preserving future `Failed` machines. |
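
The transitions in the state machine and use cases above can be sketched in Go as two small decision functions. This is only an illustrative sketch and not MCM's actual implementation: the `Outcome` type and the functions `onFailure` and `whilePreserved` are hypothetical names, and the sketch assumes the preservation limit has already been divided among the MachineDeployments. Only the annotation key and the `PreserveFailed` and `Terminating` names come from the proposal.

```go
package main

import "time"

// PreserveAnnotation is the annotation key named in the proposal.
const PreserveAnnotation = "node.machine.sapcloud.io/preserve-when-failed"

// Outcome is a hypothetical type used only in this sketch.
type Outcome string

const (
	PreserveFailed Outcome = "PreserveFailed"
	Terminating    Outcome = "Terminating"
)

// onFailure decides what happens when a machine moves to Failed.
// preserved is the number of machines currently preserved; limit is the
// share of failedMachinePreserveMax assumed to be available to this
// MachineDeployment. Per the state machine, annotated and un-annotated
// machines are both preserved only while capacity remains.
func onFailure(preserved, limit int) Outcome {
	if preserved < limit {
		return PreserveFailed
	}
	return Terminating
}

// whilePreserved decides whether a preserved machine stays preserved.
// It is released early when the annotation is set to "false", or once
// it has been preserved for at least failedMachinePreserveTimeout.
func whilePreserved(annotations map[string]string, preservedFor, timeout time.Duration) Outcome {
	if annotations[PreserveAnnotation] == "false" {
		return Terminating
	}
	if preservedFor >= timeout {
		return Terminating
	}
	return PreserveFailed
}
```

With a limit of 2, `onFailure(2, 2)` corresponds to Use Case 3 (M3 is terminated), while `whilePreserved` with the annotation set to `false` corresponds to Use Case 4 (early release).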