| 1 | +# Cancellation in Hyperlight |
| 2 | + |
| 3 | +This document describes the cancellation mechanism and memory ordering guarantees for Hyperlight. |
| 4 | + |
| 5 | +## Overview (Linux) |
| 6 | + |
| 7 | +Hyperlight provides a mechanism to forcefully interrupt guest execution through the `InterruptHandle::kill()` method. This involves coordination between multiple threads using atomic operations and POSIX signals to ensure safe and reliable cancellation. |
| 8 | + |
| 9 | +## Key Components |
| 10 | + |
| 11 | +### LinuxInterruptHandle State |
| 12 | + |
| 13 | +The `LinuxInterruptHandle` uses a packed atomic u8 to track execution state: |
| 14 | + |
| 15 | +- **state (AtomicU8)**: Packs three bits: |
| 16 | + - **Bit 2 (DEBUG_INTERRUPT_BIT)**: Set when debugger interrupt is requested (gdb feature only) |
| 17 | + - **Bit 1 (RUNNING_BIT)**: Set when vCPU is actively running in guest mode |
| 18 | + - **Bit 0 (CANCEL_BIT)**: Set when cancellation has been requested via `kill()` |
| 19 | +- **tid (AtomicU64)**: Thread ID where the vCPU is running |
| 20 | +- **dropped (AtomicBool)**: Set when the corresponding VM has been dropped |
| 21 | + |
| 22 | +The packed state enables atomic reads of RUNNING_BIT, CANCEL_BIT and DEBUG_INTERRUPT_BIT simultaneously via `get_running_cancel_debug()`. Within a single `VirtualCPU::run()` call, the CANCEL_BIT remains set across vcpu exits and re-entries (such as when calling host functions), ensuring cancellation persists until the guest call completes. However, `clear_cancel()` resets the CANCEL_BIT at the beginning of each new guest function call (specifically in `MultiUseSandbox::call`, before `VirtualCPU::run()` is called), preventing cancellation requests from affecting subsequent guest function calls. |
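The packed state described above can be sketched as follows. This is a minimal illustration, not the Hyperlight source: the `PackedState` type is hypothetical, the bit values are assumed from the bit positions listed above, and `set_running` uses a plain `fetch_or` for brevity.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Assumed values, derived from the bit positions listed above.
const CANCEL_BIT: u8 = 1 << 0;
const RUNNING_BIT: u8 = 1 << 1;
const DEBUG_INTERRUPT_BIT: u8 = 1 << 2;

/// Hypothetical stand-in for the packed state field of the interrupt handle.
struct PackedState(AtomicU8);

impl PackedState {
    fn new() -> Self {
        Self(AtomicU8::new(0))
    }

    /// A single Acquire load yields a consistent (running, cancel, debug) snapshot.
    fn get_running_cancel_debug(&self) -> (bool, bool, bool) {
        let s = self.0.load(Ordering::Acquire);
        (
            (s & RUNNING_BIT) != 0,
            (s & CANCEL_BIT) != 0,
            (s & DEBUG_INTERRUPT_BIT) != 0,
        )
    }

    fn kill(&self) {
        self.0.fetch_or(CANCEL_BIT, Ordering::Release);
    }

    fn set_running(&self) {
        self.0.fetch_or(RUNNING_BIT, Ordering::Release);
    }

    fn clear_running(&self) {
        self.0.fetch_and(!RUNNING_BIT, Ordering::Release);
    }

    fn clear_cancel(&self) {
        self.0.fetch_and(!CANCEL_BIT, Ordering::Release);
    }
}

fn main() {
    let st = PackedState::new();
    st.set_running();
    st.kill();
    st.clear_running(); // vCPU exit (e.g. host call): the cancel bit persists
    assert_eq!(st.get_running_cancel_debug(), (false, true, false));
    st.clear_cancel(); // the next guest function call starts clean
    assert_eq!(st.get_running_cancel_debug(), (false, false, false));
}
```

Because all three flags live in one byte, a reader can never observe `running` from one moment and `cancel` from another.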
| 23 | + |
| 24 | +### Signal Mechanism |
| 25 | + |
| 26 | +On Linux, Hyperlight uses `SIGRTMIN + offset` (configurable, default offset is 0) to interrupt the vCPU thread. The signal handler is intentionally a no-op: the signal's only purpose is to make the `ioctl` call that runs the vCPU return `EINTR`, which forces a VM exit. |
| 27 | + |
| 28 | +## Run Loop Flow |
| 29 | + |
| 30 | +The main execution loop in `VirtualCPU::run()` coordinates vCPU execution with potential interrupts. |
| 31 | + |
| 32 | +```mermaid |
| 33 | +sequenceDiagram |
| 34 | + participant Caller as Caller (call()) |
| 35 | + participant vCPU as vCPU (run()) |
| 36 | + participant IH as InterruptHandle |
| 37 | + |
| 38 | + Note over Caller: === TIMING POINT 1 === |
| 39 | + Caller->>IH: clear_cancel() |
| 40 | + Note right of Caller: Start of cancellable window |
| 41 | + |
| 42 | + Caller->>vCPU: run() |
| 43 | + activate vCPU |
| 44 | + |
| 45 | + loop Run Loop |
| 46 | + Note over vCPU: === TIMING POINT 2 === |
| 47 | + vCPU->>IH: set_tid() |
| 48 | + vCPU->>IH: set_running() |
| 49 | + Note right of vCPU: Enable signal delivery |
| 50 | + |
| 51 | + vCPU->>IH: is_cancelled() |
| 52 | + |
| 53 | + alt is_cancelled() == true |
| 54 | + vCPU-->>Caller: return Cancelled() |
| 55 | + else is_cancelled() == false |
| 56 | + Note over vCPU: === TIMING POINT 3 === |
| 57 | + vCPU->>vCPU: run_vcpu() (Enter Guest) |
| 58 | + activate vCPU |
| 59 | + |
| 60 | + alt Guest completes normally |
| 61 | + vCPU-->>vCPU: VmExit::Halt() |
| 62 | + else Guest performs I/O |
| 63 | + vCPU-->>vCPU: VmExit::IoOut()/MmioRead() |
| 64 | + else Signal received |
| 65 | + vCPU-->>vCPU: VmExit::Cancelled() |
| 66 | + end |
| 67 | + deactivate vCPU |
| 68 | + end |
| 69 | + |
| 70 | + Note over vCPU: === TIMING POINT 4 === |
| 71 | + vCPU->>IH: clear_running() |
| 72 | + Note right of vCPU: Disable signal delivery |
| 73 | + |
| 74 | + Note over vCPU: === TIMING POINT 5 === |
| 75 | + vCPU->>IH: is_cancelled() |
| 76 | + IH-->>vCPU: cancel_requested (bool) |
| 77 | + Note right of vCPU: Check if we should exit |
| 78 | + |
| 79 | + Note over vCPU: === TIMING POINT 6 === |
| 80 | + |
| 81 | + alt Exit reason is Halt |
| 82 | + vCPU-->>Caller: return Ok(()) |
| 83 | + else Exit reason is Cancelled AND cancel_requested==true |
| 84 | + vCPU-->>Caller: return Err(ExecutionCanceledByHost) |
| 85 | + else Exit reason is Cancelled AND cancel_requested==false |
| 86 | + Note right of vCPU: Stale signal, retry |
| 87 | + vCPU->>vCPU: continue (retry iteration) |
| 88 | + else Exit reason is I/O or host call |
| 89 | + vCPU->>vCPU: Handle and continue loop |
| 90 | + end |
| 91 | + end |
| 92 | + deactivate vCPU |
| 93 | +``` |
| 94 | + |
| 95 | +### Detailed Run Loop Steps |
| 96 | + |
| 97 | +1. **Timing Point 1** - Start of Guest Call (in `call()`): |
| 98 | + - `clear_cancel()` resets the cancellation state *before* `run()` is called. |
| 99 | + - Any `kill()` completed before this point is ignored. |
| 100 | + |
| 101 | +2. **Timing Point 2** - Start of Loop Iteration: |
| 102 | + - `set_running()` enables signal delivery. |
| 103 | + - Checks `is_cancelled()` immediately to handle pre-run cancellation. |
| 104 | + |
| 105 | +3. **Timing Point 3** - Guest Entry: |
| 106 | + - Enters guest execution. |
| 107 | + - If `kill()` happens now, signals will interrupt the guest. |
| 108 | + |
| 109 | +4. **Timing Point 4** - Guest Exit: |
| 110 | + - `clear_running()` disables signal delivery. |
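| 111 | + - Once `RUNNING_BIT` is clear, any active `send_signal()` loop stops sending; a signal already in flight arrives outside the `ioctl` and is harmless. |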
| 112 | + |
| 113 | +5. **Timing Point 5** - Capture State: |
| 114 | + - `is_cancelled()` captures the cancellation request state. |
| 115 | + - This determines if a `Cancelled` exit was genuine or stale. |
| 116 | + |
| 117 | +6. **Timing Point 6** - Handle Exit: |
| 118 | + - Processes the exit reason based on the captured `cancel_requested` state. |
| 119 | + - If `Cancelled` but `!cancel_requested`, it's a stale signal -> retry. |
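The six timing points can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not the actual `VirtualCPU::run()`: the `Handle` type, the `run_vcpu` closure (standing in for the `ioctl` that enters the guest) and the enum shapes are hypothetical, `set_tid()` is omitted, and the pre-run cancellation check is modelled as a `Cancelled` exit.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const CANCEL_BIT: u8 = 1 << 0; // assumed value
const RUNNING_BIT: u8 = 1 << 1; // assumed value

#[derive(Debug, PartialEq)]
enum VmExit {
    Halt,
    Io,
    Cancelled,
}

#[derive(Debug, PartialEq)]
enum RunError {
    ExecutionCanceledByHost,
}

/// Hypothetical handle holding only the packed state.
struct Handle {
    state: AtomicU8,
}

impl Handle {
    fn is_cancelled(&self) -> bool {
        (self.state.load(Ordering::Acquire) & CANCEL_BIT) != 0
    }
    fn set_running(&self) {
        self.state.fetch_or(RUNNING_BIT, Ordering::Release);
    }
    fn clear_running(&self) {
        self.state.fetch_and(!RUNNING_BIT, Ordering::Release);
    }
    fn kill(&self) {
        self.state.fetch_or(CANCEL_BIT, Ordering::Release);
    }
}

/// `run_vcpu` stands in for entering the guest.
fn run(h: &Handle, mut run_vcpu: impl FnMut() -> VmExit) -> Result<(), RunError> {
    loop {
        // Timing Point 2: enable signal delivery, then check for pre-run cancellation.
        h.set_running();
        let exit = if h.is_cancelled() {
            VmExit::Cancelled
        } else {
            // Timing Point 3: enter the guest.
            run_vcpu()
        };
        // Timing Point 4: disable signal delivery.
        h.clear_running();
        // Timing Point 5: capture the cancellation state.
        let cancel_requested = h.is_cancelled();
        // Timing Point 6: act on the exit reason.
        match exit {
            VmExit::Halt => return Ok(()),
            VmExit::Cancelled if cancel_requested => {
                return Err(RunError::ExecutionCanceledByHost)
            }
            VmExit::Cancelled => continue, // stale signal: retry
            VmExit::Io => continue,        // host call handled, re-enter the guest
        }
    }
}

fn main() {
    let h = Handle { state: AtomicU8::new(0) };

    // Exits are popped from the end: stale Cancelled, then Io, then Halt.
    let mut script = vec![VmExit::Halt, VmExit::Io, VmExit::Cancelled];
    assert_eq!(run(&h, || script.pop().unwrap()), Ok(()));

    // With CANCEL_BIT already set, the guest is never entered.
    h.kill();
    let result = run(&h, || -> VmExit { unreachable!("guest must not be entered") });
    assert_eq!(result, Err(RunError::ExecutionCanceledByHost));
}
```

In this sketch `main` plays the role of `call()`; Timing Point 1 (`clear_cancel()`) is left out because the state starts at zero.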
| 120 | + |
| 121 | +## Kill Operation Flow |
| 122 | + |
| 123 | +The `kill()` operation involves setting the CANCEL_BIT and sending signals to interrupt the vCPU: |
| 124 | + |
| 125 | +```mermaid |
| 126 | +sequenceDiagram |
| 127 | + participant Caller as Caller Thread |
| 128 | + participant IH as InterruptHandle |
| 129 | + participant Signal as Signal Delivery |
| 130 | + participant vCPU as vCPU Thread |
| 131 | + |
| 132 | + Caller->>IH: kill() |
| 133 | + activate IH |
| 134 | + |
| 135 | + IH->>IH: fetch_or(CANCEL_BIT, Release) |
| 136 | + Note right of IH: Atomically set cancel=true<br/>with Release ordering |
| 137 | + |
| 138 | + IH->>IH: send_signal() |
| 139 | + activate IH |
| 140 | + |
| 141 | + loop Retry Loop |
| 142 | + IH->>IH: get_running_cancel_debug() |
| 143 | + Note right of IH: Load with Acquire ordering |
| 144 | + |
| 145 | + alt Not running OR not cancelled |
| 146 | + IH-->>IH: break (sent_signal=false/true) |
| 147 | + else Running AND cancelled |
| 148 | + IH->>IH: tid.load(Acquire) |
| 149 | + IH->>Signal: pthread_kill(tid, SIGRTMIN+offset) |
| 150 | + activate Signal |
| 151 | + Note right of Signal: Send signal to vCPU thread |
| 152 | + Signal->>vCPU: SIGRTMIN+offset delivered |
| 153 | + Note right of vCPU: Signal handler is no-op<br/>Purpose is to cause EINTR |
| 154 | + deactivate Signal |
| 155 | + |
| 156 | + alt Signal arrives during ioctl |
| 157 | + vCPU->>vCPU: ioctl returns EINTR |
| 158 | + vCPU->>vCPU: return VmExit::Cancelled() |
| 159 | + else Signal arrives between ioctls |
| 160 | + Note right of vCPU: Signal is harmless |
| 161 | + end |
| 162 | + |
| 163 | + IH->>IH: sleep(retry_delay) |
| 164 | + Note right of IH: Default 500μs between retries |
| 165 | + end |
| 166 | + end |
| 167 | + |
| 168 | + deactivate IH |
| 169 | + IH-->>Caller: sent_signal |
| 170 | + deactivate IH |
| 171 | +``` |
| 172 | + |
| 173 | +### Kill Operation Steps |
| 174 | + |
| 175 | +1. **Set Cancel Flag**: Atomically set the CANCEL_BIT using `fetch_or(CANCEL_BIT)` with `Release` ordering |
| 176 | + - Ensures all writes before `kill()` are visible when vCPU thread checks `is_cancelled()` with `Acquire` |
| 177 | + |
| 178 | +2. **Send Signals**: Enter retry loop via `send_signal()` |
| 179 | + - Atomically load running, cancel and debug flags via `get_running_cancel_debug()` with `Acquire` ordering |
| 180 | + - Continue if `running=true AND cancel=true` (or `running=true AND debug=true` with gdb) |
| 181 | + - Exit loop immediately if `running=false OR (cancel=false AND debug=false)` |
| 182 | + |
| 183 | +3. **Signal Delivery**: Send `SIGRTMIN+offset` via `pthread_kill` |
| 184 | + - Signal interrupts the `ioctl` that runs the vCPU, causing `EINTR` |
| 185 | + - Signal handler is intentionally a no-op |
| 186 | + - Returns `VmExit::Cancelled()` when `EINTR` is received |
| 187 | + |
| 188 | +4. **Loop Termination**: The signal loop terminates when: |
| 189 | + - vCPU is no longer running (`running=false`), OR |
| 190 | + - Cancellation is no longer requested (`cancel=false`) |
| 191 | + - See the loop termination proof in the source code for rigorous correctness analysis |
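The kill steps above can be sketched as a retry loop. This is a minimal sketch, assuming a `deliver` closure in place of `pthread_kill` so that the example runs without platform bindings; the free-function signatures are hypothetical, and the debug-interrupt path is omitted.

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};
use std::time::Duration;

const CANCEL_BIT: u8 = 1 << 0; // assumed value
const RUNNING_BIT: u8 = 1 << 1; // assumed value

/// Keeps signalling until the vCPU stops running or the cancellation is cleared.
/// Returns whether at least one signal was sent.
fn send_signal(
    state: &AtomicU8,
    tid: &AtomicU64,
    retry_delay: Duration,
    mut deliver: impl FnMut(u64), // stand-in for pthread_kill(tid, SIGRTMIN + offset)
) -> bool {
    let mut sent_signal = false;
    loop {
        let s = state.load(Ordering::Acquire);
        let running = (s & RUNNING_BIT) != 0;
        let cancel = (s & CANCEL_BIT) != 0;
        if !running || !cancel {
            break;
        }
        deliver(tid.load(Ordering::Acquire));
        sent_signal = true;
        std::thread::sleep(retry_delay);
    }
    sent_signal
}

/// Step 1 sets the cancel flag, step 2 enters the retry loop.
fn kill(
    state: &AtomicU8,
    tid: &AtomicU64,
    retry_delay: Duration,
    deliver: impl FnMut(u64),
) -> bool {
    state.fetch_or(CANCEL_BIT, Ordering::Release);
    send_signal(state, tid, retry_delay, deliver)
}

fn main() {
    let state = AtomicU8::new(RUNNING_BIT);
    let tid = AtomicU64::new(42);
    let mut signals = 0;

    // Simulate a vCPU that only exits after the third signal.
    let sent = kill(&state, &tid, Duration::from_micros(500), |_tid| {
        signals += 1;
        if signals == 3 {
            state.fetch_and(!RUNNING_BIT, Ordering::Release);
        }
    });

    assert!(sent);
    assert_eq!(signals, 3);
}
```

The loop has no retry limit; in this sketch it ends only when one of the two termination conditions from step 4 becomes true.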
| 192 | + |
| 193 | +## Memory Ordering Guarantees |
| 194 | + |
| 195 | +Hyperlight uses Release-Acquire semantics to ensure correctness across threads: |
| 196 | + |
| 197 | +```mermaid |
| 198 | +graph TB |
| 199 | + subgraph "vCPU Thread" |
| 200 | + A[set_tid<br/>Store tid with Release] |
| 201 | + B[set_running<br/>fetch_update RUNNING_BIT<br/>with Release] |
| 202 | + C[is_cancelled<br/>Load with Acquire] |
| 203 | + D[clear_running<br/>fetch_and with Release] |
| 204 | + J[is_debug_interrupted<br/>Load with Acquire] |
| 205 | + end |
| 206 | + |
| 207 | + subgraph "Interrupt Thread" |
| 208 | + E[kill<br/>fetch_or CANCEL_BIT<br/>with Release] |
| 209 | + F[send_signal<br/>Load running with Acquire] |
| 210 | + G[Load tid with Acquire] |
| 211 | + H[pthread_kill] |
| 212 | + I[kill_from_debugger<br/>fetch_or DEBUG_INTERRUPT_BIT<br/>with Release] |
| 213 | + end |
| 214 | + |
| 215 | + B -->|Synchronizes-with| F |
| 216 | + A -->|Happens-before via B→F| G |
| 217 | + E -->|Synchronizes-with| C |
| 218 | + D -->|Synchronizes-with| F |
| 219 | + I -->|Synchronizes-with| J |
| 220 | +``` |
| 221 | + |
| 222 | +### Ordering Rules |
| 223 | + |
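| 224 | +1. **tid Store → running Load**: `set_tid` stores the thread ID with Release before `set_running` (Release) sets `RUNNING_BIT`. The Acquire load of the state in `send_signal` synchronizes with `set_running`, so an interrupt thread that observes `running=true` also sees the correct thread ID. |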
| 225 | +2. **CANCEL_BIT**: `kill` (Release) synchronizes with `is_cancelled` (Acquire), ensuring the vCPU sees the cancellation request. |
| 226 | +3. **clear_running**: `clear_running` (Release) synchronizes with `send_signal` (Acquire), ensuring the interrupt thread stops sending signals when the vCPU stops. |
| 227 | +4. **clear_cancel**: Uses Release to ensure operations from the previous run are visible to other threads. |
| 228 | +5. **dropped flag**: `set_dropped` (Release) synchronizes with `dropped` (Acquire), ensuring cleanup visibility. |
| 229 | +6. **debug_interrupt**: `kill_from_debugger` (Release) synchronizes with `is_debug_interrupted` (Acquire), ensuring the vCPU sees the debug interrupt request. |
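The first ordering rule can be illustrated with two threads. This is a sketch under assumptions: the helper name `publish_and_observe` is hypothetical, and `RUNNING_BIT` is given an assumed value. The spawned thread plays the vCPU thread and the calling thread plays the interrupt thread.

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;

const RUNNING_BIT: u8 = 1 << 1; // assumed value

/// Publishes a tid and then sets RUNNING_BIT on one thread; the caller waits
/// until it observes RUNNING_BIT and returns the tid it then reads.
fn publish_and_observe(expected_tid: u64) -> u64 {
    let tid = Arc::new(AtomicU64::new(0));
    let state = Arc::new(AtomicU8::new(0));

    let vcpu = {
        let (tid, state) = (Arc::clone(&tid), Arc::clone(&state));
        thread::spawn(move || {
            tid.store(expected_tid, Ordering::Release); // set_tid
            state.fetch_or(RUNNING_BIT, Ordering::Release); // set_running
        })
    };

    // Interrupt-thread side: the Acquire load that sees RUNNING_BIT synchronizes
    // with the Release fetch_or, so the earlier tid store is visible as well.
    while (state.load(Ordering::Acquire) & RUNNING_BIT) == 0 {
        std::hint::spin_loop();
    }
    let seen = tid.load(Ordering::Acquire);

    vcpu.join().unwrap();
    seen
}

fn main() {
    assert_eq!(publish_and_observe(1234), 1234);
}
```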
| 230 | + |
| 231 | +## Interaction with Host Function Calls |
| 232 | + |
| 233 | +When a guest performs a host function call, the vCPU exits and `RUNNING_BIT` is cleared. `CANCEL_BIT` persists, so if `kill()` is called during the host call, cancellation is detected when the guest attempts to resume. |
| 234 | + |
| 235 | +## Signal Behavior Across Loop Iterations |
| 236 | + |
| 237 | +When the run loop iterates (e.g., for host calls): |
| 238 | +1. `clear_running()` sets `running=false`, causing any active `send_signal()` loop to exit. |
| 239 | +2. `set_running()` sets `running=true` again. |
| 240 | +3. `is_cancelled()` detects the persistent `cancel` flag and returns early. |
| 241 | + |
| 242 | +## Race Conditions |
| 243 | + |
| 244 | +1. **kill() between calls**: `clear_cancel()` at Timing Point 1 ensures `kill()` requests from before the current call are ignored. |
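| 245 | +2. **kill() before run_vcpu()**: If `CANCEL_BIT` is set before the check at Timing Point 2, `run()` returns without entering the guest. If it is set after that check, the `send_signal()` retry loop keeps signalling until a signal lands during the `ioctl` and interrupts the guest. |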
| 246 | +3. **Guest completes before signal**: If the guest finishes naturally, the signal is ignored or causes a retry in the next iteration (handled as stale). |
| 247 | +4. **Stale signals**: If a signal from a previous call arrives during a new call, `cancel_requested` (checked at Timing Point 5) will be false, causing a retry. |
| 248 | +5. **ABA Problem**: Clearing `CANCEL_BIT` at the start of each guest function call (in `call()`, before `run()`) breaks any ongoing `send_signal()` loops from previous calls. |
| 249 | + |
| 250 | +## Windows Platform Differences |
| 251 | + |
| 252 | +While the core cancellation mechanism follows the same conceptual model on Windows, there are several platform-specific differences in implementation: |
| 253 | + |
| 254 | +### WindowsInterruptHandle Structure |
| 255 | + |
| 256 | +The `WindowsInterruptHandle` uses a simpler structure compared to Linux: |
| 257 | + |
| 258 | +- **state (AtomicU8)**: Packs three bits (RUNNING_BIT, CANCEL_BIT and DEBUG_INTERRUPT_BIT) |
| 259 | +- **partition_handle**: Windows Hyper-V partition handle for the VM |
| 260 | +- **dropped (AtomicBool)**: Set when the corresponding VM has been dropped |
| 261 | + |
| 262 | +**Key difference**: No `tid` field is needed because Windows doesn't use thread-targeted signals. No `retry_delay` or `sig_rt_min_offset` fields are needed. |
| 263 | + |
| 264 | +### Kill Operation Differences |
| 265 | + |
| 266 | +On Windows, the `kill()` method uses the Windows Hypervisor Platform (WHP) API `WHvCancelRunVirtualProcessor` instead of POSIX signals to interrupt the vCPU. |
| 267 | + |
| 268 | +**Key difference**: |
| 269 | +1. **No signal loop**: Windows calls `WHvCancelRunVirtualProcessor()` at most once in `kill()`, without needing retries |
| 270 | + |
| 271 | +### Why Linux Needs a Retry Loop but Windows Doesn't |
| 272 | + |
| 273 | +The fundamental difference between the platforms lies in how cancellation interacts with the hypervisor: |
| 274 | + |
| 275 | +**Linux (KVM/mshv3)**: POSIX signals can only interrupt the vCPU when the thread is executing kernel code (specifically, during the `ioctl` syscall that runs the vCPU). There is a narrow timing window between when the signal is sent and when the vCPU enters guest mode. If a signal arrives before entering guest mode, it will be delivered but won't interrupt the guest execution. This requires repeatedly sending signals with delays until either: |
| 276 | +- The vCPU exits (and consequently RUNNING_BIT becomes false), or |
| 277 | +- The cancellation is cleared (CANCEL_BIT becomes false) |
| 278 | + |
| 279 | +**Windows (WHP)**: The `WHvCancelRunVirtualProcessor()` API sets an internal `CancelPending` flag in the Windows Hypervisor Platform. This flag is: |
| 280 | +- Set immediately by the API call |
| 281 | +- Checked at the start of each VM run loop iteration (before entering guest mode) |
| 282 | +- Automatically cleared when it causes a `WHvRunVpExitReasonCanceled` exit |
| 283 | + |
| 284 | +This means if `WHvCancelRunVirtualProcessor()` is called: |
| 285 | +- **While the vCPU is running**: The API signals the hypervisor to exit with `WHvRunVpExitReasonCanceled` |
| 286 | +- **Before VM runs**: The `CancelPending` flag persists and causes an immediate cancellation on the next VM run attempt |
| 287 | + |
| 288 | +Therefore, we only call `WHvCancelRunVirtualProcessor()` after checking that `RUNNING_BIT` is set. This is important because: |
| 289 | +1. If called when the vCPU is not running, the API would still succeed and would unconditionally cancel the next run attempt. That is undesirable because `kill()` should have no effect when the vCPU is not running |
| 290 | +2. This makes the InterruptHandle's `CANCEL_BIT` (which is cleared at the start of each guest function call) the source of truth for whether cancellation is intended for the current call |
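The Windows behaviour can be sketched as follows. This is an illustration only: the `cancel_run` closure is a hypothetical stand-in for the `WHvCancelRunVirtualProcessor()` call, and both the use of the `fetch_or` return value to read `RUNNING_BIT` and the `bool` return value are assumptions, not confirmed details of the implementation.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const CANCEL_BIT: u8 = 1 << 0; // assumed value
const RUNNING_BIT: u8 = 1 << 1; // assumed value

/// Sets CANCEL_BIT, then cancels the virtual processor at most once,
/// and only if the vCPU was running at that moment.
fn kill(state: &AtomicU8, mut cancel_run: impl FnMut()) -> bool {
    // fetch_or returns the previous state, so the running check and the
    // cancel-bit update happen in one atomic step (an assumption of this sketch).
    let prev = state.fetch_or(CANCEL_BIT, Ordering::AcqRel);
    let running = (prev & RUNNING_BIT) != 0;
    if running {
        cancel_run(); // stand-in for WHvCancelRunVirtualProcessor()
    }
    running
}

fn main() {
    let mut calls = 0;

    // Not running: CANCEL_BIT is recorded, but the hypervisor is not touched.
    let idle = AtomicU8::new(0);
    assert!(!kill(&idle, || calls += 1));
    assert_eq!(calls, 0);

    // Running: exactly one cancel call, no retry loop.
    let busy = AtomicU8::new(RUNNING_BIT);
    assert!(kill(&busy, || calls += 1));
    assert_eq!(calls, 1);
}
```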
| 291 | + |