Commit 9ab8884

Simplify cancellation (#1024)
* Simplify cancellation

  Signed-off-by: Ludvig Liljenberg <[email protected]>

* PR feedback

  Signed-off-by: Ludvig Liljenberg <[email protected]>

* Add comment about order of set_running vs is_cancelled

  Signed-off-by: Ludvig Liljenberg <[email protected]>

* Add test that makes sure kill() never fails

  Signed-off-by: Ludvig Liljenberg <[email protected]>

* Add tests that tests moving sandbox across thread doesn't cancel wrong sandbox

  Signed-off-by: Ludvig Liljenberg <[email protected]>

* Move debug_interrupt AtomicBool into state AtomicU64

  Signed-off-by: Ludvig Liljenberg <[email protected]>

* Change interrupt_handle state from AtomicU64 to AtomicU8

  Signed-off-by: Ludvig Liljenberg <[email protected]>

---------

Signed-off-by: Ludvig Liljenberg <[email protected]>
1 parent 9b8ade9 commit 9ab8884

9 files changed

+1009
-821
lines changed

docs/cancellation.md

Lines changed: 291 additions & 0 deletions
@@ -0,0 +1,291 @@
1+
# Cancellation in Hyperlight
2+
3+
This document describes the cancellation mechanism and memory ordering guarantees for Hyperlight.
4+
5+
## Overview (Linux)
6+
7+
Hyperlight provides a mechanism to forcefully interrupt guest execution through the `InterruptHandle::kill()` method. This involves coordination between multiple threads using atomic operations and POSIX signals to ensure safe and reliable cancellation.
8+
9+
## Key Components
10+
11+
### LinuxInterruptHandle State
12+
13+
The `LinuxInterruptHandle` uses a packed atomic u8 to track execution state:
14+
15+
- **state (AtomicU8)**: Packs three bits:
16+
- **Bit 2 (DEBUG_INTERRUPT_BIT)**: Set when debugger interrupt is requested (gdb feature only)
17+
- **Bit 1 (RUNNING_BIT)**: Set when vCPU is actively running in guest mode
18+
- **Bit 0 (CANCEL_BIT)**: Set when cancellation has been requested via `kill()`
19+
- **tid (AtomicU64)**: Thread ID where the vCPU is running
20+
- **dropped (AtomicBool)**: Set when the corresponding VM has been dropped
21+
22+
The packed state enables atomic reads of RUNNING_BIT, CANCEL_BIT and DEBUG_INTERRUPT_BIT simultaneously via `get_running_cancel_debug()`. Within a single `VirtualCPU::run()` call, the CANCEL_BIT remains set across vcpu exits and re-entries (such as when calling host functions), ensuring cancellation persists until the guest call completes. However, `clear_cancel()` resets the CANCEL_BIT at the beginning of each new guest function call (specifically in `MultiUseSandbox::call`, before `VirtualCPU::run()` is called), preventing cancellation requests from affecting subsequent guest function calls.
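The bit packing described above can be sketched as follows. This is a minimal illustration, not the actual Hyperlight code: the bit positions come from the list above, but the `PackedState` type and the `request_cancel` name are hypothetical, and `set_running` is simplified to a plain `fetch_or`.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Bit positions as listed above; everything else here is illustrative.
const CANCEL_BIT: u8 = 1 << 0;
const RUNNING_BIT: u8 = 1 << 1;
const DEBUG_INTERRUPT_BIT: u8 = 1 << 2;

/// Hypothetical stand-in for the packed `state` field.
struct PackedState(AtomicU8);

impl PackedState {
    fn new() -> Self {
        PackedState(AtomicU8::new(0))
    }

    /// Simplified: sets RUNNING_BIT and leaves the other bits untouched.
    fn set_running(&self) {
        self.0.fetch_or(RUNNING_BIT, Ordering::Release);
    }

    /// Clears RUNNING_BIT only, so a pending cancel survives vCPU exits.
    fn clear_running(&self) {
        self.0.fetch_and(!RUNNING_BIT, Ordering::Release);
    }

    /// Hypothetical name for the first step of `kill()`.
    fn request_cancel(&self) {
        self.0.fetch_or(CANCEL_BIT, Ordering::Release);
    }

    /// Called at the start of each new guest function call.
    fn clear_cancel(&self) {
        self.0.fetch_and(!CANCEL_BIT, Ordering::Release);
    }

    /// A single Acquire load yields a consistent snapshot of all three flags.
    fn get_running_cancel_debug(&self) -> (bool, bool, bool) {
        let s = self.0.load(Ordering::Acquire);
        (
            (s & RUNNING_BIT) != 0,
            (s & CANCEL_BIT) != 0,
            (s & DEBUG_INTERRUPT_BIT) != 0,
        )
    }
}
```

Because each setter touches only its own bit, clearing `RUNNING_BIT` on a vCPU exit cannot accidentally drop a pending cancellation request.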
23+
24+
### Signal Mechanism
25+
26+
On Linux, Hyperlight uses `SIGRTMIN + offset` (configurable, default offset is 0) to interrupt the vCPU thread. The signal handler is intentionally a no-op - the signal's only purpose is to cause a VM exit via `EINTR` from the `ioctl` call that runs the vCPU.
27+
28+
## Run Loop Flow
29+
30+
The main execution loop in `VirtualCPU::run()` coordinates vCPU execution with potential interrupts.
31+
32+
```mermaid
33+
sequenceDiagram
34+
participant Caller as Caller (call())
35+
participant vCPU as vCPU (run())
36+
participant IH as InterruptHandle
37+
38+
Note over Caller: === TIMING POINT 1 ===
39+
Caller->>IH: clear_cancel()
40+
Note right of Caller: Start of cancellable window
41+
42+
Caller->>vCPU: run()
43+
activate vCPU
44+
45+
loop Run Loop
46+
Note over vCPU: === TIMING POINT 2 ===
47+
vCPU->>IH: set_tid()
48+
vCPU->>IH: set_running()
49+
Note right of vCPU: Enable signal delivery
50+
51+
vCPU->>IH: is_cancelled()
52+
53+
alt is_cancelled() == true
54+
vCPU-->>Caller: return Cancelled()
55+
else is_cancelled() == false
56+
Note over vCPU: === TIMING POINT 3 ===
57+
vCPU->>vCPU: run_vcpu() (Enter Guest)
58+
activate vCPU
59+
60+
alt Guest completes normally
61+
vCPU-->>vCPU: VmExit::Halt()
62+
else Guest performs I/O
63+
vCPU-->>vCPU: VmExit::IoOut()/MmioRead()
64+
else Signal received
65+
vCPU-->>vCPU: VmExit::Cancelled()
66+
end
67+
deactivate vCPU
68+
end
69+
70+
Note over vCPU: === TIMING POINT 4 ===
71+
vCPU->>IH: clear_running()
72+
Note right of vCPU: Disable signal delivery
73+
74+
Note over vCPU: === TIMING POINT 5 ===
75+
vCPU->>IH: is_cancelled()
76+
IH-->>vCPU: cancel_requested (bool)
77+
Note right of vCPU: Check if we should exit
78+
79+
Note over vCPU: === TIMING POINT 6 ===
80+
81+
alt Exit reason is Halt
82+
vCPU-->>Caller: return Ok(())
83+
else Exit reason is Cancelled AND cancel_requested==true
84+
vCPU-->>Caller: return Err(ExecutionCanceledByHost)
85+
else Exit reason is Cancelled AND cancel_requested==false
86+
Note right of vCPU: Stale signal, retry
87+
vCPU->>vCPU: continue (retry iteration)
88+
else Exit reason is I/O or host call
89+
vCPU->>vCPU: Handle and continue loop
90+
end
91+
end
92+
deactivate vCPU
93+
```
94+
95+
### Detailed Run Loop Steps
96+
97+
1. **Timing Point 1** - Start of Guest Call (in `call()`):
98+
- `clear_cancel()` resets the cancellation state *before* `run()` is called.
99+
- Any `kill()` completed before this point is ignored.
100+
101+
2. **Timing Point 2** - Start of Loop Iteration:
102+
- `set_tid()` records the current thread, then `set_running()` enables signal delivery.
103+
- Checks `is_cancelled()` immediately to handle pre-run cancellation.
104+
105+
3. **Timing Point 3** - Guest Entry:
106+
- Enters guest execution.
107+
- If `kill()` happens now, signals will interrupt the guest.
108+
109+
4. **Timing Point 4** - Guest Exit:
110+
- `clear_running()` disables signal delivery.
111+
- Signals sent after this point are ignored.
112+
113+
5. **Timing Point 5** - Capture State:
114+
- `is_cancelled()` captures the cancellation request state.
115+
- This determines if a `Cancelled` exit was genuine or stale.
116+
117+
6. **Timing Point 6** - Handle Exit:
118+
- Processes the exit reason based on the captured `cancel_requested` state.
119+
- If `Cancelled` but `!cancel_requested`, it's a stale signal -> retry.
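The decision made at Timing Point 6 can be sketched as a pure function of the exit reason and the `cancel_requested` value captured at Timing Point 5. The `VmExit` and `Step` enums here are simplified, hypothetical stand-ins for the real types.

```rust
/// Simplified exit reasons; the real `VmExit` has more variants.
#[derive(Debug, PartialEq)]
enum VmExit {
    Halt,
    Io,
    Cancelled,
}

/// Hypothetical description of what the run loop does next.
#[derive(Debug, PartialEq)]
enum Step {
    ReturnOk,
    ReturnCancelled,
    RetryStale,
    HandleAndContinue,
}

/// `cancel_requested` is the value captured right after `clear_running()`.
fn handle_exit(exit: VmExit, cancel_requested: bool) -> Step {
    match (exit, cancel_requested) {
        // The guest finished: the call succeeds.
        (VmExit::Halt, _) => Step::ReturnOk,
        // A genuine cancellation: surface it to the caller.
        (VmExit::Cancelled, true) => Step::ReturnCancelled,
        // A signal arrived but nobody asked to cancel this call: retry.
        (VmExit::Cancelled, false) => Step::RetryStale,
        // I/O or host call: service it and loop again.
        (VmExit::Io, _) => Step::HandleAndContinue,
    }
}
```

Only the `Cancelled` exit consults `cancel_requested`; that single check is what separates a genuine `kill()` from a stale signal.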
120+
121+
## Kill Operation Flow
122+
123+
The `kill()` operation involves setting the CANCEL_BIT and sending signals to interrupt the vCPU:
124+
125+
```mermaid
126+
sequenceDiagram
127+
participant Caller as Caller Thread
128+
participant IH as InterruptHandle
129+
participant Signal as Signal Delivery
130+
participant vCPU as vCPU Thread
131+
132+
Caller->>IH: kill()
133+
activate IH
134+
135+
IH->>IH: fetch_or(CANCEL_BIT, Release)
136+
Note right of IH: Atomically set cancel=true<br/>with Release ordering
137+
138+
IH->>IH: send_signal()
139+
activate IH
140+
141+
loop Retry Loop
142+
IH->>IH: get_running_cancel_debug()
143+
Note right of IH: Load with Acquire ordering
144+
145+
alt Not running OR not cancelled
146+
IH-->>IH: break (sent_signal=false/true)
147+
else Running AND cancelled
148+
IH->>IH: tid.load(Acquire)
149+
IH->>Signal: pthread_kill(tid, SIGRTMIN+offset)
150+
activate Signal
151+
Note right of Signal: Send signal to vCPU thread
152+
Signal->>vCPU: SIGRTMIN+offset delivered
153+
Note right of vCPU: Signal handler is no-op<br/>Purpose is to cause EINTR
154+
deactivate Signal
155+
156+
alt Signal arrives during ioctl
157+
vCPU->>vCPU: ioctl returns EINTR
158+
vCPU->>vCPU: return VmExit::Cancelled()
159+
else Signal arrives between ioctls
160+
Note right of vCPU: Signal is harmless
161+
end
162+
163+
IH->>IH: sleep(retry_delay)
164+
Note right of IH: Default 500μs between retries
165+
end
166+
end
167+
168+
deactivate IH
169+
IH-->>Caller: sent_signal
170+
deactivate IH
171+
```
172+
173+
### Kill Operation Steps
174+
175+
1. **Set Cancel Flag**: Atomically set the CANCEL_BIT using `fetch_or(CANCEL_BIT)` with `Release` ordering
176+
- Ensures all writes before `kill()` are visible when vCPU thread checks `is_cancelled()` with `Acquire`
177+
178+
2. **Send Signals**: Enter retry loop via `send_signal()`
179+
- Atomically load running, cancel and debug flags via `get_running_cancel_debug()` with `Acquire` ordering
180+
- Continue if `running=true AND cancel=true` (or `running=true AND debug=true` with gdb)
181+
- Exit loop immediately if `running=false OR (cancel=false AND debug=false)`
182+
183+
3. **Signal Delivery**: Send `SIGRTMIN+offset` via `pthread_kill`
184+
- Signal interrupts the `ioctl` that runs the vCPU, causing `EINTR`
185+
- Signal handler is intentionally a no-op
186+
- Returns `VmExit::Cancelled()` when `EINTR` is received
187+
188+
4. **Loop Termination**: The signal loop terminates when:
189+
- vCPU is no longer running (`running=false`), OR
190+
- No interrupt is requested any longer (`cancel=false`, and `debug=false` when the gdb feature is enabled)
191+
- See the loop termination proof in the source code for rigorous correctness analysis
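The steps above can be sketched as a small retry loop. This is an illustration under stated assumptions, not the real implementation: the `send` closure is a hypothetical stand-in for `pthread_kill`, the debug bit is omitted, and the state is a bare `AtomicU8`.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::time::Duration;

const CANCEL_BIT: u8 = 1 << 0;
const RUNNING_BIT: u8 = 1 << 1;

/// Keeps "signalling" while the vCPU is running and cancel is requested.
/// Returns whether at least one signal was sent.
fn send_signal(state: &AtomicU8, retry_delay: Duration, mut send: impl FnMut()) -> bool {
    let mut sent_signal = false;
    loop {
        // One Acquire load gives a consistent view of both flags.
        let s = state.load(Ordering::Acquire);
        let running = (s & RUNNING_BIT) != 0;
        let cancel = (s & CANCEL_BIT) != 0;
        if !running || !cancel {
            break;
        }
        send(); // stand-in for pthread_kill(tid, SIGRTMIN + offset)
        sent_signal = true;
        std::thread::sleep(retry_delay);
    }
    sent_signal
}

/// Sets the cancel flag first, then enters the retry loop.
fn kill(state: &AtomicU8, retry_delay: Duration, send: impl FnMut()) -> bool {
    state.fetch_or(CANCEL_BIT, Ordering::Release);
    send_signal(state, retry_delay, send)
}
```

If the vCPU is not running when `kill()` is called, the loop exits on its first check and no signal is sent, but `CANCEL_BIT` stays set for the run loop to observe.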
192+
193+
## Memory Ordering Guarantees
194+
195+
Hyperlight uses Release-Acquire semantics to ensure correctness across threads:
196+
197+
```mermaid
198+
graph TB
199+
subgraph "vCPU Thread"
200+
A[set_tid<br/>Store tid with Release]
201+
B[set_running<br/>fetch_update RUNNING_BIT<br/>with Release]
202+
C[is_cancelled<br/>Load with Acquire]
203+
D[clear_running<br/>fetch_and with Release]
204+
J[is_debug_interrupted<br/>Load with Acquire]
205+
end
206+
207+
subgraph "Interrupt Thread"
208+
E[kill<br/>fetch_or CANCEL_BIT<br/>with Release]
209+
F[send_signal<br/>Load running with Acquire]
210+
G[Load tid with Acquire]
211+
H[pthread_kill]
212+
I[kill_from_debugger<br/>fetch_or DEBUG_INTERRUPT_BIT<br/>with Release]
213+
end
214+
215+
B -->|Synchronizes-with| F
216+
A -->|Happens-before via B→F| G
217+
E -->|Synchronizes-with| C
218+
D -->|Synchronizes-with| F
219+
I -->|Synchronizes-with| J
220+
```
221+
222+
### Ordering Rules
223+
224+
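1. **tid Store → running Load**: `set_tid` is sequenced before `set_running` (Release), which synchronizes with the Acquire load of the running flag in `send_signal`. The tid store therefore happens-before the tid load, ensuring the interrupt thread sees the correct thread ID.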
225+
2. **CANCEL_BIT**: `kill` (Release) synchronizes with `is_cancelled` (Acquire), ensuring the vCPU sees the cancellation request.
226+
3. **clear_running**: `clear_running` (Release) synchronizes with `send_signal` (Acquire), ensuring the interrupt thread stops sending signals when the vCPU stops.
227+
4. **clear_cancel**: Uses Release to ensure operations from the previous run are visible to other threads.
228+
5. **dropped flag**: `set_dropped` (Release) synchronizes with `dropped` (Acquire), ensuring cleanup visibility.
229+
6. **debug_interrupt**: `kill_from_debugger` (Release) synchronizes with `is_debug_interrupted` (Acquire), ensuring the vCPU sees the debug interrupt request.
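Rule 1 can be illustrated with a small two-thread sketch. The `Handle` struct and `observe_tid` function are hypothetical; only the ordering pattern (publish `tid`, then set `RUNNING_BIT` with Release; Acquire-load the running flag, then read `tid`) is taken from the rules above.

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;

const RUNNING_BIT: u8 = 1 << 1;

/// Hypothetical, reduced interrupt handle.
struct Handle {
    state: AtomicU8,
    tid: AtomicU64,
}

/// Spawns a "vCPU" thread that publishes `published` as its tid, then
/// returns the tid the "interrupt" side reads once it sees RUNNING_BIT.
fn observe_tid(published: u64) -> u64 {
    let h = Arc::new(Handle {
        state: AtomicU8::new(0),
        tid: AtomicU64::new(0),
    });

    let vcpu = {
        let h = Arc::clone(&h);
        thread::spawn(move || {
            h.tid.store(published, Ordering::Release); // set_tid
            h.state.fetch_or(RUNNING_BIT, Ordering::Release); // set_running
        })
    };

    // Interrupt side: wait until the Acquire load observes RUNNING_BIT.
    while (h.state.load(Ordering::Acquire) & RUNNING_BIT) == 0 {
        std::hint::spin_loop();
    }
    // The Release/Acquire pair on `state` makes the tid store visible here.
    let seen = h.tid.load(Ordering::Acquire);

    vcpu.join().unwrap();
    seen
}
```

Once the interrupt side has observed `RUNNING_BIT`, it cannot read the initial `tid` of 0, which is the guarantee that stops a signal from going to the wrong thread.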
230+
231+
## Interaction with Host Function Calls
232+
233+
When a guest performs a host function call, the vCPU exits and `RUNNING_BIT` is cleared. `CANCEL_BIT` persists, so if `kill()` is called during the host call, cancellation is detected when the guest attempts to resume.
234+
235+
## Signal Behavior Across Loop Iterations
236+
237+
When the run loop iterates (e.g., for host calls):
238+
1. `clear_running()` sets `running=false`, causing any active `send_signal()` loop to exit.
239+
2. `set_running()` sets `running=true` again.
240+
3. `is_cancelled()` detects the persistent `cancel` flag and returns early.
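The three steps above can be sketched with a simplified, single-threaded run loop. The `run` function, the `Outcome` enum and both closures are hypothetical: `enter_guest` returns `true` when the guest halts and `false` when it exits for a host call, and `host_call` runs while `RUNNING_BIT` is clear.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const CANCEL_BIT: u8 = 1 << 0;
const RUNNING_BIT: u8 = 1 << 1;

#[derive(Debug, PartialEq)]
enum Outcome {
    Completed,
    Cancelled,
}

/// Simplified run loop without real signals or a real guest.
fn run(
    state: &AtomicU8,
    mut enter_guest: impl FnMut() -> bool,
    mut host_call: impl FnMut(&AtomicU8),
) -> Outcome {
    loop {
        state.fetch_or(RUNNING_BIT, Ordering::Release); // set_running
        if (state.load(Ordering::Acquire) & CANCEL_BIT) != 0 {
            // is_cancelled: the flag persisted across the previous exit.
            state.fetch_and(!RUNNING_BIT, Ordering::Release);
            return Outcome::Cancelled;
        }
        let halted = enter_guest();
        state.fetch_and(!RUNNING_BIT, Ordering::Release); // clear_running
        if halted {
            return Outcome::Completed;
        }
        host_call(state); // RUNNING_BIT is clear while the host call runs
    }
}
```

If `host_call` sets `CANCEL_BIT` (simulating a `kill()` that lands during a host function), the next iteration returns `Outcome::Cancelled` without re-entering the guest.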
241+
242+
## Race Conditions
243+
244+
1. **kill() between calls**: `clear_cancel()` at Timing Point 1 ensures `kill()` requests from before the current call are ignored.
245+
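2. **kill() before run_vcpu()**: If `kill()` lands after the `is_cancelled()` check but before the guest is entered, a signal that arrives outside the `ioctl` is harmless; the `send_signal()` retry loop keeps sending until one interrupts the guest.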
246+
3. **Guest completes before signal**: If the guest finishes naturally, the signal is ignored or causes a retry in the next iteration (handled as stale).
247+
4. **Stale signals**: If a signal from a previous call arrives during a new call, `cancel_requested` (checked at Timing Point 5) will be false, causing a retry.
248+
5. **ABA Problem**: Clearing `CANCEL_BIT` at the start of each guest function call (before `run()` is invoked) breaks any ongoing `send_signal()` loops from previous calls.
249+
250+
## Windows Platform Differences
251+
252+
While the core cancellation mechanism follows the same conceptual model on Windows, there are several platform-specific differences in implementation:
253+
254+
### WindowsInterruptHandle Structure
255+
256+
The `WindowsInterruptHandle` uses a simpler structure compared to Linux:
257+
258+
- **state (AtomicU8)**: Packs three bits (RUNNING_BIT, CANCEL_BIT and DEBUG_INTERRUPT_BIT)
259+
- **partition_handle**: Windows Hyper-V partition handle for the VM
260+
- **dropped (AtomicBool)**: Set when the corresponding VM has been dropped
261+
262+
**Key difference**: No `tid` field is needed because Windows doesn't use thread-targeted signals. No `retry_delay` or `sig_rt_min_offset` fields are needed.
263+
264+
### Kill Operation Differences
265+
266+
On Windows, the `kill()` method uses the Windows Hypervisor Platform (WHP) API `WHvCancelRunVirtualProcessor` instead of POSIX signals to interrupt the vCPU:
267+
268+
**Key differences**:
269+
1. **No signal loop**: Windows calls `WHvCancelRunVirtualProcessor()` at most once in `kill()`, without needing retries
270+
271+
### Why Linux Needs a Retry Loop but Windows Doesn't
272+
273+
The fundamental difference between the platforms lies in how cancellation interacts with the hypervisor:
274+
275+
**Linux (KVM/mshv3)**: POSIX signals can only interrupt the vCPU when the thread is executing kernel code (specifically, during the `ioctl` syscall that runs the vCPU). There is a narrow timing window between when the signal is sent and when the vCPU enters guest mode. If a signal arrives before entering guest mode, it will be delivered but won't interrupt the guest execution. This requires repeatedly sending signals with delays until either:
276+
- The vCPU exits (and consequently RUNNING_BIT becomes false), or
277+
- The cancellation is cleared (CANCEL_BIT becomes false)
278+
279+
**Windows (WHP)**: The `WHvCancelRunVirtualProcessor()` API sets an internal `CancelPending` flag in the Windows Hypervisor Platform. This flag is:
280+
- Set immediately by the API call
281+
- Checked at the start of each VM run loop iteration (before entering guest mode)
282+
- Automatically cleared when it causes a `WHvRunVpExitReasonCanceled` exit
283+
284+
This means if `WHvCancelRunVirtualProcessor()` is called:
285+
- **While the vCPU is running**: The API signals the hypervisor to exit with `WHvRunVpExitReasonCanceled`
286+
- **Before VM runs**: The `CancelPending` flag persists and causes an immediate cancellation on the next VM run attempt
287+
288+
Therefore, we only call `WHvCancelRunVirtualProcessor()` after checking that `RUNNING_BIT` is set. This is important because:
289+
1. If called when the vCPU is not running, the API would still succeed and would unconditionally cancel the next run attempt. This is undesirable because `kill()` should have no effect when the vCPU is not running
290+
2. This makes the InterruptHandle's `CANCEL_BIT` (which is cleared at the start of each guest function call) the source of truth for whether cancellation is intended for the current call
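The gating described above can be sketched as follows. This is an assumption-laden illustration rather than the actual Windows code: `cancel_vp` is a hypothetical closure standing in for the call to `WHvCancelRunVirtualProcessor()`, and reading `RUNNING_BIT` from the value returned by `fetch_or` is one possible way to get a consistent snapshot.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const CANCEL_BIT: u8 = 1 << 0;
const RUNNING_BIT: u8 = 1 << 1;

/// Sets CANCEL_BIT, then invokes `cancel_vp` at most once, and only if the
/// vCPU was running. Returns whether the cancel call was issued.
fn kill(state: &AtomicU8, mut cancel_vp: impl FnMut()) -> bool {
    // fetch_or returns the previous state, so the RUNNING_BIT check and the
    // CANCEL_BIT update come from the same atomic operation.
    let previous = state.fetch_or(CANCEL_BIT, Ordering::AcqRel);
    let running = (previous & RUNNING_BIT) != 0;
    if running {
        cancel_vp(); // stand-in for WHvCancelRunVirtualProcessor()
    }
    running
}
```

When the vCPU is idle, only `CANCEL_BIT` is recorded, so no stale `CancelPending` flag is left behind to cancel a later, unrelated run.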
291+
