# Util CQ Bypass Optimization

## Overview

Completion queues ([fi_cq](https://ofiwg.github.io/libfabric/main/man/fi_cq.3.html)) are critical resources in libfabric that applications poll to receive completion notifications for asynchronous operations. This document illustrates a performance optimization that eliminates unnecessary memory copies in the completion path.

## Problem Statement

Before the optimization, libfabric providers followed a staged completion process:
1. The provider receives completion events from the NIC device.
2. The provider writes completion entries to a staging buffer (the util CQ).
3. The application polls the completion queue.
4. The util CQ copies entries from the staging buffer into the application-provided buffer.

This approach introduced an additional memory copy per completion, adding latency and consuming CPU cycles; the impact is most pronounced in high-frequency completion scenarios.
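
To make the two copies concrete, here is a minimal sketch of the staged path. The struct, buffer, and function names below are illustrative placeholders, not libfabric's actual internals:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder completion entry; real applications receive struct
 * fi_cq_entry / fi_cq_data_entry / fi_cq_tagged_entry, etc. */
struct cq_entry {
	void    *op_context;
	uint64_t flags;
	size_t   len;
};

#define STAGE_DEPTH 1024
static struct cq_entry staging[STAGE_DEPTH];	/* simplified util CQ ring */
static size_t stage_head, stage_tail;

/* Step 2: provider progress copies the device completion into the
 * staging buffer -- first copy. */
static void stage_completion(const struct cq_entry *device_entry)
{
	staging[stage_tail++ % STAGE_DEPTH] = *device_entry;
}

/* Step 4: the CQ read path copies staged entries into the application's
 * buffer -- second copy. */
static size_t staged_cq_read(struct cq_entry *app_buf, size_t count)
{
	size_t n = 0;

	while (n < count && stage_head != stage_tail)
		memcpy(&app_buf[n++], &staging[stage_head++ % STAGE_DEPTH],
		       sizeof(struct cq_entry));
	return n;
}
```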

## Solution: Direct Completion Writing

The optimization bypasses the util CQ staging buffer by enabling providers to write completion entries directly to the application-provided buffer. This eliminates the intermediate staging step and its associated memory copy overhead.
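
For comparison, a sketch of the direct path; `poll_device_cq()` is a hypothetical stand-in for the provider's device-CQ polling loop, not a real API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cq_entry {
	void    *op_context;
	uint64_t flags;
	size_t   len;
};

/* Hypothetical stand-in: fetch one completion from the device CQ,
 * returning false when the device CQ is empty. */
static bool poll_device_cq(struct cq_entry *out)
{
	(void) out;
	return false;	/* stub so the sketch compiles */
}

/* The buffer the application passed to fi_cq_read() is filled in place:
 * each completion is written exactly once, directly into app memory. */
static size_t direct_cq_read(struct cq_entry *app_buf, size_t count)
{
	size_t n = 0;

	while (n < count && poll_device_cq(&app_buf[n]))
		n++;
	return n;
}
```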

### Performance Benefits
- **Reduced Latency**: Eliminates one memory copy operation per completion
- **Lower CPU Overhead**: Fewer memory operations reduce CPU utilization
- **Improved Cache Efficiency**: Skipping the staging buffer avoids polluting the cache with intermediate entries and reduces memory bandwidth usage
- **Better Scalability**: Performance gains increase with completion frequency

### When This Optimization Applies
This optimization is currently available in:
- **EFA-direct fabric**: Direct completion writing bypasses util CQ staging

This optimization is **not available** when:
- **A counter is bound to the CQ**: Applications can call `fi_cntr_read()` to retrieve completion counts without reading CQ entries, so entries must remain staged for deferred CQ reads
- **EFA fabric (non-direct)**: Staging is required to support `fi_cq_read()` with a NULL buffer and a count of 0, which drives provider progress in `FI_PROGRESS_MANUAL` mode

The key factor is whether CQ entry staging is necessary. As illustrated in the sketch after this list, staging is required when applications need to:
- Progress the provider without consuming completion entries
- Read completion counts independently of CQ entry consumption
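
Both of these patterns are ordinary calls from the application's point of view. A sketch, assuming `cq` and `cntr` are a previously opened completion queue and counter:

```c
#include <stdint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_domain.h>

/* Drive provider progress in FI_PROGRESS_MANUAL mode without consuming
 * completion entries: a NULL buffer and a count of 0 mean "just progress". */
static void make_progress(struct fid_cq *cq)
{
	(void) fi_cq_read(cq, NULL, 0);
}

/* Read the completion count from a bound counter without touching the CQ;
 * the matching CQ entries must stay staged for a later fi_cq_read(). */
static uint64_t completions_so_far(struct fid_cntr *cntr)
{
	return fi_cntr_read(cntr);
}
```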

## Implementation Comparison

```mermaid
sequenceDiagram
title Before Optimization - Using Util CQ Staging Buffer
participant DeviceCQ as Device CQ
participant Provider as EFA Provider
participant UtilCQ as Util CQ
participant Application

Note over DeviceCQ,Application: Completion Path with Staging Buffer
DeviceCQ->>Provider: Completion Event
Provider->>UtilCQ: Write completion entry to staging buffer
Application->>UtilCQ: Poll for completion
UtilCQ->>Application: Copy entry to application buffer

Note over DeviceCQ,Application: Additional memcpy overhead
```


```mermaid
sequenceDiagram
title After Optimization - Direct Completion Writing
participant DeviceCQ as Device CQ
participant Provider as EFA Provider
participant Application

Note over DeviceCQ,Application: Optimized Completion Path
DeviceCQ->>Provider: Completion Event
Provider->>Application: Write completion entry directly to application buffer
Application->>Application: Poll completion buffer

Note over DeviceCQ,Application: Eliminated extra memcpy through staging buffer
```

## Special Case: EP Close Handling
During endpoint closure, both the EFA and EFA-direct fabrics flush any outstanding completion entries from the device CQ and stage them into the util CQ. To handle this corner case, the optimized CQ read path begins with a lightweight check of whether the util CQ is empty; if staged entries exist, they are read from the util CQ first. This emptiness check adds minimal overhead to the fast path.
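
A hedged sketch of that read path; `demo_cq`, `staging_is_empty()`, `read_staged()`, and `read_device_direct()` are hypothetical names standing in for the provider's real routines in `efa_cq.c`:

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical CQ object: a staging ring plus a device CQ handle. */
struct demo_cq {
	size_t staged_head;
	size_t staged_tail;
	/* ... device CQ handle, locks, etc. ... */
};

static bool staging_is_empty(struct demo_cq *cq)
{
	return cq->staged_head == cq->staged_tail;
}

/* Stand-ins for the staged (slow) and direct (fast) read routines. */
ssize_t read_staged(struct demo_cq *cq, void *buf, size_t count);
ssize_t read_device_direct(struct demo_cq *cq, void *buf, size_t count);

/* Optimized read: drain EP-close leftovers from the util CQ if any exist,
 * otherwise write completions straight into the application's buffer. */
static ssize_t demo_cq_read(struct demo_cq *cq, void *buf, size_t count)
{
	if (!staging_is_empty(cq))		/* lightweight emptiness check */
		return read_staged(cq, buf, count);

	return read_device_direct(cq, buf, count);
}
```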

```mermaid
sequenceDiagram
title Optimized Path with EP Close Handling
participant DeviceCQ as Device CQ
participant Provider as EFA Provider
participant UtilCQ as Util CQ
participant Application

Note over DeviceCQ,Application: Normal Operation - Direct Path
DeviceCQ->>Provider: Completion Event
Provider->>Application: Write directly to application buffer

Note over DeviceCQ,Application: EP Close Scenario
Provider->>UtilCQ: Flush outstanding entries during EP close
Application->>Provider: Poll for completion
Provider->>Provider: Check if util CQ is empty (lightweight)
alt Util CQ has staged entries
Provider->>UtilCQ: Read from staging buffer
UtilCQ->>Application: Return staged entries
else Util CQ is empty
Provider->>Application: Write directly to application buffer
end

Note over DeviceCQ,Application: Minimal overhead from the emptiness check
```

## Error Handling in Direct CQ Read Path

The optimized CQ read path maintains proper error handling semantics while preserving the poll state across multiple function calls. When a failed completion is encountered, the provider stops processing and keeps the device CQ in a poll-active state until the error is consumed.
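
A simplified sketch of how the poll state and cached error can be carried across calls. `ibv_start_poll()`, `ibv_next_poll()`, `ibv_end_poll()`, and the `status`/`wr_id` fields are the real ibverbs extended-CQ API; the `demo_cq` structure and the way entries are filled in are illustrative assumptions, not the provider's actual code:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <infiniband/verbs.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Illustrative per-CQ state; the real provider keeps equivalent fields. */
struct demo_cq {
	struct ibv_cq_ex *ibv_cq;
	bool poll_active;	/* ibv_start_poll() open, ibv_end_poll() pending */
	bool error_cached;	/* a failed CQE awaits fi_cq_readerr() */
};

/* Read successful completions directly into the application's buffer,
 * stopping at the first failed CQE and leaving the poll session open. */
static ssize_t demo_cq_read(struct demo_cq *cq, struct fi_cq_data_entry *buf,
			    size_t count)
{
	size_t n = 0;

	if (cq->error_cached)
		return -FI_EAVAIL;	/* error must be drained first */

	if (!cq->poll_active) {
		struct ibv_poll_cq_attr attr = {0};

		if (ibv_start_poll(cq->ibv_cq, &attr))
			return -FI_EAGAIN;	/* empty (or, simplified, any error) */
		cq->poll_active = true;
	}

	while (n < count) {
		if (cq->ibv_cq->status != IBV_WC_SUCCESS) {
			cq->error_cached = true;	/* report on a later call */
			break;
		}
		/* Simplified: real code recovers the op context from wr_id. */
		buf[n].op_context = (void *) (uintptr_t) cq->ibv_cq->wr_id;
		buf[n].flags = 0;
		buf[n].len = ibv_wc_read_byte_len(cq->ibv_cq);
		n++;

		if (ibv_next_poll(cq->ibv_cq) == ENOENT) {
			ibv_end_poll(cq->ibv_cq);
			cq->poll_active = false;
			break;
		}
	}

	if (!n)
		return cq->error_cached ? -FI_EAVAIL : -FI_EAGAIN;
	return (ssize_t) n;
}

/* fi_cq_readerr() counterpart: report the cached error, then end the poll
 * session so the next read starts fresh. */
static ssize_t demo_cq_readerr(struct demo_cq *cq, struct fi_cq_err_entry *err)
{
	if (!cq->error_cached)
		return -FI_EAGAIN;

	err->prov_errno = cq->ibv_cq->status;	/* raw vendor status */
	err->err = FI_EIO;			/* simplified error mapping */

	ibv_end_poll(cq->ibv_cq);
	cq->poll_active = false;
	cq->error_cached = false;
	return 1;
}
```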

### Error Handling Flow

Consider a scenario with 3 completion entries in the device CQ: 2 successful completions followed by 1 failed completion.

```mermaid
sequenceDiagram
title Error Handling - First fi_cq_read Call
participant DeviceCQ as Device CQ
participant Provider as EFA Provider
participant Application

Note over DeviceCQ,Application: Device CQ contains: [Success, Success, Error]
Application->>Provider: fi_cq_read(cq_fid, buf, 3)
Provider->>DeviceCQ: Start polling (ibv_start_poll)
DeviceCQ->>Provider: CQE 1 (Success)
Provider->>Application: Write entry 1 to buffer
DeviceCQ->>Provider: CQE 2 (Success)
Provider->>Application: Write entry 2 to buffer
DeviceCQ->>Provider: CQE 3 (Error)
Provider->>Provider: Stop processing at error
Provider->>Application: Return 2 (successful entries)

Note over DeviceCQ,Provider: Device CQ remains in poll-active state
Note over DeviceCQ,Provider: Error CQE cached for next call
```

```mermaid
sequenceDiagram
title Error Handling - Second fi_cq_read Call
participant DeviceCQ as Device CQ
participant Provider as EFA Provider
participant Application

Note over DeviceCQ,Provider: Device CQ still in poll-active state
Note over DeviceCQ,Provider: Cached error CQE from previous call
Application->>Provider: fi_cq_read(cq_fid, buf, count)
Provider->>Provider: Process cached error CQE
Provider->>Application: Return -FI_EAVAIL

Note over DeviceCQ,Provider: Device CQ remains in poll-active state
Note over DeviceCQ,Provider: Error must be consumed via fi_cq_readerr
```

```mermaid
sequenceDiagram
title Error Handling - fi_cq_readerr Call
participant DeviceCQ as Device CQ
participant Provider as EFA Provider
participant Application

Note over DeviceCQ,Provider: Device CQ in poll-active state with cached error
Application->>Provider: fi_cq_readerr(cq_fid, err_buf, flags)
Provider->>Application: Copy error details to err_buf
Provider->>DeviceCQ: End polling (ibv_end_poll)
Provider->>Application: Return 1 (error consumed)

Note over DeviceCQ,Provider: Device CQ poll state ended
Note over DeviceCQ,Provider: Next fi_cq_read can start fresh polling
```

### Key Error Handling Characteristics

- **Poll State Persistence**: The device CQ remains in poll-active state across multiple `fi_cq_read()` calls until the error is consumed
- **Error Caching**: Failed CQEs are cached to ensure they're returned on subsequent `fi_cq_read()` calls
- **Atomic Error Consumption**: Only `fi_cq_readerr()` ends the poll state and allows progression to subsequent CQEs
- **Consistent Semantics**: Error handling behavior matches standard libfabric CQ semantics despite the direct write optimization, as the application-side example below illustrates
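
From the application's perspective the optimization changes nothing: the usual read/readerr loop works as before. A sketch using the standard libfabric calls, with error handling simplified:

```c
#include <stdio.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

#define BATCH 16

/* Poll a CQ once, reporting successes and draining any failed completion. */
static void poll_cq_once(struct fid_cq *cq)
{
	struct fi_cq_data_entry comp[BATCH];
	struct fi_cq_err_entry err = {0};
	ssize_t ret;

	ret = fi_cq_read(cq, comp, BATCH);
	if (ret > 0) {
		printf("got %zd completions\n", ret);
	} else if (ret == -FI_EAVAIL) {
		/* A failed completion is pending; fi_cq_readerr() consumes it
		 * and lets the provider resume polling the device CQ. */
		ret = fi_cq_readerr(cq, &err, 0);
		if (ret > 0)
			fprintf(stderr, "completion error: %s\n",
				fi_cq_strerror(cq, err.prov_errno,
					       err.err_data, NULL, 0));
	} else if (ret != -FI_EAGAIN) {
		fprintf(stderr, "fi_cq_read failed: %zd\n", ret);
	}
}
```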

## Code References

**Util CQ staging implementation:**
- [util_cq.c:263](https://github.com/ofiwg/libfabric/blob/main/prov/util/src/util_cq.c#L263) - Core util CQ staging logic

**EFA provider's previous staging approach:**
- [efa_cq.c:137-168](https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/efa_cq.c#L137-L168) - How EFA used util CQ for staging completions

**EFA provider's optimized direct completion path:**
- [efa_cq.c:686-783](https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/efa_cq.c#L686-L783) - New CQ read implementation with direct writing