RDMA issue: failed to create extended queue pair (QP): Operation not supported #493

### 🐛 Describe the bug

Hello torchforgers! 

I've successfully run the install command on a single node with 8 x H100 GPUs, but when I then try to run the GRPO example with:

```sh
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

I get a peculiar RDMA error:

```
[RLTrainer-0/1] 2025-10-22 23:21:36 INFO Pushing weights for policy version 1
failed to create extended queue pair (QP): Operation not supported
[0]E1022 23:21:36.634458 2876119 hyperactor/src/proc.rs:1175] unix:@JEdzy6nlnlU4L0ytzv3i4UOA,anon_0_1BsCGXqAsYRV,rdma_manager[0]: actor failure: serving unix:@JEdzy6nlnlU4L0ytzv3i4UOA,anon_0_1BsCGXqAsYRV,rdma_manager[0]: processing error: could not create loopback QP for device rdmap79s0: failed to create queue pair (QP): Invalid argument (os error 22)
```

Any idea what might cause this?

Full stack trace attached:
[forge-bug.txt](https://github.com/user-attachments/files/23064001/forge-bug.txt)
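
In case it helps with triage, here is a minimal sketch for checking which RDMA devices a node exposes and which kernel driver backs each one (for example, whether `rdmap79s0` is an `mlx5_core` device or something else). It assumes the standard Linux sysfs layout under `/sys/class/infiniband`; the function name `list_rdma_devices` is hypothetical and not part of torchforge or Monarch.

```python
import os
from pathlib import Path


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return {device_name: driver_name or None} for each RDMA device found.

    Assumption: RDMA devices are registered under `sysfs_root`, and each
    device directory has a `device/driver` symlink pointing at its driver.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        # No RDMA devices registered, or not a Linux host.
        return {}

    devices = {}
    for dev in sorted(root.iterdir()):
        driver_link = dev / "device" / "driver"
        if driver_link.exists():
            # The basename of the symlink target is the driver name.
            devices[dev.name] = os.path.basename(os.path.realpath(driver_link))
        else:
            devices[dev.name] = None
    return devices


if __name__ == "__main__":
    for name, driver in list_rdma_devices().items():
        print(f"{name}: driver={driver}")
```

Running it on the affected node prints one line per device, which can be pasted into the issue alongside the stack trace.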

### Versions

_No response_
