🐛 Describe the bug
Hello torchforgers!
I've successfully run the install command on a single node with 8x H100 GPUs, but when I then try to run the GRPO example with:
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
I get a peculiar RDMA error:
[RLTrainer-0/1] 2025-10-22 23:21:36 INFO Pushing weights for policy version 1
failed to create extended queue pair (QP): Operation not supported
[0]E1022 23:21:36.634458 2876119 hyperactor/src/proc.rs:1175] unix:@JEdzy6nlnlU4L0ytzv3i4UOA,anon_0_1BsCGXqAsYRV,rdma_manager[0]: actor failure: serving unix:@JEdzy6nlnlU4L0ytzv3i4UOA,anon_0_1BsCGXqAsYRV,rdma_manager[0]: processing error: could not create loopback QP for device rdmap79s0: failed to create queue pair (QP): Invalid argument (os error 22)
Any idea what might cause this?
Full stack trace attached:
forge-bug.txt
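In case it helps with triage, here is a minimal sketch for listing the RDMA devices on the node, along with the kernel driver and per-port link layer of each. This would show what kind of device `rdmap79s0` is. It assumes the standard Linux sysfs layout under `/sys/class/infiniband`. The helper name `list_rdma_devices` is hypothetical and not part of forge or Monarch. It only reports what the kernel exposes and does not try to create a queue pair, so it cannot confirm the cause of the QP failure by itself.

```python
import os


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return one dict per RDMA device found under sysfs_root.

    Hypothetical helper for illustration; assumes the standard Linux
    sysfs layout for RDMA devices.
    """
    if not os.path.isdir(sysfs_root):
        # No RDMA devices registered (or not running on Linux).
        return []

    devices = []
    for name in sorted(os.listdir(sysfs_root)):
        dev_dir = os.path.join(sysfs_root, name)

        # device/driver is a symlink to the kernel driver bound to the device.
        driver_link = os.path.join(dev_dir, "device", "driver")
        driver = None
        if os.path.exists(driver_link):
            driver = os.path.basename(os.path.realpath(driver_link))

        # Each port directory exposes a link_layer file.
        link_layers = {}
        ports_dir = os.path.join(dev_dir, "ports")
        if os.path.isdir(ports_dir):
            for port in sorted(os.listdir(ports_dir)):
                path = os.path.join(ports_dir, port, "link_layer")
                if os.path.isfile(path):
                    with open(path) as f:
                        link_layers[port] = f.read().strip()

        devices.append(
            {"name": name, "driver": driver, "link_layers": link_layers}
        )
    return devices


if __name__ == "__main__":
    for dev in list_rdma_devices():
        print(dev)
```

The driver name and link layer should indicate which provider backs the device, which may be relevant because not every RDMA provider supports every queue pair type.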
Versions
No response