Closed
Description
Hi, the following code has been failing since commit a3d8d68:
import cupy as cp
from mscclpp import TcpBootstrap, Communicator, Transport
from mscclpp._mscclpp import RawGpuBuffer

bootstrap = TcpBootstrap.create(0, 1)
comm = Communicator(bootstrap)
cp.cuda.Device(0).use()
memory = RawGpuBuffer(1024 * 1024)
data_ptr = memory.data()
comm.register_memory(data_ptr, 1024 * 1024, Transport.IB0)
Traceback (most recent call last):
  File "/root/bug.py", line 10, in <module>
    comm.register_memory(data_ptr, 1024 * 1024, Transport.IB0)
mscclpp._mscclpp.IbError: (14, 'ibv_reg_mr failed (errno 14) (Ib failure: Bad address)')
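For reference, the errno 14 in the message above can be decoded with the Python standard library. This is only a small sketch to name the error code; `describe_errno` is a hypothetical helper, not part of mscclpp, and it says nothing about why `ibv_reg_mr` rejects the buffer.

```python
import errno
import os

# Value reported by ibv_reg_mr in the traceback above.
ERRNO_FROM_TRACEBACK = 14


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value (hypothetical helper)."""
    # errno.errorcode maps numeric codes to symbolic names such as 'EFAULT'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's human-readable message for the code.
    return f"{name}: {os.strerror(code)}"


print(describe_errno(ERRNO_FROM_TRACEBACK))
```

On Linux, errno 14 is `EFAULT`, which matches the "Bad address" text in the exception.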
Environment:
(base) root@xxx:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(base) root@xxx:~# nvidia-smi
Sun Apr 6 15:37:56 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
| N/A   29C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
...8xH100