[Bug] IB Error after removing requirement for CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED #496

Open
liangyuRain opened this issue Apr 6, 2025 · 1 comment

@liangyuRain

Hi, the following code broke right after commit a3d8d68:

import cupy as cp
from mscclpp import TcpBootstrap, Communicator, Transport
from mscclpp._mscclpp import RawGpuBuffer
bootstrap = TcpBootstrap.create(0, 1)
comm = Communicator(bootstrap)

cp.cuda.Device(0).use()
memory = RawGpuBuffer(1024 * 1024)
data_ptr = memory.data()
comm.register_memory(data_ptr, 1024 * 1024, Transport.IB0)

Traceback (most recent call last):
  File "/root/bug.py", line 10, in <module>
    comm.register_memory(data_ptr, 1024 * 1024, Transport.IB0)
mscclpp._mscclpp.IbError: (14, 'ibv_reg_mr failed (errno 14) (Ib failure: Bad address)')
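
For reference, the errno in the error message can be decoded with Python's standard library alone. This is only a small sketch: the `message` string is copied from the traceback above, and the regular expression assumes the `errno N` wording shown there. Errno 14 maps to EFAULT, which suggests that ibv_reg_mr could not access the address range it was given.

```python
import errno
import os
import re

# Message text copied from the traceback above.
message = "ibv_reg_mr failed (errno 14) (Ib failure: Bad address)"

# Pull the numeric errno out of the message text.
match = re.search(r"errno (\d+)", message)
code = int(match.group(1))

# Map the number to its symbolic name and the OS-provided description.
name = errno.errorcode[code]
description = os.strerror(code)

print(f"errno {code} = {name}: {description}")
```

The description string comes from the operating system, so its exact wording may differ between platforms.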

Environment:

(base) root@xxx:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(base) root@xxx:~# nvidia-smi
Sun Apr  6 15:37:56 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
| N/A   29C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
...8xH100
@Binyang2014
Contributor

Thanks for reporting. This looks related to NVIDIA/gdrcopy#266. We will try the ibv_reg_dmabuf_mr API.
