[Bug] IB Error after removing requirement for CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED #496

Closed
@liangyuRain

Description

Hi, the following code broke right after commit a3d8d68:

import cupy as cp
from mscclpp import TcpBootstrap, Communicator, Transport
from mscclpp._mscclpp import RawGpuBuffer
bootstrap = TcpBootstrap.create(0, 1)
comm = Communicator(bootstrap)

cp.cuda.Device(0).use()
memory = RawGpuBuffer(1024 * 1024)
data_ptr = memory.data()
comm.register_memory(data_ptr, 1024 * 1024, Transport.IB0)

Output:

Traceback (most recent call last):
  File "/root/bug.py", line 10, in <module>
    comm.register_memory(data_ptr, 1024 * 1024, Transport.IB0)
mscclpp._mscclpp.IbError: (14, 'ibv_reg_mr failed (errno 14) (Ib failure: Bad address)')
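
For anyone triaging this, the errno in the message can be mapped to its symbolic name with the Python standard library. The snippet below is only a small sketch: `decode_errno` is a hypothetical helper written for this report, not part of mscclpp, and it assumes the message keeps the `errno <number>` form shown in the traceback above. On Linux, errno 14 corresponds to `EFAULT`, which matches the "Bad address" text reported by `ibv_reg_mr`.

```python
import errno
import os
import re


def decode_errno(message: str):
    """Extract 'errno <N>' from an error message and look it up.

    Hypothetical helper for triage; not part of mscclpp.
    Returns (code, symbolic_name, description), or None if no errno is found.
    """
    match = re.search(r"errno (\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps numeric codes to names such as 'EFAULT'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the platform's human-readable description.
    return code, name, os.strerror(code)


if __name__ == "__main__":
    msg = "ibv_reg_mr failed (errno 14) (Ib failure: Bad address)"
    print(decode_errno(msg))
```

This only decodes the error code; it does not explain why the registration of the `RawGpuBuffer` address is rejected.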

Environment:

(base) root@xxx:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(base) root@xxx:~# nvidia-smi
Sun Apr  6 15:37:56 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
| N/A   29C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
...8xH100
