Skip to content

Conversation

thomas-huber
Copy link

This work is an attempt to rebase #461 on the latest master branch.

I've tested basic functionality with rccl-tests using CXI provider on Slingshot 11

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@thomas-huber thomas-huber requested review from bwbarrett and a team as code owners October 7, 2025 04:22
driverVersion,
runtimeVersion);

NCCL_OFI_WARN("HIP flush disabled");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent / seems leftover.


int nccl_net_ofi_gpu_flush_gpudirect_rdma_writes(void)
{
return -EPERM;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent here and below.

@bwbarrett bwbarrett force-pushed the thomas/rccl-support branch 2 times, most recently from bf9937d to 1660ccb Compare October 8, 2025 16:34
This work is an attempt to rebase aws#461
on the latest master branch.

I've tested functionality with CXI provider on Slingshot 11

Signed-off-by: Thomas Huber <[email protected]>
Signed-off-by: Ryan Hankins <[email protected]>
@bwbarrett bwbarrett force-pushed the thomas/rccl-support branch from 1660ccb to 58316fe Compare October 8, 2025 16:47
Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we're missing handling flush() on ROCm platforms, but I'm not sure I know exactly what the semantics are there.

libnccl_net_la_LDFLAGS = -module -avoid-version

if HAVE_ROCM
lib_LTLIBRARIES += librccl-net.la
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the last couple RCCL releases only look for librccl-net.so or librccl-net-ofi.so. So I'm not sure we should be building libnccl-net.so in the rocm case. Have an email out to @thomas-huber about this, but wanted to leave a note so we didn't forget.

Only supporting librccl-net.so / librccl-net-ofi.so on ROCm would certainly make the distribution story better.

DECLARE_CUDA_FUNCTION(cuMemcpy, 4000);

int nccl_net_ofi_cuda_init(void)
int nccl_net_ofi_gpu_init(void)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of renaming some, but not all, of the functions. We should figure out a plan and keep one or the other pattern.

#if HAVE_GPU
case NCCL_PTR_CUDA:
mr_attr->iface = FI_HMEM_CUDA;
mr_attr->iface = HAVE_CUDA ? FI_HMEM_CUDA : FI_HMEM_ROCR;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

patterns like this have really screwed us in the past. I'd prefer something like:

#if HAVE_CUDA
  mr_attr->iface = FI_HMEM_CUDA;
#elif HAVE_ROCM
  mr_attr->iface = FI_HMEM_ROCM;
#else
  #error "Invalid platform type"
#endif

int driverVersion = -1;
int runtimeVersion = -1;

{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't tend to use this create new scope pattern Is there a reason you have to here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants