-
Notifications
You must be signed in to change notification settings - Fork 74
Add ROCm support to plugin #1014
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
driverVersion, | ||
runtimeVersion); | ||
|
||
NCCL_OFI_WARN("HIP flush disabled"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent / seems leftover.
|
||
int nccl_net_ofi_gpu_flush_gpudirect_rdma_writes(void) | ||
{ | ||
return -EPERM; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent here and below.
bf9937d
to
1660ccb
Compare
This work is an attempt to rebase aws#461 on the latest master branch. I've tested functionality with CXI provider on Slingshot 11 Signed-off-by: Thomas Huber <[email protected]> Signed-off-by: Ryan Hankins <[email protected]>
1660ccb
to
58316fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're missing handling flush() on ROCm platforms, but I'm not sure I know exactly what the semantics are there.
libnccl_net_la_LDFLAGS = -module -avoid-version | ||
|
||
if HAVE_ROCM | ||
lib_LTLIBRARIES += librccl-net.la |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the last couple RCCL releases only look for librccl-net.so or librccl-net-ofi.so. So I'm not sure we should be building libnccl-net.so in the rocm case. Have an email out to @thomas-huber about this, but wanted to leave a note so we didn't forget.
Only supporting librccl-net.so / librccl-net-ofi.so on ROCm would certainly make the distribution story better.
DECLARE_CUDA_FUNCTION(cuMemcpy, 4000); | ||
|
||
int nccl_net_ofi_cuda_init(void) | ||
int nccl_net_ofi_gpu_init(void) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a huge fan of renaming some, but not all, of the functions. We should figure out a plan and keep one or the other pattern.
#if HAVE_GPU | ||
case NCCL_PTR_CUDA: | ||
mr_attr->iface = FI_HMEM_CUDA; | ||
mr_attr->iface = HAVE_CUDA ? FI_HMEM_CUDA : FI_HMEM_ROCR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
patterns like this have really screwed us in the past. I'd prefer something like:
#if HAVE_CUDA
mr_attr->iface = FI_HMEM_CUDA;
#elif HAVE_ROCM
mr_attr->iface = FI_HMEM_ROCM;
#else
#error "Invalid platform type"
#endif
int driverVersion = -1; | ||
int runtimeVersion = -1; | ||
|
||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't tend to use this create new scope pattern Is there a reason you have to here?
This work is an attempt to rebase #461 on the latest master branch.
I've tested basic functionality with rccl-tests using CXI provider on Slingshot 11
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.