[Feature] Supporting cudaMemcpyBatchAsync for PortChannels #504


Open
cubele opened this issue Apr 16, 2025 · 1 comment

Comments

@cubele

cubele commented Apr 16, 2025

When issuing multiple sends to a PortChannel, the memcpy kernel launch overhead may lead to poor performance for small message sizes. For example, MemChannel is much faster than PortChannel for small-message allgather with a direct all-to-all algorithm, because PortChannel requires (n - 1) cudaMemcpyAsync calls from the proxy, each carrying its own launch overhead (even excluding the GPU-proxy communication latency). For cases like MoE dispatch, this overhead is even more substantial.

The cudaMemcpyBatchAsync API (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gc02716b3bd21f3d83640ab102bf089f9) allows a single launch to submit multiple memcpy jobs to the DMA engine, which should, in theory, reduce the overhead in such scenarios to a single launch. However, supporting this in mscclpp may require a new protocol for the GPU to submit a batch of copy operations to the proxy.
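For reference, a minimal sketch of what the batched call could look like on the proxy side. The signature is paraphrased from the linked CUDA docs (12.8+); field and enum names should be verified against your toolkit version, and `flushSends` is just a hypothetical helper name:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hedged sketch: replace (n - 1) individual cudaMemcpyAsync launches in the
// proxy with one cudaMemcpyBatchAsync call. Requires CUDA 12.8+.
cudaError_t flushSends(std::vector<void*>& dsts, std::vector<void*>& srcs,
                       std::vector<size_t>& sizes, cudaStream_t stream) {
  // One attribute set shared by all copies: attrsIdxs[0] == 0 means
  // attrs[0] applies from the first copy onward.
  cudaMemcpyAttributes attr = {};
  attr.srcAccessOrder = cudaMemcpySrcAccessOrderStream;  // ordinary stream ordering
  size_t attrsIdx = 0;
  size_t failIdx = 0;  // set to the index of the first failing copy, if any
  return cudaMemcpyBatchAsync(dsts.data(), srcs.data(), sizes.data(),
                              dsts.size(), &attr, &attrsIdx,
                              /*numAttrs=*/1, &failIdx, stream);
}
```

The open question above remains how the GPU would hand the (dsts, srcs, sizes) arrays to the proxy in the first place.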

Have you guys tested the performance improvement of using cudaMemcpyBatchAsync in such scenarios, and do you think it's a worthy optimization?

@chhwang
Contributor

chhwang commented Apr 16, 2025

One thing we offer right now is writing a custom proxy, so that it can do whatever you want upon triggers from the GPU. Example code is here: https://github.com/microsoft/mscclpp/blob/main/test/allgather_test_host_offloading.cu#L81
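To make the connection concrete, here is a rough sketch of how the batching idea could be layered on such a custom proxy, loosely modeled on the linked example. `mscclpp::Proxy`, `ProxyTrigger`, `ProxyHandler`, and `ProxyHandlerResult` come from mscclpp; the trigger encoding, the `decodeTrigger` helper, and the cudaMemcpyBatchAsync call are assumptions, not an existing mscclpp feature:

```cuda
#include <cuda_runtime.h>
#include <mscclpp/proxy.hpp>
#include <vector>

// Pending copies accumulated across triggers.
struct CopyBatch {
  std::vector<void*> dsts, srcs;
  std::vector<size_t> sizes;
};

// Hypothetical helper: a real implementation needs an agreed GPU<->proxy
// encoding of (dst, src, size, flush) inside the 128-bit ProxyTrigger.
// Returns true when the trigger asks the proxy to flush the batch.
bool decodeTrigger(mscclpp::ProxyTrigger trigger, CopyBatch& batch);

mscclpp::ProxyHandler makeBatchingHandler(CopyBatch& batch, cudaStream_t stream) {
  return [&batch, stream](mscclpp::ProxyTrigger trigger) {
    if (decodeTrigger(trigger, batch) && !batch.dsts.empty()) {
      cudaMemcpyAttributes attr = {};
      attr.srcAccessOrder = cudaMemcpySrcAccessOrderStream;
      size_t attrsIdx = 0, failIdx = 0;
      // One call submits the whole batch instead of batch.dsts.size()
      // individual cudaMemcpyAsync launches.
      cudaMemcpyBatchAsync(batch.dsts.data(), batch.srcs.data(), batch.sizes.data(),
                           batch.dsts.size(), &attr, &attrsIdx, 1, &failIdx, stream);
      batch.dsts.clear();
      batch.srcs.clear();
      batch.sizes.clear();
    }
    return mscclpp::ProxyHandlerResult::Continue;
  };
}
```

This keeps the batching entirely on the host side, so no new GPU-side protocol is needed beyond the trigger encoding.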

Any discussion is welcome if you find this feature insufficient or difficult to use.
