When issuing multiple sends through a PortChannel, the per-copy launch overhead of Memcpy can lead to poor performance for small message sizes. For example, MemChannel is much faster than PortChannel for a small-message allgather with a direct all-to-all algorithm, because the PortChannel path requires the proxy to issue (n - 1) separate MemcpyAsync calls, each paying its own launch overhead (even excluding the GPU-proxy communication latency). For workloads like MoE dispatch, this overhead is much more substantial.
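For concreteness, here is a minimal sketch of the per-peer copy pattern described above; the buffer and stream names (`dstPtrs`, `srcPtrs`, `msgBytes`, `proxyStream`) are placeholders for illustration, not actual mscclpp identifiers:

```cpp
// Hypothetical illustration of the current per-peer copy pattern on the proxy
// side: each peer gets its own cudaMemcpyAsync, so each copy pays its own
// launch overhead. All names are placeholders.
#include <cuda_runtime.h>

void issuePerPeerCopies(void** dstPtrs, void** srcPtrs, size_t msgBytes,
                        int nPeers, cudaStream_t proxyStream) {
  for (int p = 0; p < nPeers; ++p) {
    // (n - 1) independent submissions; launch overhead scales with nPeers.
    cudaMemcpyAsync(dstPtrs[p], srcPtrs[p], msgBytes,
                    cudaMemcpyDeviceToDevice, proxyStream);
  }
}
```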
The cudaMemcpyBatchAsync API (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gc02716b3bd21f3d83640ab102bf089f9) allows a single call to submit multiple memcpy jobs to the DMA engine, which should in theory reduce the overhead in such scenarios to a single launch. However, supporting this in mscclpp may require a new protocol for the GPU to submit a batch of copy operations to the proxy.
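As a rough sketch of how the batched submission could look, based on my reading of the linked documentation (arrays of destinations, sources, and sizes plus copy attributes); the exact signature and the cudaMemcpyAttributes fields should be verified against the CUDA 12.8+ docs, and the variable names are the same placeholders as above:

```cpp
// Sketch of batching the same copies into a single submission with
// cudaMemcpyBatchAsync (CUDA 12.8+). The signature and attribute fields below
// follow the linked docs as I understand them and should be double-checked.
#include <cuda_runtime.h>
#include <vector>

void issueBatchedCopies(void** dstPtrs, void** srcPtrs, size_t msgBytes,
                        int nPeers, cudaStream_t proxyStream) {
  std::vector<size_t> sizes(nPeers, msgBytes);

  // One attribute set shared by all copies in the batch.
  cudaMemcpyAttributes attr{};
  attr.srcAccessOrder = cudaMemcpySrcAccessOrderStream;
  size_t attrIdx = 0;   // attribute set 0 applies starting from copy 0
  size_t failIdx = 0;   // index of the first failing copy, if any

  // One call submits all copies, so the launch overhead is paid once.
  cudaMemcpyBatchAsync(dstPtrs, srcPtrs, sizes.data(), nPeers,
                       &attr, &attrIdx, /*numAttrs=*/1, &failIdx, proxyStream);
}
```

In the allgather example above, this would collapse the (n - 1) MemcpyAsync submissions into one call, which is where the expected saving comes from.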
Have you tested the performance improvement of using cudaMemcpyBatchAsync in such scenarios, and do you think it is a worthwhile optimization?