When issuing multiple sends through a PortChannel, the per-copy launch overhead of Memcpy can lead to poor performance for small message sizes. For example, MemChannel is much faster than PortChannel for a small-message allgather with a direct all-to-all algorithm, because the PortChannel path requires the proxy to issue (n - 1) separate MemcpyAsync calls, each paying its own launch overhead (even excluding the GPU-proxy communication latency). For workloads like MoE dispatch, this overhead is much more substantial.
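For concreteness, here is a minimal sketch of the per-peer copy pattern described above; the buffer and stream names (`dstPtrs`, `srcPtrs`, `msgBytes`, `proxyStream`) are placeholders for illustration, not actual mscclpp identifiers:

```cpp
// Hypothetical illustration of the current per-peer copy pattern on the proxy
// side: each peer gets its own cudaMemcpyAsync, so each copy pays its own
// launch overhead. All names are placeholders.
#include <cuda_runtime.h>

void issuePerPeerCopies(void** dstPtrs, void** srcPtrs, size_t msgBytes,
                        int nPeers, cudaStream_t proxyStream) {
  for (int p = 0; p < nPeers; ++p) {
    // (n - 1) independent submissions; launch overhead scales with nPeers.
    cudaMemcpyAsync(dstPtrs[p], srcPtrs[p], msgBytes,
                    cudaMemcpyDeviceToDevice, proxyStream);
  }
}
```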
The cudaMemcpyBatchAsync API (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gc02716b3bd21f3d83640ab102bf089f9) allows a single call to submit multiple memcpy jobs to the DMA engine, which should in theory reduce the overhead in such scenarios to a single launch. However, supporting this in mscclpp may require a new protocol for the GPU to submit a batch of copy operations to the proxy.
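As a rough sketch of how the batched submission could look, based on my reading of the linked documentation (arrays of destinations, sources, and sizes plus copy attributes); the exact signature and the cudaMemcpyAttributes fields should be verified against the CUDA 12.8+ docs, and the variable names are the same placeholders as above:

```cpp
// Sketch of batching the same copies into a single submission with
// cudaMemcpyBatchAsync (CUDA 12.8+). The signature and attribute fields below
// follow the linked docs as I understand them and should be double-checked.
#include <cuda_runtime.h>
#include <vector>

void issueBatchedCopies(void** dstPtrs, void** srcPtrs, size_t msgBytes,
                        int nPeers, cudaStream_t proxyStream) {
  std::vector<size_t> sizes(nPeers, msgBytes);

  // One attribute set shared by all copies in the batch.
  cudaMemcpyAttributes attr{};
  attr.srcAccessOrder = cudaMemcpySrcAccessOrderStream;
  size_t attrIdx = 0;   // attribute set 0 applies starting from copy 0
  size_t failIdx = 0;   // index of the first failing copy, if any

  // One call submits all copies, so the launch overhead is paid once.
  cudaMemcpyBatchAsync(dstPtrs, srcPtrs, sizes.data(), nPeers,
                       &attr, &attrIdx, /*numAttrs=*/1, &failIdx, proxyStream);
}
```

In the allgather example above, this would collapse the (n - 1) MemcpyAsync submissions into one call, which is where the expected saving comes from.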
Have you tested the performance improvement of using cudaMemcpyBatchAsync in such scenarios, and do you think it is a worthwhile optimization?