[TransferEngine] Enable Huawei Ascend Transport for TransferEngine #502
Ascend Transport
The source code path for Ascend Transport is Mooncake/mooncake-transfer-engine/src/transport/ascend_transport, which also includes automated build scripts and the README file.
Overview
Ascend Transport is a high-performance zero-copy NPU data transfer library with one-sided semantics, directly compatible with Mooncake Transfer Engine. To compile and use the Ascend Transport library, please set the USE_ASCEND flag to "ON" in the mooncake-common/common.cmake file.
Ascend Transport supports inter-NPU data transfer using one-sided semantics (currently supports Device to Device; other modes are under development). Users only need to specify the node and memory information at both ends through the Mooncake Transfer Engine interface to achieve high-performance point-to-point data transfer. Ascend Transport abstracts away internal complexities and automatically handles operations such as establishing connections, registering and exchanging memory, and checking transfer status.
New Dependencies
In addition to the dependencies already required by Mooncake, Ascend Transport needs some HCCL-related dependencies:
MPI
New Version Ascend-cann-toolkit
Download Ascend-cann-toolkit_8.2.RC1.alpha002_linux-x86_64.run from the Ascend community (choose the aarch64 version if using ARM architecture), enter the directory containing the file, and execute:
./Ascend-cann-toolkit_8.2.RC1.alpha002_linux-x86_64.run --install --force
source /usr/local/Ascend/ascend-toolkit/set_env.sh
One-Step Build Script
Ascend Transport provides a one-step build script located at Mooncake/mooncake-transfer-engine/src/transport/ascend_transport/scripts/build_all_with_dependencies.sh. Copy this script to the desired installation directory and run it. You can also pass an installation path as an argument; if not provided, it defaults to the current directory:
This script also supports environments where users cannot perform git clone directly. Users can place the source code for dependencies and Mooncake in the target directory, and the script will handle the compilation accordingly.
One-Step Installation Script (Without Building Mooncake)
To avoid potential conflicts when running other processes during Mooncake compilation, Ascend Transport offers a solution that separates the build and runtime environments.
After completing the Mooncake build via build_all_with_dependencies.sh, you can run setup_basic_dependencies.sh to install only the required dependencies. Place the generated Mooncake .whl package and libascend_transport_mem.so into the installation directory.
Copy the script to the installation directory and run:
Before use, ensure that libascend_transport_mem.so has been copied to /usr/local/Ascend/ascend-toolkit/latest/python/site-packages, then execute:
Build Instructions
Once all dependencies are installed successfully, you can proceed with building Mooncake normally. If errors occur, try setting the following environment variable:
Endpoint Management
Each Huawei NPU card has a dedicated parameter-plane NIC and should be managed by a single TransferEngine instance responsible for all its data transfers.
Ranktable Management
Ascend Transport does not rely on global Ranktable information. It only needs to obtain the local Ranktable information of the current NPU card. During the initialization of Ascend Transport, it will automatically parse the /etc/hccn.conf file to acquire this information.
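As a rough illustration of what reading local NIC information from such a file can look like, the Python sketch below groups key_<index>=value entries by device index. The entry layout (for example address_0=...), the function name parse_hccn_conf, and the sample addresses are all assumptions made for this example. This is not the parser Ascend Transport uses internally.

```python
from typing import Dict


def parse_hccn_conf(text: str) -> Dict[int, Dict[str, str]]:
    """Group `name_<index>=value` lines by their numeric index.

    Assumption: each relevant line looks like `address_0=192.168.100.1`.
    Lines that do not match this shape are skipped.
    """
    devices: Dict[int, Dict[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        name, sep, index = key.strip().rpartition("_")
        # Require a trailing `_<digits>` suffix on the key.
        if not sep or not name or not (index.isascii() and index.isdigit()):
            continue
        devices.setdefault(int(index), {})[name] = value.strip()
    return devices


# Made-up sample content, for illustration only.
SAMPLE = """\
address_0=192.168.100.1
netmask_0=255.255.255.0
address_1=192.168.100.2
"""

local_info = parse_hccn_conf(SAMPLE)
print(local_info[0]["address"])
```

On a real host, the same function could be fed the file contents, e.g. parse_hccn_conf(open("/etc/hccn.conf").read()), provided the file follows the assumed layout.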
Initialization
When using Ascend Transport, the TransferEngine must still call the init function after construction:
The only difference is that the local_server_name parameter must now include the physical NPU card ID. The format changes from ip:port to ip:port:npu_x, e.g., "0.0.0.0:12345:npu_2".
Metadata Service
Ascend Transport is compatible with all metadata services currently supported by Mooncake, including etcd, redis, http, and p2phandshake. Upon initialization, Ascend Transport registers key NPU card information such as device_id, device_ip, rank_id, and server_ip.
Data Transfer
Ascend Transport supports write/read semantics and automatically determines whether cross-HCCS communication is needed, selecting either HCCS or ROCE as the underlying transport protocol. Users can use the standard Mooncake getTransferStatus API to monitor the progress of each transfer request.
Fault Handling
Building upon HCCL's built-in fault handling mechanisms, Ascend Transport implements comprehensive error recovery strategies across multiple stages, including initialization, connection setup, and data transfer. It incorporates retry logic and returns precise error codes based on HCCL collective communication standards when retries fail. For detailed logs, refer to /root/Ascend/log/plog.
Test Cases
Ascend Transport provides two test files:
mooncake-transfer-engine/example/transfer_engine_ascend_one_sided.cpp
mooncake-transfer-engine/example/transfer_engine_ascend_perf.cpp
You can configure various scenarios (e.g., 1-to-1, 1-to-2, 2-to-1) and performance tests by passing valid parameters to these programs.
Example Commands for Scenario Testing
Start Initiator Node:
Start Target Node:
Example Commands for Performance Testing
Start Initiator Node:
Start Target Node:
Print Description
To log whether each transfer request is cross-HCCS and how long it takes to execute, enable the related logging with the following environment variable:
export ASCEND_TRANSPORT_PRINT=1
Notes
ascend_transport establishes a TCP connection on the host side. This connection uses port (10000 + deviceId). Avoid using this port for other services to prevent conflicts.
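The port rule above can be combined with the ip:port:npu_x naming from the Initialization section to catch an accidental clash before startup. The following Python sketch is only an illustration: the helper names (ascend_host_port, parse_local_server_name, conflicts_with_ascend_ports) are hypothetical and not part of Mooncake, the parser only handles IPv4 addresses or hostnames, and the set of local device IDs is something the caller has to supply.

```python
import re
from typing import Iterable, Tuple

ASCEND_PORT_BASE = 10000  # from the note above: port = 10000 + deviceId

# Matches "ip:port:npu_x", e.g. "0.0.0.0:12345:npu_2" (no IPv6 support).
_NAME_RE = re.compile(r"(?P<host>[^:\s]+):(?P<port>\d+):npu_(?P<npu>\d+)")


def ascend_host_port(device_id: int) -> int:
    """Host-side TCP port used for a given device ID."""
    if device_id < 0:
        raise ValueError("device_id must be non-negative")
    return ASCEND_PORT_BASE + device_id


def parse_local_server_name(name: str) -> Tuple[str, int, int]:
    """Split "ip:port:npu_x" into (host, port, npu_id)."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"expected ip:port:npu_x, got {name!r}")
    port = int(match.group("port"))
    if port > 65535:
        raise ValueError(f"port out of range: {port}")
    return match.group("host"), port, int(match.group("npu"))


def conflicts_with_ascend_ports(name: str, device_ids: Iterable[int]) -> bool:
    """True if the port in `name` equals 10000 + deviceId for any given device."""
    _, port, _ = parse_local_server_name(name)
    return port in {ascend_host_port(d) for d in device_ids}


# Example: a host assumed to expose device IDs 0..7.
print(conflicts_with_ascend_ports("0.0.0.0:12345:npu_2", range(8)))
print(conflicts_with_ascend_ports("0.0.0.0:10002:npu_2", range(8)))
```

The first call uses a port outside the 10000 to 10007 range, so it reports no conflict; the second picks 10002, which collides with the port derived for device ID 2.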