[TransferEngine] Enable Huawei Ascend Transport for TransferEngine #502

Open
wants to merge 1 commit into
base: main

@AscendTransport AscendTransport commented Jun 16, 2025

Ascend Transport

The source code path for Ascend Transport is Mooncake/mooncake-transfer-engine/src/transport/ascend_transport, which also includes automated build scripts and the README file.

Overview

Ascend Transport is a high-performance zero-copy NPU data transfer library with one-sided semantics, directly compatible with Mooncake Transfer Engine. To compile and use the Ascend Transport library, please set the USE_ASCEND flag to "ON" in the mooncake-common/common.cmake file.

Ascend Transport supports inter-NPU data transfer using one-sided semantics (currently supports Device to Device; other modes are under development). Users only need to specify the node and memory information at both ends through the Mooncake Transfer Engine interface to achieve high-performance point-to-point data transfer. Ascend Transport abstracts away internal complexities and automatically handles operations such as establishing connections, registering and exchanging memory, and checking transfer status.

New Dependencies

In addition to the dependencies already required by Mooncake, Ascend Transport needs some HCCL-related dependencies:

MPI

yum install -y mpich mpich-devel

New Version Ascend-cann-toolkit

rm -rf /etc/Ascend/ascend_cann_install.info

Download Ascend-cann-toolkit_8.2.RC1.alpha002_linux-x86_64.run from the Ascend community (choose the aarch64 version if using ARM architecture), enter the directory containing the file, and execute:

./Ascend-cann-toolkit_8.2.RC1.alpha002_linux-x86_64.run --install --force
source /usr/local/Ascend/ascend-toolkit/set_env.sh

One-Step Build Script

Ascend Transport provides a one-step build script located at Mooncake/mooncake-transfer-engine/src/transport/ascend_transport/scripts/build_all_with_dependencies.sh. Copy this script to the desired installation directory and run it. You can also pass an installation path as an argument; if not provided, it defaults to the current directory:

./build_all_with_dependencies.sh /path/to/install_directory

This script also supports environments where users cannot perform git clone directly. Users can place the source code for dependencies and Mooncake in the target directory, and the script will handle the compilation accordingly.

One-Step Installation Script (Without Building Mooncake)

To avoid potential conflicts when running other processes during Mooncake compilation, Ascend Transport offers a solution that separates the build and runtime environments.

After completing the Mooncake build via build_all_with_dependencies.sh, you can run setup_basic_dependencies.sh to install only the required dependencies. Place the generated Mooncake .whl package and libascend_transport_mem.so into the installation directory.

Copy the script to the installation directory and run:

./setup_basic_dependencies.sh /path/to/install_directory

Before use, ensure that libascend_transport_mem.so has been copied to /usr/local/Ascend/ascend-toolkit/latest/python/site-packages, then execute:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

Build Instructions

Once all dependencies are installed, build Mooncake as usual. If the build fails, try removing the /usr/local/Ascend entries from CPLUS_INCLUDE_PATH:

export CPLUS_INCLUDE_PATH=$(echo $CPLUS_INCLUDE_PATH | tr ':' '\n' | grep -v "/usr/local/Ascend" | paste -sd: -)
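
For readers less familiar with the shell pipeline above, the sketch below shows roughly the same filtering step in C++: split a colon-separated list, drop every entry that contains a given substring, and join the rest. The function name filterPathList is hypothetical and is not part of Mooncake or Ascend Transport; it is here only to make the effect of the one-liner explicit.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of Mooncake): drop every entry that contains
// `needle` from a colon-separated path list, roughly what
// `tr ':' '\n' | grep -v needle | paste -sd: -` does in the shell.
std::string filterPathList(const std::string& path_list,
                           const std::string& needle) {
    std::istringstream in(path_list);
    std::string entry;
    std::string result;
    bool first = true;
    while (std::getline(in, entry, ':')) {
        // Skip entries that mention the unwanted prefix anywhere in the path.
        if (entry.find(needle) != std::string::npos) continue;
        if (!first) result += ':';
        result += entry;
        first = false;
    }
    return result;
}
```

Calling filterPathList with the current CPLUS_INCLUDE_PATH value and "/usr/local/Ascend" would return the list without the Ascend toolkit include directories.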

Endpoint Management

Each Huawei NPU card has a dedicated parameter-plane NIC and should be managed by a single TransferEngine instance responsible for all its data transfers.

Ranktable Management

Ascend Transport does not rely on global Ranktable information. It only needs to obtain the local Ranktable information of the current NPU card. During the initialization of Ascend Transport, it will automatically parse the /etc/hccn.conf file to acquire this information.
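
As a rough illustration of the parsing step, the sketch below reads a configuration stream into a key/value map. It assumes, without confirmation from this description, that /etc/hccn.conf consists of key=value lines; the function name parseKeyValueLines and the handling of '#' lines are likewise assumptions, and this is not the parser used by Ascend Transport.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical sketch: collect `key=value` lines from a stream into a map.
// ASSUMPTION: the file is made of key=value lines; lines without '=', with an
// empty key, or starting with '#' are skipped.
std::map<std::string, std::string> parseKeyValueLines(std::istream& in) {
    std::map<std::string, std::string> entries;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows-style line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty() || line[0] == '#') continue;
        const std::string::size_type eq = line.find('=');
        if (eq == std::string::npos || eq == 0) continue;
        entries[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return entries;
}
```

In practice the stream would be a std::ifstream opened on /etc/hccn.conf; the key names present in the file should be checked on the target machine rather than taken from this sketch.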

Initialization

When using Ascend Transport, the TransferEngine must still call the init function after construction:

TransferEngine();

int TransferEngine::init(const std::string &metadata_conn_string,
                         const std::string &local_server_name,
                         const std::string &ip_or_host_name,
                         uint64_t rpc_port)

The only difference is that the local_server_name parameter must now include the physical NPU card ID. The format changes from ip:port to ip:port:npu_x, e.g., "0.0.0.0:12345:npu_2".

Note: This extension of the local_server_name is used internally by Ascend Transport without modifying Mooncake's external API. The segment_desc_name in metadata remains in the original format (ip:port). Therefore, each NPU card must use a unique port that is not occupied.
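
The naming rule above can be sketched as a small parser that splits an extended name such as "0.0.0.0:12345:npu_2" into host, port, and NPU ID, and rebuilds the ip:port form kept in metadata. The struct AscendServerName and the function parseAscendServerName are hypothetical names chosen for illustration; the actual parsing inside Ascend Transport may differ.

```cpp
#include <string>

// Hypothetical illustration types, not part of the Mooncake API.
struct AscendServerName {
    std::string host;
    int port = 0;
    int npu_id = 0;
    // The ip:port form that stays in metadata as segment_desc_name.
    std::string segmentName() const { return host + ":" + std::to_string(port); }
};

// Parse a short, non-negative decimal string; returns false on anything else.
static bool parseDecimal(const std::string& s, int* out) {
    if (s.empty() || s.size() > 6) return false;
    for (char c : s) {
        if (c < '0' || c > '9') return false;
    }
    *out = std::stoi(s);
    return true;
}

// Split "ip:port:npu_x" into its parts. Returns false if the name does not
// follow that format (for example, a plain "ip:port").
bool parseAscendServerName(const std::string& name, AscendServerName* out) {
    const std::string tag = ":npu_";
    const std::string::size_type tag_pos = name.rfind(tag);
    if (tag_pos == std::string::npos || tag_pos == 0) return false;
    const std::string::size_type port_pos = name.rfind(':', tag_pos - 1);
    if (port_pos == std::string::npos || port_pos == 0) return false;

    AscendServerName parsed;
    parsed.host = name.substr(0, port_pos);
    if (!parseDecimal(name.substr(port_pos + 1, tag_pos - port_pos - 1), &parsed.port)) return false;
    if (parsed.port > 65535) return false;
    if (!parseDecimal(name.substr(tag_pos + tag.size()), &parsed.npu_id)) return false;
    *out = parsed;
    return true;
}
```

Because segmentName() drops the npu_x suffix, two NPU cards on the same host would collide in metadata if they shared a port, which is why each card needs its own.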

Metadata Service

Ascend Transport is compatible with all metadata services currently supported by Mooncake, including etcd, redis, http, and p2phandshake. Upon initialization, Ascend Transport registers key NPU card information such as device_id, device_ip, rank_id, and server_ip.

Data Transfer

Ascend Transport supports write/read semantics and automatically determines whether cross-HCCS communication is needed, selecting either HCCS or RoCE as the underlying transport protocol. Users can use the standard Mooncake getTransferStatus API to monitor the progress of each transfer request.
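
Monitoring a request usually comes down to a bounded polling loop. The sketch below shows one way to structure it. Since the signature of getTransferStatus is not given in this description, the status query is hidden behind a callback; SketchStatus and waitForTransfer are hypothetical stand-ins, not Mooncake types or functions.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Stand-in status type for this sketch only (NOT Mooncake's status enum).
enum class SketchStatus { Pending, Completed, Failed };

// Poll until the request leaves the Pending state or the poll budget runs out.
// `poll_once` is expected to wrap the real status query for one request.
SketchStatus waitForTransfer(const std::function<SketchStatus()>& poll_once,
                             int max_polls,
                             std::chrono::milliseconds interval) {
    for (int i = 0; i < max_polls; ++i) {
        const SketchStatus status = poll_once();
        if (status != SketchStatus::Pending) return status;
        std::this_thread::sleep_for(interval);
    }
    // Budget exhausted: report that the request is still pending.
    return SketchStatus::Pending;
}
```

A caller would pass a lambda that queries the status of one transfer request and maps the result onto SketchStatus; bounding the number of polls keeps a stalled transfer from blocking the caller indefinitely.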

Fault Handling

Building on HCCL's built-in fault handling, Ascend Transport adds error recovery at multiple stages, including initialization, connection setup, and data transfer. It retries failed operations and, when the retries fail, returns precise error codes based on HCCL collective communication standards. For detailed logs, refer to /root/Ascend/log/plog.

Test Cases

Ascend Transport provides two test files:

  • Multi-scenario test: mooncake-transfer-engine/example/transfer_engine_ascend_one_sided.cpp
  • Performance test: mooncake-transfer-engine/example/transfer_engine_ascend_perf.cpp

You can configure various scenarios (e.g., 1-to-1, 1-to-2, 2-to-1) and performance tests by passing valid parameters to these programs.

Example Commands for Scenario Testing

Start Initiator Node:

./transfer_engine_ascend_one_sided --metadata_server=P2PHANDSHAKE --local_server_name=10.0.0.0:12345 --protocol=hccl --operation=write --segment_id=10.0.0.0:12346 --device_id=0 --mode=initiator --block_size=8388608

Start Target Node:

./transfer_engine_ascend_one_sided --metadata_server=P2PHANDSHAKE --local_server_name=10.0.0.0:12346 --protocol=hccl --operation=write --device_id=1 --mode=target --block_size=8388608

Example Commands for Performance Testing

Start Initiator Node:

./transfer_engine_ascend_perf --metadata_server=P2PHANDSHAKE --local_server_name=10.0.0.0:12345 --protocol=hccl --operation=write --segment_id=10.0.0.0:12346 --device_id=0 --mode=initiator --block_size=8388608

Start Target Node:

./transfer_engine_ascend_perf --metadata_server=P2PHANDSHAKE --local_server_name=10.0.0.0:12346 --protocol=hccl --operation=write --device_id=1 --mode=target

Print Description

To see whether each transfer request is cross-HCCS and how long it took to execute, enable the related logs by setting the following environment variable:

export ASCEND_TRANSPORT_PRINT=1

Notes

ascend_transport establishes a TCP connection on the host side. This connection uses port (10000 + deviceId). Please avoid using this port for other services to prevent conflicts.
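
The port rule in this note can be written down as a small check that other services, or the rpc_port chosen for TransferEngine, can be validated against. The names kHostPortBase, hostTcpPort, and isReservedHostPort are hypothetical; only the 10000 + deviceId formula comes from the note itself.

```cpp
// Base of the host-side TCP port range described in the note above.
constexpr int kHostPortBase = 10000;

// Host-side TCP port for a given device ID, or -1 if the ID is out of range.
int hostTcpPort(int device_id) {
    if (device_id < 0 || device_id > 65535 - kHostPortBase) return -1;
    return kHostPortBase + device_id;
}

// True if `port` falls in the range used by devices 0 .. device_count - 1.
bool isReservedHostPort(int port, int device_count) {
    return device_count > 0 && port >= kHostPortBase &&
           port < kHostPortBase + device_count;
}
```

For example, on a machine with 8 NPU cards this check would flag ports 10000 through 10007, while the 12345 and 12346 ports used in the example commands would pass.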

@stmatengss
Collaborator

Thanks for your contributions! Here's a proposal: could you abstract out XCCL transport for all collective communication libraries (like HCCL, ACCL, and NCCL)? If that's too difficult or not possible, please correct me.

@espace1

espace1 commented Jun 17, 2025

[修改.txt](url)

@AscendTransport AscendTransport requested a review from alogfans June 18, 2025 02:01
@AscendTransport AscendTransport force-pushed the main branch 15 times, most recently from 0536c3c to 9a5b035 Compare June 19, 2025 11:51
@AscendTransport
Author

finish

@doujiang24
Collaborator

@AscendTransport Cool stuff! Is it possible to build the pip package in the CI pipeline?

@AscendTransport AscendTransport force-pushed the main branch 5 times, most recently from da09b7b to ad2d642 Compare June 23, 2025 02:07