| 1 | +# Ascend Transport |
| 2 | + |
| 3 | +The source code path for Ascend Transport is `Mooncake/mooncake-transfer-engine/src/transport/ascend_transport`, which also includes automated build scripts and the README file. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +Ascend Transport is a high-performance zero-copy NPU data transfer library with one-sided semantics, directly compatible with Mooncake Transfer Engine. To compile and use the Ascend Transport library, please set the `USE_ASCEND` flag to `"ON"` in the `mooncake-common/common.cmake` file. |
| 8 | + |
| 9 | +Ascend Transport supports inter-NPU data transfer using one-sided semantics (currently supports Device to Device; other modes are under development). Users only need to specify the node and memory information at both ends through the Mooncake Transfer Engine interface to achieve high-performance point-to-point data transfer. Ascend Transport abstracts away internal complexities and automatically handles operations such as establishing connections, registering and exchanging memory, and checking transfer status. |
| 10 | + |
| 11 | +### New Dependencies |
| 12 | + |
| 13 | +In addition to the dependencies already required by Mooncake, Ascend Transport needs some HCCL-related dependencies: |
| 14 | + |
| 15 | +**MPI** |
| 16 | +```bash |
| 17 | +yum install -y mpich mpich-devel |
| 18 | +``` |
| 19 | + |
| 20 | +**New Version Ascend-cann-toolkit** |
| 21 | + |
| 22 | +```bash |
| 23 | +rm -rf /etc/Ascend/ascend_cann_install.info |
| 24 | +``` |
| 25 | + |
| 26 | +Download `Ascend-cann-toolkit_8.2.RC1.alpha002_linux-x86_64.run` from the Ascend community (choose the `aarch64` version if using ARM architecture), enter the directory containing the file, and execute: |
| 27 | +```bash |
| 28 | +./Ascend-cann-toolkit_8.2.RC1.alpha002_linux-x86_64.run --install --force |
| 29 | +source /usr/local/Ascend/ascend-toolkit/set_env.sh |
| 30 | +``` |
| 31 | + |
| 32 | +### One-Step Build Script |
| 33 | + |
| 34 | +Ascend Transport provides a one-step build script located at `Mooncake/mooncake-transfer-engine/src/transport/ascend_transport/scripts/build_all_with_dependencies.sh`. Copy this script to the desired installation directory and run it. You can also pass an installation path as an argument; if not provided, it defaults to the current directory: |
| 35 | + |
| 36 | +```bash |
| 37 | +./build_all_with_dependencies.sh /path/to/install_directory |
| 38 | +``` |
| 39 | + |
| 40 | +This script also supports environments where users cannot perform `git clone` directly. Users can place the source code for dependencies and Mooncake in the target directory, and the script will handle the compilation accordingly. |
| 41 | + |
| 42 | +### One-Step Installation Script (Without Building Mooncake) |
| 43 | + |
| 44 | +To avoid potential conflicts when running other processes during Mooncake compilation, Ascend Transport offers a solution that separates the build and runtime environments. |
| 45 | + |
| 46 | +After completing the Mooncake build via `build_all_with_dependencies.sh`, you can run `setup_basic_dependencies.sh` to install only the required dependencies. Place the generated Mooncake `.whl` package and `libascend_transport_mem.so` into the installation directory. |
| 47 | + |
| 48 | +Copy the script to the installation directory and run: |
| 49 | +```bash |
| 50 | +./setup_basic_dependencies.sh /path/to/install_directory |
| 51 | +``` |
| 52 | + |
| 53 | +Before use, ensure that `libascend_transport_mem.so` has been copied to `/usr/local/Ascend/ascend-toolkit/latest/python/site-packages`, then execute: |
| 54 | +```bash |
| 55 | +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH |
| 56 | +``` |
| 57 | + |
| 58 | +### Build Instructions |
| 59 | + |
| 60 | +Once all dependencies are installed successfully, you can proceed with building Mooncake normally. If the build fails, try removing the `/usr/local/Ascend` entries from `CPLUS_INCLUDE_PATH`: |
| 61 | +```bash |
| 62 | +export CPLUS_INCLUDE_PATH=$(echo $CPLUS_INCLUDE_PATH | tr ':' '\n' | grep -v "/usr/local/Ascend" | paste -sd: -) |
| 63 | +``` |
| 64 | + |
| 65 | +### Endpoint Management |
| 66 | + |
| 67 | +Each Huawei NPU card has a dedicated parameter-plane NIC and should be managed by a single `TransferEngine` instance responsible for all its data transfers. |
| 68 | + |
| 69 | +### Ranktable Management |
| 70 | +Ascend Transport does not rely on global Ranktable information. It only needs the local Ranktable information of the current NPU card, which it obtains automatically during initialization by parsing the `/etc/hccn.conf` file. |
| 71 | + |
| 72 | +### Initialization |
| 73 | + |
| 74 | +When using Ascend Transport, the `TransferEngine` must still call the `init` function after construction: |
| 75 | + |
| 76 | +```cpp |
| 77 | +TransferEngine(); |
| 78 | + |
| 79 | +int TransferEngine::init(const std::string &metadata_conn_string, |
| 80 | + const std::string &local_server_name, |
| 81 | + const std::string &ip_or_host_name, |
| 82 | + uint64_t rpc_port) |
| 83 | +``` |
| 84 | +
|
| 85 | +The only difference is that the `local_server_name` parameter must now include the physical NPU card ID. The format changes from `ip:port` to `ip:port:npu_x`, e.g., `"0.0.0.0:12345:npu_2"`. |
| 86 | +
|
| 87 | +> **Note**: This extension of the `local_server_name` is used internally by Ascend Transport without modifying Mooncake's external API. The `segment_desc_name` in metadata remains in the original format (`ip:port`). Therefore, each NPU card must use a unique port that is not occupied. |
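
The naming rule above can be sketched in a few lines of C++. This is only an illustration of the `ip:port:npu_x` format and of how the `ip:port` segment name relates to it. The type `AscendServerName` and the functions `make_server_name`, `parse_server_name`, and `segment_desc_name` are hypothetical helpers, not part of Mooncake or Ascend Transport.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper type for illustration; not part of Mooncake's API.
struct AscendServerName {
    std::string host;
    int port;
    int npu_id;
};

// Builds "ip:port:npu_x" from its parts.
std::string make_server_name(const std::string& host, int port, int npu_id) {
    assert(port > 0 && port <= 65535 && npu_id >= 0);
    return host + ":" + std::to_string(port) + ":npu_" + std::to_string(npu_id);
}

// True if s consists of 1..max_len decimal digits (keeps std::stoi in range).
static bool is_short_number(const std::string& s, std::size_t max_len) {
    if (s.empty() || s.size() > max_len) return false;
    for (unsigned char c : s) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}

// Splits "ip:port:npu_x" into its parts; returns false on any other format.
bool parse_server_name(const std::string& name, AscendServerName* out) {
    static const std::string kTag = ":npu_";
    const std::size_t tag_pos = name.rfind(kTag);
    if (tag_pos == std::string::npos || tag_pos == 0) return false;
    const std::size_t colon = name.rfind(':', tag_pos - 1);
    if (colon == std::string::npos || colon == 0) return false;
    const std::string port_str = name.substr(colon + 1, tag_pos - colon - 1);
    const std::string npu_str = name.substr(tag_pos + kTag.size());
    if (!is_short_number(port_str, 5) || !is_short_number(npu_str, 4)) return false;
    out->host = name.substr(0, colon);
    out->port = std::stoi(port_str);
    out->npu_id = std::stoi(npu_str);
    return true;
}

// The metadata segment name keeps the original "ip:port" form.
std::string segment_desc_name(const AscendServerName& n) {
    return n.host + ":" + std::to_string(n.port);
}
```

Because the segment name drops the `npu_x` suffix, two cards on the same host that shared a port would map to the same `ip:port` string, which is why each card needs its own port.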
| 88 | +
|
| 89 | +### Metadata Service |
| 90 | +
|
| 91 | +Ascend Transport is compatible with all metadata services currently supported by Mooncake, including `etcd`, `redis`, `http`, and `p2phandshake`. Upon initialization, Ascend Transport registers key NPU card information such as `device_id`, `device_ip`, `rank_id`, and `server_ip`. |
| 92 | +
|
| 93 | +### Data Transfer |
| 94 | +
|
| 95 | +Ascend Transport supports write/read semantics and automatically determines whether cross-HCCS communication is needed, selecting either HCCS or RoCE as the underlying transport protocol. Users can use the standard Mooncake `getTransferStatus` API to monitor the progress of each transfer request. |
| 96 | +
|
| 97 | +### Fault Handling |
| 98 | +
|
| 99 | +Building upon HCCL’s built-in fault handling mechanisms, Ascend Transport implements comprehensive error recovery strategies across multiple stages, including initialization, connection setup, and data transfer. It incorporates retry logic and returns precise error codes based on HCCL collective communication standards when retries fail. For detailed logs, refer to `/root/Ascend/log/plog`. |
| 100 | +
|
| 101 | +### Test Cases |
| 102 | +
|
| 103 | +Ascend Transport provides two test files: |
| 104 | +- Multi-scenario test: `mooncake-transfer-engine/example/transfer_engine_ascend_one_sided.cpp` |
| 105 | +- Performance test: `mooncake-transfer-engine/example/transfer_engine_ascend_perf.cpp` |
| 106 | +
|
| 107 | +You can configure various scenarios (e.g., 1-to-1, 1-to-2, 2-to-1) and performance tests by passing valid parameters to these programs. |
| 108 | +
|
| 109 | +#### Example Commands for Scenario Testing |
| 110 | +
|
| 111 | +**Start Initiator Node:** |
| 112 | +```bash |
| 113 | +./transfer_engine_ascend_one_sided --metadata_server=P2PHANDSHAKE --local_server_name=10.0.0.0:12345 --protocol=hccl --operation=write --segment_id=10.0.0.0:12346 --device_id=0 --mode=initiator --block_size=8388608 |
| 114 | +``` |
| 115 | + |
| 116 | +**Start Target Node:** |
| 117 | +```bash |
| 118 | +./transfer_engine_ascend_one_sided --metadata_server=P2PHANDSHAKE --local_server_name=10.0.0.0:12346 --protocol=hccl --operation=write --device_id=1 --mode=target --block_size=8388608 |
| 119 | +``` |
| 120 | + |
| 121 | +#### Example Commands for Performance Testing |
| 122 | + |
| 123 | +**Start Initiator Node:** |
| 124 | +```bash |
| 125 | +./transfer_engine_ascend_perf --metadata_server=P2PHANDSHAKE --local_server_name=10.0.0.0:12345 --protocol=hccl --operation=write --segment_id=10.0.0.0:12346 --device_id=0 --mode=initiator --block_size=8388608 |
| 126 | +``` |
| 127 | + |
| 128 | +**Start Target Node:** |
| 129 | +```bash |
| 130 | +./transfer_engine_ascend_perf --metadata_server=P2PHANDSHAKE --local_server_name=10.0.0.0:12346 --protocol=hccl --operation=write --device_id=1 --mode=target |
| 131 | +``` |
| 132 | + |
| 133 | +### Print Description |
| 134 | +To log whether each transfer request is cross-HCCS and how long it takes to execute, enable the related logs by setting the following environment variable: |
| 135 | + |
| 136 | +```bash |
| 137 | +export ASCEND_TRANSPORT_PRINT=1 |
| 138 | +``` |
| 139 | + |
| 140 | +### Notes |
| 141 | +Ascend Transport establishes a TCP connection on the host side. This connection uses port `10000 + deviceId`. Please avoid using this port for other services to prevent conflicts. |
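
As a small sketch of the port rule in the note above, the following C++ helper computes the host-side port for a device and flags service ports that would collide with it. The names `kHostTcpBasePort`, `host_tcp_port`, and `find_port_conflicts` are hypothetical and exist only for this example. Only the `10000 + deviceId` rule comes from the note.

```cpp
#include <cassert>
#include <vector>

// Base port taken from the note above; the constant name is hypothetical.
const int kHostTcpBasePort = 10000;

// Host-side TCP port used for a given device ID.
inline int host_tcp_port(int device_id) {
    assert(device_id >= 0);
    return kHostTcpBasePort + device_id;
}

// Returns every entry of service_ports that collides with the host-side port
// of one of the given devices, in the order the ports were supplied.
std::vector<int> find_port_conflicts(const std::vector<int>& device_ids,
                                     const std::vector<int>& service_ports) {
    std::vector<int> conflicts;
    for (int port : service_ports) {
        for (int id : device_ids) {
            if (port == host_tcp_port(id)) {
                conflicts.push_back(port);
                break;  // report each service port at most once
            }
        }
    }
    return conflicts;
}
```

A check like this could be run before choosing the RPC ports passed in `local_server_name`, so that none of them falls on a port reserved for a local device.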