Merged
52 commits
a92a942
Support storage unit in TransferQueue
FightingZhen Sep 23, 2025
bae27bb
Fix importance error
FightingZhen Sep 23, 2025
0c03e14
Support controller in TransferQueue (#2)
LLLLxmmm Sep 24, 2025
64de012
expose TransferQueueClient (#3)
ji-huazhong Sep 24, 2025
54e1889
Add copyright and license information
0oshowero0 Sep 24, 2025
b79c2ab
update client docstring (#5)
0oshowero0 Sep 25, 2025
d67890e
merge TransferQueue utils (#4)
zhabuye Sep 25, 2025
6a54445
[fix] Fix n_sample related problems (#8)
0oshowero0 Sep 25, 2025
fec6303
expose TransferQueue client/controller UT (#6)
zhabuye Sep 25, 2025
092d7c0
Add metadata.py and test_simple_storage_unit.py (#9)
jianjunzhong Sep 26, 2025
7bf97ed
Add reorder function to BatchMeta (#13)
LLLLxmmm Sep 28, 2025
64d49d4
Merge remote-tracking branch 'upstream/main' into main_tq_submodule
0oshowero0 Sep 28, 2025
6ec7ca9
Add TransferQueue as submodule under experimental/transfer_queue
0oshowero0 Sep 28, 2025
035d7a7
update requirements for transfer_queue
0oshowero0 Sep 28, 2025
5cfa5f6
update doc
0oshowero0 Sep 28, 2025
acbd595
[recipe] feat: Integrate TransferQueue into RayTrainer (#14)
LLLLxmmm Sep 29, 2025
02af787
fix chinese comments & add TODO (#15)
0oshowero0 Sep 29, 2025
fac98d0
update transferqueue submodule branch
0oshowero0 Sep 29, 2025
f0dbeb6
update transferqueue submodule to latest commit
0oshowero0 Sep 29, 2025
d0add3f
update transferqueue submodule to latest commit
0oshowero0 Sep 29, 2025
31814a3
simplify import
0oshowero0 Sep 29, 2025
0abcacf
fix
0oshowero0 Sep 29, 2025
be9b19c
[recipe]feat: register tq server info for each workgroup (#18)
ji-huazhong Sep 29, 2025
667bb3d
expose more API
0oshowero0 Sep 29, 2025
6eb1692
expose all API
0oshowero0 Sep 29, 2025
9d92f0a
expose all APIs
0oshowero0 Sep 29, 2025
ec91b86
expose all APIs
0oshowero0 Sep 29, 2025
7f89443
fix
0oshowero0 Sep 29, 2025
eb31070
[data] feat: Provide general decorator for DataProto <-> BatchMeta (#21)
0oshowero0 Sep 29, 2025
d802f0c
feat: Support conversion between dataproto and batchmeta (#24)
ji-huazhong Sep 30, 2025
68c28a7
remove unnecessary codes (#25)
0oshowero0 Sep 30, 2025
8af95bc
fix (#26)
baymax591 Oct 9, 2025
206fbc3
support transferqueue in agent loop (#27)
baymax591 Oct 10, 2025
39c83a3
feat: support grpo with transfer queue (#22)
jianjunzhong Oct 10, 2025
8f9773b
fix: correct batch size assignment in _dataproto_to_tensordict functi…
jianjunzhong Oct 11, 2025
af3189c
Merge pull request #31 from volcengine/main
0oshowero0 Oct 13, 2025
db0fb3e
feat: Support validating with TransferQueue (#29)
LLLLxmmm Oct 13, 2025
f2072f5
simplify the implementation of DataProto and BatchMeta conversion (#32)
ji-huazhong Oct 13, 2025
4efd166
fix: make batchmeta dataproto converter compatible with server mode (…
ji-huazhong Oct 14, 2025
136c82a
update TransferQueue submodule
0oshowero0 Oct 14, 2025
7db97f8
fix validate in agent loop (#34)
baymax591 Oct 14, 2025
7c927e5
replace submodule with pip dependencies (#35)
ji-huazhong Oct 14, 2025
d5603be
add TODOs for TransferQueue (#37)
LLLLxmmm Oct 15, 2025
340bd40
apply review suggestions (#38)
ji-huazhong Oct 15, 2025
50a1bbc
make ci happy (#39)
baymax591 Oct 16, 2025
6e3b949
more bugfix
ji-huazhong Oct 16, 2025
6462883
fix
baymax591 Oct 16, 2025
7a7a0c1
Merge pull request #40 from baymax591/main_tq_submodule
baymax591 Oct 16, 2025
9013472
Adapt for multimodal data format (#42)
LLLLxmmm Oct 17, 2025
a60ade4
Merge branch 'main' into main_tq_submodule
ji-huazhong Oct 18, 2025
77c9a0e
apply review suggestions
ji-huazhong Oct 18, 2025
20d0f98
simpify the implementation of main_ppo in recipe/transfer_queue
ji-huazhong Oct 20, 2025
128 changes: 128 additions & 0 deletions docs/data/transfer_queue.md
@@ -0,0 +1,128 @@
# TransferQueue Data System

Last updated: 09/28/2025.

This doc introduces [TransferQueue](https://github.com/TransferQueue/TransferQueue), an asynchronous streaming data management system for efficient post-training.


<h2 id="overview"> Overview</h2>

TransferQueue is a high-performance data storage and transfer system with panoramic data visibility and streaming scheduling capabilities, optimized for efficient dataflow in post-training workflows.

<p align="center">
<img src="https://cdn.nlark.com/yuque/0/2025/png/23208217/1758696193102-a5654375-65a1-4e06-9c63-142b59df90b8.png" width="70%">
</p>


TransferQueue offers **fine-grained, sample-level** data management capabilities, serving as a data gateway that decouples explicit data dependencies across computational tasks. This enables a divide-and-conquer approach, significantly simplifying the design of the algorithm controller.


<p align="center">
<img src="https://cdn.nlark.com/yuque/0/2025/png/23208217/1758696791245-fa7baf96-46af-4c19-8606-28ffadc4556c.png" width="70%">
</p>




<h2 id="components"> Components</h2>



### Control Plane: Panoramic Data Management

In the control plane, `TransferQueueController` tracks the **production status** and **consumption status** of each training sample as metadata. When all the required data fields are ready (i.e., written to the `TransferQueueStorage`), we know that this data sample can be consumed by downstream tasks.

For consumption status, we record a consumption record for each computational task (e.g., `generate_sequences`, `compute_log_prob`, etc.). Therefore, even if different computational tasks require the same data fields, they can consume the data independently without interfering with each other.


<p align="center">
<img src="https://cdn.nlark.com/yuque/0/2025/png/23208217/1758696820173-456c1784-42ba-40c8-a292-2ff1401f49c5.png" width="70%">
</p>
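The bookkeeping described above can be sketched in a few lines of pure Python. This is a toy stand-in, not the actual `TransferQueueController` API; the class and method names (`ToyControllerMeta`, `mark_produced`, `ready_rows`) are hypothetical:

```python
class ToyControllerMeta:
    """Toy sketch of per-sample production/consumption metadata."""

    def __init__(self, num_samples: int):
        # production status: which data fields are ready for each sample (row)
        self.produced: list[set[str]] = [set() for _ in range(num_samples)]
        # consumption status: per computational task, which rows it has consumed
        self.consumed: dict[str, set[int]] = {}

    def mark_produced(self, row: int, field: str) -> None:
        self.produced[row].add(field)

    def ready_rows(self, fields: list[str], task_name: str) -> list[int]:
        """Rows whose required fields are all produced and that this task has
        not consumed yet. Each task keeps its own consumption record, so tasks
        that need the same fields do not interfere with each other."""
        seen = self.consumed.setdefault(task_name, set())
        ready = [i for i, f in enumerate(self.produced) if set(fields) <= f and i not in seen]
        seen.update(ready)
        return ready


meta = ToyControllerMeta(num_samples=4)
for row in (0, 2):
    meta.mark_produced(row, "prompts")
    meta.mark_produced(row, "responses")

# two different tasks consume the same ready rows independently
print(meta.ready_rows(["prompts", "responses"], "compute_log_prob"))  # [0, 2]
print(meta.ready_rows(["prompts", "responses"], "compute_values"))    # [0, 2]
print(meta.ready_rows(["prompts", "responses"], "compute_log_prob"))  # []
```

The key design point this mirrors is that readiness is tracked per row and consumption per (task, row) pair, which is what makes sample-level streaming dispatch possible.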


> In the future, we plan to support **load-balancing** and **dynamic batching** capabilities in the control plane. In addition, we will support data management for disaggregated frameworks, where each rank retrieves its own data rather than being coordinated by a single controller.

### Data Plane: Distributed Data Storage

In the data plane, `TransferQueueStorageSimpleUnit` serves as a naive storage unit based on CPU memory, responsible for the actual storage and retrieval of data. Each storage unit can be deployed on a separate node, allowing for distributed data management.

`TransferQueueStorageSimpleUnit` employs a 2D data structure as follows:

- Each row corresponds to a training sample, assigned a unique index within the corresponding global batch.
- Each column represents the input/output data fields for computational tasks.

This data structure design is motivated by the computational characteristics of the post-training process, where each training sample is generated in a relayed manner across task pipelines. It provides an accurate addressing capability, which allows fine-grained, concurrent data read/write operations in a streaming manner.

<p align="center">
<img src="https://cdn.nlark.com/yuque/0/2025/png/23208217/1758696805154-3817011f-84e6-40d0-a80c-58b7e3e5f6a7.png" width="70%">
</p>
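The 2D layout above can be pictured with a minimal in-memory stand-in (a toy, not the real `TransferQueueStorageSimpleUnit`, which adds networking and concurrency control; `ToyStorageUnit` and its methods are hypothetical names):

```python
class ToyStorageUnit:
    """Toy 2D store: rows are sample indices within a global batch, columns are
    named data fields. Each cell is addressed independently, so different tasks
    can read/write different fields of the same row concurrently."""

    def __init__(self):
        self.cells: dict[tuple[int, str], object] = {}

    def put(self, row: int, field: str, value) -> None:
        self.cells[(row, field)] = value

    def get(self, rows: list[int], fields: list[str]) -> dict[str, list]:
        # field-wise retrieval: one list of cell values per requested field
        return {f: [self.cells[(r, f)] for r in rows] for f in fields}


store = ToyStorageUnit()
store.put(0, "prompts", "p0")
store.put(1, "prompts", "p1")
store.put(0, "responses", "r0")  # written later by another task, same row

print(store.get([0, 1], ["prompts"]))  # {'prompts': ['p0', 'p1']}
```

Because the unit of addressing is a (row, field) cell rather than a whole batch, a downstream task can start reading a sample's outputs as soon as its own required columns are filled.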


> In the future, we plan to implement a **general storage abstraction layer** to support various storage backends. Through this abstraction, we hope to integrate high-performance storage solutions such as [MoonCakeStore](https://github.com/kvcache-ai/Mooncake) to support device-to-device data transfer through RDMA, further enhancing data transfer efficiency for large-scale data.


### User Interface: Asynchronous & Synchronous Client


The interaction workflow of the TransferQueue system is as follows:

1. A process sends a read request to the `TransferQueueController`.
2. `TransferQueueController` scans the production and consumption metadata for each sample (row) and dynamically assembles micro-batch metadata according to the load-balancing policy. This mechanism enables sample-level data scheduling.
3. The process retrieves the actual data from distributed storage units using the metadata provided by the controller.

To simplify the usage of TransferQueue, we have encapsulated this process into `AsyncTransferQueueClient` and `TransferQueueClient`. These clients provide asynchronous and synchronous interfaces for data transfer, allowing users to easily integrate TransferQueue into their frameworks.


> In the future, we will provide a `StreamingDataLoader` interface for disaggregated frameworks as discussed in [RFC#2662](https://github.com/volcengine/verl/discussions/2662). Leveraging this abstraction, each rank can automatically get its own data like `DataLoader` in PyTorch. The TransferQueue system will handle the underlying data scheduling and transfer logic caused by different parallelism strategies, significantly simplifying the design of disaggregated frameworks.


<h2 id="show-cases"> Show Cases</h2>

### General Usage

The primary interaction points are `AsyncTransferQueueClient` and `TransferQueueClient`, serving as the communication interface with the TransferQueue system.

Core interfaces:

- `(async_)get_meta(data_fields: list[str], batch_size: int, global_step: int, get_n_samples: bool, task_name: str) -> BatchMeta`
- `(async_)get_data(metadata: BatchMeta) -> TensorDict`
- `(async_)put(data: TensorDict, metadata: BatchMeta, global_step)`
- `(async_)clear(global_step: int)`
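As a hedged sketch of how these calls compose, the following toy in-memory client mirrors the synchronous signatures listed above while collapsing the controller and storage into local dictionaries (the real clients talk to remote processes, and a plain dict stands in for `TensorDict` here; `ToySyncClient` is a hypothetical name):

```python
class ToySyncClient:
    """Toy stand-in for TransferQueueClient: data is bucketed by global_step,
    and 'metadata' is reduced to (fields, step)."""

    def __init__(self):
        self.steps: dict[int, dict[str, list]] = {}

    def put(self, data: dict, metadata=None, global_step: int = 0) -> None:
        self.steps.setdefault(global_step, {}).update(data)

    def get_meta(self, data_fields, batch_size, global_step, get_n_samples=False, task_name=""):
        # a real controller would scan production/consumption status here
        # and assemble a load-balanced micro-batch; the toy just echoes the request
        return (data_fields, global_step)

    def get_data(self, metadata) -> dict:
        fields, step = metadata
        return {f: self.steps[step][f] for f in fields}

    def clear(self, global_step: int) -> None:
        self.steps.pop(global_step, None)


client = ToySyncClient()
client.put({"prompts": ["p0", "p1"]}, global_step=0)
meta = client.get_meta(["prompts"], batch_size=2, global_step=0, task_name="generate_sequences")
print(client.get_data(meta))  # {'prompts': ['p0', 'p1']}
client.clear(0)
```

The two-phase shape (`get_meta` first, then `get_data`) is the point: lightweight metadata flows through the scheduler while bulk data moves directly between the caller and storage.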


We will soon release a detailed tutorial and API documentation.


### verl Example


The primary motivation for integrating TransferQueue into verl is to **alleviate the data transfer bottleneck of the single-controller `RayPPOTrainer`**. Currently, all `DataProto` objects must be routed through `RayPPOTrainer`, creating a single-point bottleneck for the whole post-training system.

![verl_dataflow_DataProto](https://cdn.nlark.com/yuque/0/2025/jpeg/23208217/1758704289414-bcc54228-716b-4d4a-ad3b-f9ace6d10fcf.jpeg)

Leveraging TransferQueue, we separate experience data transfer from metadata dispatch by:

- Replacing `DataProto` with `BatchMeta` (metadata) and `TensorDict` (actual data) structures
- Preserving verl's original Dispatch/Collect logic via BatchMeta (maintaining single-controller debuggability)
- Accelerating data transfer by TransferQueue's distributed storage units

![verl_dataflow_TransferQueue](https://cdn.nlark.com/yuque/0/2025/jpeg/23208217/1758704301666-0807dc06-766c-4a2d-9cde-889a6bb56b34.jpeg)
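The dataflow change can be pictured with a toy conversion layer: a decorator that lets a worker method be written against actual data while only metadata travels through the controller. This is an illustrative sketch, not verl's actual decorator or the real `BatchMeta`/`TensorDict` types; all names here (`with_data_fetch`, `_ToyClient`) are hypothetical:

```python
from functools import wraps


class _ToyClient:
    """Minimal in-memory stand-in for a TransferQueue client."""

    def __init__(self):
        self.store: dict[str, list] = {}

    def get_data(self, metadata):
        # metadata is just a list of field names in this toy
        return {f: self.store[f] for f in metadata}

    def put(self, data, metadata=None):
        self.store.update(data)


def with_data_fetch(client):
    """Toy decorator: the controller dispatches only metadata; the worker-side
    wrapper pulls bulk data from storage, runs the task, and writes results
    back, so only lightweight metadata flows back to the controller."""

    def deco(fn):
        @wraps(fn)
        def wrapper(metadata):
            data = client.get_data(metadata)  # fetch actual data by metadata
            client.put(fn(data))              # write task outputs back to storage
            return metadata                   # only metadata returns upstream
        return wrapper

    return deco


client = _ToyClient()
client.put({"prompts": ["p0", "p1"]})


@with_data_fetch(client)
def generate_sequences(data):
    return {"responses": [p + "-resp" for p in data["prompts"]]}


generate_sequences(["prompts"])
print(client.store["responses"])  # ['p0-resp', 'p1-resp']
```

Because the wrapped function keeps its original "data in, data out" signature, the controller-side Dispatch/Collect logic can stay unchanged while bulk tensors bypass it entirely.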


You may refer to the [recipe](https://github.com/TransferQueue/TransferQueue/tree/dev/recipe/simple_use_case), where we mimic the verl usage in both async & sync scenarios.





<h2 id="citation"> Citation</h2>
Please kindly cite our paper if you find this repo useful:

```bibtex
@article{han2025asyncflow,
  title={AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training},
  author={Han, Zhenyu and You, Ansheng and Wang, Haibo and Luo, Kui and Yang, Guang and Shi, Wenqi and Chen, Menglong and Zhang, Sicheng and Lan, Zeshun and Deng, Chunshi and others},
  journal={arXiv preprint arXiv:2507.01663},
  year={2025}
}
```
1 change: 1 addition & 0 deletions docs/index.rst
@@ -125,6 +125,7 @@ verl is fast with:
advance/one_step_off
advance/agent_loop
advance/fully_async
data/transfer_queue.md

.. toctree::
:maxdepth: 1
76 changes: 76 additions & 0 deletions recipe/transfer_queue/agent_loop.py
@@ -0,0 +1,76 @@
# Copyright 2025 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import ray
from transfer_queue import BatchMeta

import verl.experimental.agent_loop.agent_loop as agent_loop


class AgentLoopManager(agent_loop.AgentLoopManager):
    def generate_sequences(self, prompts: BatchMeta) -> BatchMeta:
        """Split input batch and dispatch to agent loop workers.

        Args:
            prompts (BatchMeta): Input batch.

        Returns:
            BatchMeta: Output batch metadata.
        """

        if self.rm_micro_batch_size and len(prompts) % self.rm_micro_batch_size != 0:
            raise ValueError(
                f"The length of prompts {len(prompts)} is not divisible by "
                f"rm_micro_batch_size {self.rm_micro_batch_size}"
            )
        if self.config.actor_rollout_ref.rollout.free_cache_engine:
            self.wake_up()
        chunks = prompts.chunk(len(self.agent_loop_workers))
        outputs = ray.get(
            [
                worker.generate_sequences.remote(chunk)
                for worker, chunk in zip(self.agent_loop_workers, chunks, strict=True)
            ]
        )
        output = BatchMeta.concat(outputs)
        if self.config.actor_rollout_ref.rollout.free_cache_engine:
            self.sleep()

        # calculate performance metrics
        metrics = [o.extra_info.pop("metrics") for o in outputs]  # list[list[dict[str, str]]]
        timing = self._performance_metrics(metrics, output)

        output.set_extra_info("timing", timing)
        return output

    def _performance_metrics(self, metrics: list[list[dict[str, str]]], output: BatchMeta) -> dict[str, float]:
        timing = {}
        t_generate_sequences = np.array([metric["generate_sequences"] for chunk in metrics for metric in chunk])
        t_tool_calls = np.array([metric["tool_calls"] for chunk in metrics for metric in chunk])
        timing["agent_loop/generate_sequences/min"] = t_generate_sequences.min()
        timing["agent_loop/generate_sequences/max"] = t_generate_sequences.max()
        timing["agent_loop/generate_sequences/mean"] = t_generate_sequences.mean()
        timing["agent_loop/tool_calls/min"] = t_tool_calls.min()
        timing["agent_loop/tool_calls/max"] = t_tool_calls.max()
        timing["agent_loop/tool_calls/mean"] = t_tool_calls.mean()

        return timing

    def create_transferqueue_client(self, controller_infos, storage_infos, role):
        ray.get(
            [
                worker.create_transferqueue_client.remote(controller_infos, storage_infos, role)
                for worker in self.agent_loop_workers
            ]
        )
11 changes: 11 additions & 0 deletions recipe/transfer_queue/config/transfer_queue_ppo_trainer.yaml
@@ -0,0 +1,11 @@
hydra:
  searchpath:
    - file://verl/trainer/config

defaults:
  - ppo_trainer
  - _self_

# config for TransferQueue
transfer_queue:
  enable: True