 
ROLL is an efficient and user-friendly RL library for Large Language Models (LLMs) that runs on large-scale GPU resources. It significantly enhances LLM performance in key areas such as human preference alignment, complex reasoning, and multi-turn agentic interaction scenarios.
Leveraging a multi-role distributed architecture with Ray for flexible resource allocation and heterogeneous task scheduling, ROLL integrates cutting-edge technologies like Megatron-Core, SGLang and vLLM to accelerate model training and inference.
| Updates |
|---|
| [10/23/2025] Our papers are released; see Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning and Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization. |
| [10/14/2025] Our paper is released; see Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony. The code will be released soon. |
| [09/28/2025] Ascend NPU support; see the usage guide. |
| [09/25/2025] Our paper is released; see RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training. |
| [09/24/2025] Support for the Wan2_2 Reward FL pipeline. Explore the new capabilities! |
| [09/23/2025] ROLL aligns with the GEM environment definition, providing agentic Tool Use training capabilities; see the ToolUse docs. |
| [09/16/2025] Qwen3-Next model training is supported; refer to the configuration. |
| [09/04/2025] ROLL supports vLLM dynamic FP8 rollout and remove_padding for acceleration. |
| [08/28/2025] ROLL supports the SFT pipeline; refer to the configuration. |
| [08/13/2025] ROLL supports AMD GPUs with an out-of-the-box Docker image and Dockerfile, plus specific YAMLs under the examples/ directory. Please refer to Installation. |
| [08/11/2025] Our paper is released; see Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning. |
| [08/10/2025] Agentic RL supports stepwise learning, like GiGPO; Distill supports VLM. Explore the new capabilities! |
| [08/06/2025] The ROLL PPT is now available; see Slides. |
| [07/31/2025] Refactored the agentic RL design. Agentic RL async training is supported. Explore the new capabilities! |
| [07/31/2025] Support for DistillPipeline/DpoPipeline, LoRA, and GSPO. |
| [06/25/2025] Support for thread env for env scaling and for the Qwen2.5 VL agentic pipeline. |
| [06/13/2025] Support for the Qwen2.5 VL RLVR pipeline; mcore upgraded to version 0.12. |
| [06/09/2025] The ROLL tech report is now available! Access the report here. |
| [06/08/2025] Supports Qwen3 (8B/14B/32B), Qwen3-MoE (30A3/235A22), and Qwen2.5 (7B/14B/32B/72B) LLM models. |
| [05/30/2025] Training RLVR and Agentic RL with ROLL is now available! Explore the new capabilities. |
Installation
Config System Explanation
Debugging Guide
Trackers and Metrics
Checkpoint Saving and Resuming Guide
Converting MCoreAdapter Models to Hugging Face Format
Quick Start: Single-Node Deployment Guide
Quick Start: Multi-Node Deployment Guide
Frequently Asked Questions
RLVR Pipeline
Agentic Pipeline
Agentic Comprehensive Guide
Distill Pipeline
Reinforce++
TOPR
GiGPO
PPO
Lite PPO
GRPO
GSPO
RAFT++
StarPO
RewardFL
Agentic Asynchronous Parallel Rollout
Agentic Asynchronous Training Feature
Resource Config
GPU Time-Division Multiplexing Control
- Multi-task RL Training (RLVR): Covers mathematics, coding, general reasoning, open-ended Q&A, instruction following, etc.
  - Flexible `domain_batch_size` distribution control.
  - Sample-level asynchronous parallel rollout, asynchronous reward calculation, and dynamic sampling.
  - Asynchronous training under implementation.
- Agentic RL: Multi-turn interaction capabilities for games, multi-turn dialogues, tool use, etc.
  - Environment-level asynchronous parallel rollout.
  - Supports asynchronous training.
  - Multi-turn interaction rollout supports local debugging, which speeds up development of multi-turn interaction applications.
  - Supports TrajectoryWise (StarPO) and StepWise (GiGPO) training paradigms.
- Algorithm-Friendly: Provides flexible and rich RL strategy configurations by default.
  - Over 20 reinforcement learning strategy options, such as reward normalization, reward clipping, various advantage estimation methods, etc.
  - Out-of-the-box support for reinforcement learning algorithms, such as PPO, GRPO, Reinforce++, TOPR, RAFT++, GSPO, etc.
- Rich Training and Inference Engine: Ray-based multi-role distributed architecture; Strategy abstraction unifies various backends, enabling easy operation from single machines to thousands-of-GPU clusters.
  - Inference/Generation supports vLLM, SGLang.
  - Training supports DeepSpeed (ZeRO), Megatron-LM 5D parallelism (mcore-adapter, dp/tp/pp/cp/ep), FSDP under implementation.
  - Extreme offload/reload capabilities.
  - Supports LoRA training.
  - Supports FP8 rollout (FP8 inference for LLM as judge, FP8 rollout with BF16 training under development).
- AutoDeviceMapping: Supports custom device mapping for different roles, flexibly managing colocated and disaggregated deployments.
- Observability: Integrated with SwanLab / WandB / TensorBoard, with performance tracking for each domain and reward type.
- Rich Post-training Technical Support:
  - Agentic RL LLM & VLM
  - RLVR LLM & VLM
  - Distill Pipeline LLM & VLM
  - DPO Pipeline
  - SFT Pipeline under development
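
To make the `domain_batch_size` distribution control mentioned in the RLVR feature above more concrete, here is a minimal sketch of splitting one rollout batch across task domains by ratio. The helper name `split_batch_by_domain`, the domain names, and the ratios are all hypothetical and chosen for illustration; this is not ROLL's API, and the real options are described in the configuration docs.

```python
from typing import Dict


def split_batch_by_domain(batch_size: int, ratios: Dict[str, float]) -> Dict[str, int]:
    """Split `batch_size` across domains in proportion to `ratios`.

    Uses largest-remainder rounding so the per-domain sizes always sum to
    `batch_size`. Hypothetical helper for illustration; not part of ROLL.
    """
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    if not ratios or any(r < 0 for r in ratios.values()):
        raise ValueError("ratios must be non-empty and non-negative")
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("at least one ratio must be positive")

    # Ideal (fractional) share of the batch for each domain.
    exact = {d: batch_size * r / total for d, r in ratios.items()}
    # Round every share down first (int() floors because shares are >= 0).
    sizes = {d: int(x) for d, x in exact.items()}
    # Hand the remaining samples to the domains with the largest remainders.
    leftover = batch_size - sum(sizes.values())
    by_remainder = sorted(exact, key=lambda d: exact[d] - sizes[d], reverse=True)
    for d in by_remainder[:leftover]:
        sizes[d] += 1
    return sizes


if __name__ == "__main__":
    # Assumed example mix of domains; the names and ratios are made up.
    print(split_batch_by_domain(128, {"math": 0.5, "code": 0.3, "general": 0.2}))
    # -> {'math': 64, 'code': 38, 'general': 26}
```

Largest-remainder rounding is only one possible design choice; it keeps the total batch size fixed, which matters when the batch must divide evenly across workers.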
 
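The strategy options listed above include reward clipping, reward normalization, and several advantage estimation methods. As a rough illustration of how such options compose, the sketch below clips raw rewards and then computes a simplified GRPO-style group-normalized advantage for the responses sampled from one prompt. The function names, the clipping range, the use of the population standard deviation, and the `eps` value are assumptions made for this example; it is not ROLL's implementation.

```python
from statistics import mean, pstdev
from typing import List


def clip_rewards(rewards: List[float], low: float, high: float) -> List[float]:
    """Clamp each reward into the closed interval [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return [min(max(r, low), high) for r in rewards]


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Simplified GRPO-style advantage: (r - group mean) / (group std + eps).

    `rewards` holds the scalar rewards of all responses sampled for the same
    prompt. The population standard deviation is an assumption of this sketch.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    raw = [-5.0, 0.5, 7.0, 0.5]          # made-up rewards for four responses
    clipped = clip_rewards(raw, -1.0, 1.0)
    print(clipped)                        # [-1.0, 0.5, 1.0, 0.5]
    print(group_normalized_advantages(clipped))
```

When every response in a group receives the same reward, the advantages are all zero, so that group contributes no learning signal under this scheme.
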
We are continuously working to expand ROLL's capabilities:
- Async RLVR pipeline: For even more efficient and streamlined asynchronous operations.
- FSDP2: Integrating the latest Fully Sharded Data Parallel techniques.
- Support DeepseekV3: Adding compatibility for the newest Deepseek models.
- IPRO: A novel video diffusion framework using reinforcement learning to enhance identity preservation in human-centric I2V generation, optimizing diffusion models with face identity scorer and KL-divergence regularization.
- TaoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for Taobao Search Relevance, with SRPO (hybrid reward model + offline verifier), diversified data filtering, and multi-stage curriculum learning.
- EARL: Efficient Agentic RL Systems for LLMs, introducing a dynamic parallelism selector and a layout-aware data dispatcher to boost throughput, reduce memory and data movement bottlenecks, enabling stable large-scale agentic RL without hard context-length limits.
- LiveThinking: Real-time reasoning for AI-powered livestreaming by distilling a 670B teacher LLM to a 30B MoE (3B active) via Rejection Sampling Fine-Tuning, then compressing reasoning with GRPO; delivers sub-second latency and ~30x compute reduction, with gains in response correctness (3.3%), helpfulness (21.8%), and GMV in Taobao Live Digital Live Service.
- TaoSR-AGRL: Adaptive Guided Reinforcement Learning for LLM-based e-commerce relevance, introducing Rule-aware Reward Shaping and Adaptive Guided Replay to improve long-horizon reasoning, rule adherence, and training stability in Taobao Search; deployed in main search handling hundreds of millions of users.
- RecGPT: a next-generation, LLM-driven framework that places user intent at the core of recommender systems, fostering a more sustainable and mutually beneficial ecosystem.
- TaoSR1: A novel LLM framework directly deploying Chain-of-Thought (CoT) reasoning for e-commerce query-product relevance prediction, overcoming deployment challenges for superior performance.
- AIGB-Pearl: a novel auto-bidding method that integrates generative planning and policy optimization, utilizing an LLM-enhanced trajectory evaluator to iteratively refine bidding strategies for state-of-the-art advertising performance.
ROLL is inspired by the design of OpenRLHF, VeRL, Nemo-Aligner, and RAGEN.
The project is developed by Alibaba TAOBAO & TMALL Group and Alibaba Group. The code is distributed under the Apache License (Version 2.0). This product contains various third-party components under other open-source licenses. See the NOTICE file for more information.
The following repositories have been used in ROLL, either in their close-to-original form or as an inspiration:
- NVIDIA/Megatron-LM
- microsoft/DeepSpeed
- sgl-project/sglang
- vllm-project/vllm
- modelscope/DiffSynth-Studio
If you use ROLL in your research or project, please consider citing us:
@article{wang2025reinforcement,
  title={Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library},
  author={Wang, Weixun and Xiong, Shaopan and Chen, Gengru and Gao, Wei and Guo, Sheng and He, Yancheng and Huang, Ju and Liu, Jiaheng and Li, Zhendong and Li, Xiaoyang and others},
  journal={arXiv preprint arXiv:2506.06122},
  year={2025}
}

ROLL is a project jointly developed by Taotian Future Life Lab and Aicheng Technology, with a strong emphasis on pioneering the future of Reinforcement Learning (RL). Our mission is to explore and shape innovative forms of future living powered by advanced RL technologies. If you are passionate about the future of RL and want to be part of its evolution, we warmly welcome you to join us! Learn more about the ROLL Team through our official channels below.
We are HIRING!
- Post Training Infra R&D Engineer JD link
- Large Model Training Expert:
- Infra Research Intern JD link