
L0: Reinforcement Learning to Become General Agents


πŸ€– A scalable, end-to-end training pipeline for general-purpose agents


🎯 Overview

L0 is a scalable, end-to-end training pipeline for general-purpose agents. We provide:

  • An RL training framework for complex agentic environments, featuring a low-cost, extensible, and sandboxed concurrent agent worker pool.
  • A generalist agentic scaffold, the Notebook Agent (NB-Agent), which operates in a "code-as-action" fashion via a Read-Eval-Print Loop (REPL) backed by a Jupyter kernel.
  • A simple yet effective agentic multi-turn training recipe with agentic policy gradient and verifiable multi-step rewards.
  • A series of models trained with L0, including L0-4B (Qwen3), L0-7B (Qwen2.5), and L0-32B (Qwen2.5). We claim these models are capable of general agentic tasks; a deep-searcher case using L0-32B (Qwen2.5) is provided in the examples.


πŸ† Key Results

Using L0, we significantly improved model performance on multiple benchmarks:

(Figure: benchmark improvements)

  • "Model name + NB-Agent" means the model is evaluated directly with NB-Agent, without training.

L0 also achieves competitive performance compared with other works:

(Figure: comparison with other works)

  • All comparisons are with 7B models.

🧠 Tech Details

Training pipeline

(Figure: training pipeline)

Algorithm

  • Agentic Policy Gradient: Optimizes the policy gradient for agents by treating a complete "think-code" sequence as a single action
  • Verifiable Reward Function: Provides multi-faceted rewards for answer correctness, format compliance, and code execution (see the sketch after this list)
  • Strict On-Policy Training: Uses a pure on-policy approach with a KL-divergence penalty for stable learning
  • DAPO-Inspired Rejection Sampling: An advanced rejection-sampling strategy for improved policy optimization
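
The sketch below illustrates how such a multi-faceted reward can be combined into a single scalar. It is a minimal conceptual example: the outcome fields, weights, and function name are hypothetical and do not reflect L0's actual implementation.

from dataclasses import dataclass

@dataclass
class StepOutcome:
    answer_correct: bool   # final answer matches the verifiable ground truth
    format_ok: bool        # the "think-code" output follows the expected format
    code_executed: bool    # the generated code ran without errors in the REPL

def verifiable_reward(o: StepOutcome, w_answer=1.0, w_format=0.1, w_exec=0.1) -> float:
    """Combine answer correctness, format compliance, and code execution into one scalar."""
    return (w_answer * o.answer_correct
            + w_format * o.format_ok
            + w_exec * o.code_executed)

# A step with valid formatting and successful execution but a wrong answer scores 0.2.
print(verifiable_reward(StepOutcome(answer_correct=False, format_ok=True, code_executed=True)))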

Infrastructure

  • Decoupled Architecture: Separates CPU agent workers from a GPU inference server for independent scaling
  • Flexible Server-Client Architecture: Scalable agent task execution with FastAPI-based orchestration; see the trajectory sampler design document for more details (an illustrative sketch follows this list)
  • Lightweight Sandboxing: Uses Bubblewrap for secure, low-overhead, and parallel agent environments
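
As a rough illustration of this server-client split, the sketch below shows a CPU-side task server exposing a FastAPI endpoint that queues rollout tasks while a separately addressed GPU inference server handles generation. The endpoint and field names are hypothetical and are not the project's actual API.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutTask(BaseModel):
    task_id: str
    prompt: str
    inference_server_url: str  # GPU inference server that handles generation

@app.post("/tasks")
async def submit_task(task: RolloutTask):
    # A real server would dispatch the task to a sandboxed agent worker
    # (e.g., a Bubblewrap-isolated process) and track its trajectory.
    return {"task_id": task.task_id, "status": "queued"}

# Run with: uvicorn task_server:app --host 0.0.0.0 --port 8000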

NB-Agent

NB-Agent is a general-purpose agent scaffold that follows the "code-as-action" paradigm: it works in a REPL, interacting with its environment by generating code snippets that are executed in a Jupyter Notebook environment.

(Figure: NB-Agent architecture)
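
To make the code-as-action loop concrete, the snippet below sketches the underlying mechanism: executing a generated code cell in a Jupyter kernel with jupyter_client and reading back its output as the observation. This is only an illustration of the idea; NB-Agent's own scaffold adds prompting, memory, and tool handling on top.

from jupyter_client.manager import KernelManager

# Start a Python kernel that serves as the agent's persistent REPL environment.
km = KernelManager(kernel_name="python3")
km.start_kernel()
kc = km.client()
kc.start_channels()
kc.wait_for_ready()

# A "think-code" action boils down to executing a generated code snippet...
msg_id = kc.execute("result = 2 + 2\nprint(result)")

# ...and feeding the execution output back to the agent as its observation.
while True:
    msg = kc.get_iopub_msg(timeout=30)
    if msg["parent_header"].get("msg_id") != msg_id:
        continue
    if msg["header"]["msg_type"] == "stream":
        print(msg["content"]["text"], end="")  # -> 4
    elif (msg["header"]["msg_type"] == "status"
          and msg["content"]["execution_state"] == "idle"):
        break

kc.stop_channels()
km.shutdown_kernel()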

πŸš€ Quick Start

Installation:

# Clone this repository
git clone --recurse-submodules https://github.com/cmriat/l0.git
cd l0

We use Pixi for package management.

  • Pixi is a fast, reliable, and cross-platform package manager for Python and other languages. Visit pixi.sh to learn more and install it.
# Install Pixi if you haven't already
# curl -fsSL https://pixi.sh/install.sh | bash

# Bypass the default CUDA version check 
# export CONDA_OVERRIDE_CUDA=12.9 

# Enter the environment
pixi shell

Example: Training NB-Agent with REINFORCE++

This example demonstrates training an NB-Agent using the REINFORCE++ algorithm on QA datasets.

Prerequisites

1. Prepare dataset

python examples/data_preprocess/l0_qa.py --local_dir ./dataset

2. Start Agent Execution Manager Server

On one or more remote machines (CPU-only):

bash examples/start_task_server.sh

3. Configure Remote Server URLs

Edit the training script to specify remote server URLs:

# File: examples/nb_agent_training/train_qa_reinforcepp*.sh
# Line: actor_rollout_ref.traj_sampler.remote_exec_server_url=['http://IP1:8000', 'http://IP2:8000', 'http://IP3:8000']

4. API Keys Configuration

Some NB-Agent tools require API keys for external services. The following services are required for QA training:

  • Content Processing: Jina (required)
  • Search Services: At least one of Exa, Firecrawl, or Serper (Serper is recommended; if you use a search tool other than Serper, please modify TOOL_FACTORY_MAP in the tool specs factory).

Create a .env file in the root directory with the configuration for the dependent services. You can use the .env.example file as a template:

cp .env.example .env

Then, edit the .env file to add your API keys.
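
As an optional sanity check that the keys are picked up, you could load the .env file and verify the expected variables, for example with python-dotenv. The variable names below are assumptions for illustration only; use the names given in .env.example.

import os
from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # reads .env from the current working directory

# Hypothetical variable names -- check .env.example for the actual ones.
if not os.getenv("JINA_API_KEY"):
    raise SystemExit("JINA_API_KEY is missing (Jina is required for content processing)")

search_keys = ["SERPER_API_KEY", "EXA_API_KEY", "FIRECRAWL_API_KEY"]
if not any(os.getenv(k) for k in search_keys):
    raise SystemExit("Set at least one search key: " + ", ".join(search_keys))

print("API keys look configured.")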

5. Running in Container

Since L0 uses Bubblewrap to isolate the environments of agent rollouts, running it in a container requires giving the container the following:

  • security-opt: apparmor=unconfined
  • CAPABILITY: ALL

Or you could use --privileged to give the container all capabilities, which is not recommended for security reasons.

Training Scenarios

⚠️ Important: Start the remote task server before running any training scripts. The task server can be launched locally or on a remote CPU-only machine. Ensure that both the training script and the task server run under the same environment and workspace (this may require a distributed file system to share paths between instances).

Choose Your Training Configuration

Select the appropriate training script based on your hardware setup and model size requirements.

1. Single Node Training

For single-node setups with limited GPU resources:

  • 0.6B Model (Qwen3-0.6B)

    • Hardware Requirements: 1 node, 1 GPU
    • Command:
      bash examples/nb_agent_training/train_qa_reinforcepp_dynamic_0_6b.sh
  • 4B Model (Qwen3-4B)

    • Hardware Requirements: 1 node, 8 GPUs
    • Command:
      bash examples/nb_agent_training/train_qa_reinforcepp_dynamic_4b.sh

2. Multi-Node Training

For larger models requiring distributed training, you need to set up a Ray cluster first:

Step 2.1: Launch Ray Cluster

# On the head node:
ray start --head --dashboard-host=0.0.0.0

# On worker nodes:
ray start --address=YOUR_HEAD_NODE_IP:6379
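
Before submitting jobs, you can optionally verify from Python that the cluster sees all nodes and GPUs; this quick check is not part of the training scripts.

import ray

ray.init(address="auto")        # attach to the running cluster started above
print(ray.cluster_resources())  # should report the expected CPU and GPU totals
ray.shutdown()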

Step 2.2: Submit Training Jobs

  • 7B Model (Qwen2.5-7B-Instruct)

    • Hardware Requirements: 2 nodes, 16 GPUs
    • Command:
      RAY_ADDRESS=YOUR_HEAD_NODE_IP:8265 ray job submit examples/nb_agent_training/train_qa_reinforcepp_dynamic_7b.sh
  • 32B Model (Qwen2.5-32B-Instruct)

    • Hardware Requirements: 8 nodes, 64 GPUs
    • Command:
      RAY_ADDRESS=YOUR_HEAD_NODE_IP:8265 ray job submit examples/nb_agent_training/train_qa_reinforcepp_dynamic_32b.sh

Using NB-Agent Scaffold Alone

  • For ease of use, we have packaged NB-Agent. You can install and use it separately via pixi install nbagent.
  • In our tests, existing frontier models like Gemini and Claude have demonstrated powerful capabilities under NB-Agent without training.
  • You could refer to the NB-Agent Example for a deep searcher example using NB-Agent.

Model conversion

We directly adapt the conversion scripts from verl; see examples/model_converter.py. Please refer to the verl model converter document for usage after training.

Serving your own model

Since NB-Agent needs to use the model's tokenizer, we patch SGLang to provide extra endpoints. You could refer to the patched SGLang server document for launching a patched SGLang server.

πŸ“Š Evaluation

We provide an evaluation suite for QA datasets with an agent worker pool to parallelize sampling. You could refer to the evaluation document for more details.

πŸ—‚οΈ Data preprocessing

We provide a data preprocessing pipeline for QA datasets, which includes downloading, merging, quality assessment, and filtering. You could refer to the data preprocessing document for more details.

πŸ› οΈ Development

For developing and testing

# Install development dependencies
pixi install --env dev

# Enter the development environment
pixi shell -e dev

# Install pre-commit hooks
pre-commit install

# Running Tests
pytest ./tests

⚠️ Known Issues

  • If you encounter out-of-memory (OOM) errors while the SGLang server captures the CUDA graph, try launching the Ray cluster first and then submitting your training script, as described in the multi-node training section; this also works for single-node training.
  • If training hangs at the update_weight_from_tensors step in the SGLang server, please try restarting the process or adjusting the tensor parallel size. We will work with the SGLang and verl teams to locate and resolve this issue.

πŸ™ Acknowledgments

  • This project adapts code from verl and SGLang. We are grateful for their contributions to the open-source community.
  • Thanks to Open-Reasoner-Zero and DAPO for sharing their training techniques and insights.
  • Special thanks to the Pixi team for their excellent package management tool, which greatly simplifies our development process.
