
L0: Reinforcement Learning to Become General Agents


πŸ€– A scalable, end-to-end training pipeline for general-purpose agents


🎯 Overview

L0 is a scalable, end-to-end training pipeline for general-purpose agents. We provide:

  • An RL training framework for complex agentic environments, featuring a low-cost, extensible, and sandboxed concurrent agent worker pool.
  • A generalist agentic scaffold, the Notebook Agent (NB-Agent), which operates in a "code-as-action" fashion via a Read-Eval-Print Loop (REPL) backed by a Jupyter kernel.
  • A simple yet effective agentic multi-turn training recipe with agentic policy gradient and verifiable multi-step rewards.
  • A series of models trained with L0, including L0-4B (Qwen3), L0-7B (Qwen2.5), and L0-32B (Qwen2.5). We claim these models are capable of general agentic tasks; a deep-searcher case using L0-32B (Qwen2.5) is provided in the examples.


πŸ† Key Results

Using L0, we significantly improved model performance on multiple benchmarks:

(Figure: benchmark improvements)

  • "Model name + NB-Agent" means the model is evaluated directly with NB-Agent, without training.

L0 also achieves competitive performance compared with other works:

(Figure: comparison with other works)

  • All comparisons are with 7B models.

🧠 Tech Details

Training pipeline

(Figure: training pipeline)

Algorithm

  • Agentic Policy Gradient: Optimizes the policy gradient for agents by treating a complete "think-code" sequence as a single action
  • Verifiable Reward Function: Provides multi-faceted rewards for answer correctness, format compliance, and code execution (see the sketch after this list)
  • Strict On-Policy Training: Uses a pure on-policy approach with a KL-divergence penalty for stable learning
  • DAPO-Inspired Rejection Sampling: An advanced rejection-sampling strategy for improved policy optimization
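
The sketch below illustrates how such a multi-faceted reward can be combined into a single scalar. It is a minimal conceptual example: the outcome fields, weights, and function name are hypothetical and do not reflect L0's actual implementation.

from dataclasses import dataclass

@dataclass
class StepOutcome:
    answer_correct: bool   # final answer matches the verifiable ground truth
    format_ok: bool        # the "think-code" output follows the expected format
    code_executed: bool    # the generated code ran without errors in the REPL

def verifiable_reward(o: StepOutcome, w_answer=1.0, w_format=0.1, w_exec=0.1) -> float:
    """Combine answer correctness, format compliance, and code execution into one scalar."""
    return (w_answer * o.answer_correct
            + w_format * o.format_ok
            + w_exec * o.code_executed)

# A step with valid formatting and successful execution but a wrong answer scores 0.2.
print(verifiable_reward(StepOutcome(answer_correct=False, format_ok=True, code_executed=True)))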

Infrastructure

  • Decoupled Architecture: Separates CPU agent workers from a GPU inference server for independent scaling
  • Flexible Server-Client Architecture: Scalable agent task execution with FastAPI-based orchestration; see the trajectory sampler design document for more details (an illustrative sketch follows this list)
  • Lightweight Sandboxing: Uses Bubblewrap for secure, low-overhead, and parallel agent environments
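
As a rough illustration of this server-client split, the sketch below shows a CPU-side task server exposing a FastAPI endpoint that queues rollout tasks while a separately addressed GPU inference server handles generation. The endpoint and field names are hypothetical and are not the project's actual API.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutTask(BaseModel):
    task_id: str
    prompt: str
    inference_server_url: str  # GPU inference server that handles generation

@app.post("/tasks")
async def submit_task(task: RolloutTask):
    # A real server would dispatch the task to a sandboxed agent worker
    # (e.g., a Bubblewrap-isolated process) and track its trajectory.
    return {"task_id": task.task_id, "status": "queued"}

# Run with: uvicorn task_server:app --host 0.0.0.0 --port 8000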

NB-Agent

NB-Agent is a general-purpose agent scaffold that follows the "code-as-action" paradigm: it works in a REPL, interacting with its environment by generating code snippets that are executed in a Jupyter Notebook environment.

(Figure: NB-Agent architecture)
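
To make the code-as-action loop concrete, the snippet below sketches the underlying mechanism: executing a generated code cell in a Jupyter kernel with jupyter_client and reading back its output as the observation. This is only an illustration of the idea; NB-Agent's own scaffold adds prompting, memory, and tool handling on top.

from jupyter_client.manager import KernelManager

# Start a Python kernel that serves as the agent's persistent REPL environment.
km = KernelManager(kernel_name="python3")
km.start_kernel()
kc = km.client()
kc.start_channels()
kc.wait_for_ready()

# A "think-code" action boils down to executing a generated code snippet...
msg_id = kc.execute("result = 2 + 2\nprint(result)")

# ...and feeding the execution output back to the agent as its observation.
while True:
    msg = kc.get_iopub_msg(timeout=30)
    if msg["parent_header"].get("msg_id") != msg_id:
        continue
    if msg["header"]["msg_type"] == "stream":
        print(msg["content"]["text"], end="")  # -> 4
    elif (msg["header"]["msg_type"] == "status"
          and msg["content"]["execution_state"] == "idle"):
        break

kc.stop_channels()
km.shutdown_kernel()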

πŸš€ Quick Start

Installation:

# Clone this repository
git clone --recurse-submodules https://github.com/cmriat/l0.git
cd l0

We use Pixi for package management.

  • Pixi is a fast, reliable, and cross-platform package manager for Python and other languages. Visit pixi.sh to learn more and install it.
# Install Pixi if you haven't already
# curl -fsSL https://pixi.sh/install.sh | bash

# Bypass the default CUDA version check 
# export CONDA_OVERRIDE_CUDA=12.9 

# Enter the environment
pixi shell

Example: Training NB-Agent with REINFORCE++

This example demonstrates training an NB-Agent using the REINFORCE++ algorithm on QA datasets.

Prerequisites

1. Prepare dataset

python examples/data_preprocess/l0_qa.py --local_dir ./dataset

2. Start Agent Execution Manager Server

On one or more remote machines (CPU-only):

bash examples/start_task_server.sh

3. Configure Remote Server URLs

Edit the training script to specify remote server URLs:

# File: examples/nb_agent_training/train_qa_reinforcepp*.sh
# Line: actor_rollout_ref.traj_sampler.remote_exec_server_url=['http://IP1:8000', 'http://IP2:8000', 'http://IP3:8000']

4. API Keys Configuration

Some NB-Agent tools require API keys for external services. The following services are required for QA training:

  • Content Processing: Jina (required)
  • Search Services: At least one of Exa, Firecrawl, or Serper (Serper is recommended; if you use a search tool other than Serper, please modify TOOL_FACTORY_MAP in the tool specs factory).

Create a .env file in the root directory with the configuration for the dependent services. You can use the .env.example file as a template:

cp .env.example .env

Then, edit the .env file to add your API keys.
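
As an optional sanity check that the keys are picked up, you could load the .env file and verify the expected variables, for example with python-dotenv. The variable names below are assumptions for illustration only; use the names given in .env.example.

import os
from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # reads .env from the current working directory

# Hypothetical variable names -- check .env.example for the actual ones.
if not os.getenv("JINA_API_KEY"):
    raise SystemExit("JINA_API_KEY is missing (Jina is required for content processing)")

search_keys = ["SERPER_API_KEY", "EXA_API_KEY", "FIRECRAWL_API_KEY"]
if not any(os.getenv(k) for k in search_keys):
    raise SystemExit("Set at least one search key: " + ", ".join(search_keys))

print("API keys look configured.")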

5. Running in Container

Since L0 uses Bubblewrap to isolate the environments of agent rollouts, running it in a container requires giving the container the following:

  • security-opt: apparmor=unconfined
  • CAPABILITY: ALL

Or you could use --privileged to give the container all capabilities, which is not recommended for security reasons.

Training Scenarios

⚠️ Important: Start the remote task server before running any training scripts. The task server can be launched locally or on a remote CPU-only machine. Ensure that both the training script and the task server run under the same environment and workspace (this may require a distributed file system to share paths between instances).

Choose Your Training Configuration

Select the appropriate training script based on your hardware setup and model size requirements.

1. Single Node Training

For single-node setups with limited GPU resources:

  • 0.6B Model (Qwen3-0.6B)

    • Hardware Requirements: 1 node, 1 GPU
    • Command:
      bash examples/nb_agent_training/train_qa_reinforcepp_dynamic_0_6b.sh
  • 4B Model (Qwen3-4B)

    • Hardware Requirements: 1 node, 8 GPUs
    • Command:
      bash examples/nb_agent_training/train_qa_reinforcepp_dynamic_4b.sh

2. Multi-Node Training

For larger models requiring distributed training, you need to set up a Ray cluster first:

Step 2.1: Launch Ray Cluster

# On the head node:
ray start --head --dashboard-host=0.0.0.0

# On worker nodes:
ray start --address=YOUR_HEAD_NODE_IP:6379
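
Before submitting jobs, you can optionally verify from Python that the cluster sees all nodes and GPUs; this quick check is not part of the training scripts.

import ray

ray.init(address="auto")        # attach to the running cluster started above
print(ray.cluster_resources())  # should report the expected CPU and GPU totals
ray.shutdown()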

Step 2.2: Submit Training Jobs

  • 7B Model (Qwen2.5-7B-Instruct)

    • Hardware Requirements: 2 nodes, 16 GPUs
    • Command:
      RAY_ADDRESS=YOUR_HEAD_NODE_IP:8265 ray job submit examples/nb_agent_training/train_qa_reinforcepp_dynamic_7b.sh
  • 32B Model (Qwen2.5-32B-Instruct)

    • Hardware Requirements: 8 nodes, 64 GPUs
    • Command:
      RAY_ADDRESS=YOUR_HEAD_NODE_IP:8265 ray job submit examples/nb_agent_training/train_qa_reinforcepp_dynamic_32b.sh

Using NB-Agent Scaffold Alone

  • For ease of use, we have packaged NB-Agent. You can install and use it separately via pixi install nbagent.
  • In our tests, existing frontier models like Gemini and Claude have demonstrated powerful capabilities under NB-Agent without training.
  • You could refer to the NB-Agent Example for a deep searcher example using NB-Agent.

Model conversion

We directly adapt the conversion scripts from verl; see examples/model_converter.py. Please refer to the verl model converter document for usage after training.

Serving your own model

Since NB-Agent needs to use the model's tokenizer, we patch SGLang to provide extra endpoints. You could refer to the patched SGLang server document for launching a patched SGLang server.

πŸ“Š Evaluation

We provide an evaluation suite for QA datasets with an agent worker pool to parallelize sampling. You could refer to the evaluation document for more details.

πŸ—‚οΈ Data preprocessing

We provide a data preprocessing pipeline for QA datasets, which includes downloading, merging, quality assessment, and filtering. You could refer to the data preprocessing document for more details.

πŸ› οΈ Development

For developing and testing

# Install development dependencies
pixi install --env dev

# Enter the development environment
pixi shell -e dev

# Install pre-commit hooks
pre-commit install

# Running Tests
pytest ./tests

⚠️ Known Issues

  • If you encounter out-of-memory (OOM) errors while the SGLang server captures the CUDA graph, try launching the Ray cluster first and then submitting your training script, as described in the multi-node training section; this also works for single-node training.
  • If training hangs at the update_weight_from_tensors step in the SGLang server, please try restarting the process or adjusting the tensor parallel size. We will work with the SGLang and verl teams to locate and resolve this issue.

πŸ™ Acknowledgments

  • This project adapts code from verl and SGLang. We are grateful for their contributions to the open-source community.
  • Thanks to Open-Reasoner-Zero and DAPO for sharing their training techniques and insights.
  • Special thanks to the Pixi team for their excellent package management tool, which greatly simplifies our development process.
