L0 is a scalable, end-to-end training pipeline for general-purpose agents. It provides:
- An RL training framework for complex agentic environments, featuring a low-cost, extensible, and sandboxed concurrent agent worker pool.
- A generalist agentic scaffold, Notebook Agent (NB-Agent), which operates in a "code-as-action" fashion via a Read-Eval-Print Loop (REPL) backed by a Jupyter kernel.
- A simple yet effective agentic multi-turn training recipe with agentic policy gradient and verifiable multi-step rewards.
- A series of models trained with L0, including L0-4B (Qwen3), L0-7B (Qwen2.5), and L0-32B (Qwen2.5). We claim that these models are capable of general agentic tasks; a deep searcher case using L0-32B (Qwen2.5) is provided in the examples.
- L0: Reinforcement Learning to Become General Agents
Using L0, we significantly improved the model's performance on multiple benchmarks:
- "Model name + NB-Agent" indicates evaluating the model directly with NB-Agent, without training.
L0 also achieves competitive performance compared with other works:
- All comparisons are against 7B models.
- Agentic Policy Gradient: Optimizes the policy gradient for agents by treating a complete "think-code" sequence as a single action (see the sketch after this list)
- Verifiable Reward Function: Provides multi-faceted rewards for answer correctness, format compliance, and code execution
- Strict On-Policy Training: Uses a pure on-policy approach with a KL-divergence penalty for stable learning
- DAPO-Inspired Rejection Sampling: Advanced rejection sampling strategy for improved policy optimization
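As a rough illustration of the first two points, here is a minimal sketch of a multi-faceted verifiable reward and of treating a whole "think-code" turn as a single action in the policy gradient. The function names, reward weights, and tensor shapes below are assumptions for illustration, not L0's actual implementation.

```python
import torch

# Illustrative sketch only: names, weights, and shapes are assumptions,
# not L0's actual implementation.

def verifiable_reward(answer: str, reference: str,
                      format_ok: bool, code_ran_ok: bool) -> float:
    """Multi-faceted reward: answer correctness, format compliance, code execution."""
    correctness = 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0
    format_bonus = 0.1 if format_ok else 0.0       # e.g. final answer emitted in the required format
    execution_bonus = 0.1 if code_ran_ok else 0.0  # e.g. the last code cell ran without errors
    return correctness + format_bonus + execution_bonus


def agentic_pg_loss(token_logps: torch.Tensor,  # (T,) log-probs of generated tokens
                    turn_mask: torch.Tensor,    # (num_turns, T) 1 where a token belongs to a turn
                    advantages: torch.Tensor,   # (num_turns,) per-turn advantage from the reward
                    ) -> torch.Tensor:
    """Treat each complete "think-code" turn as one action: sum its token log-probs."""
    turn_logps = (turn_mask * token_logps).sum(dim=-1)  # one log-prob per action
    return -(advantages * turn_logps).mean()            # REINFORCE-style objective
```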
- Decoupled Architecture: Separates CPU agent workers from a GPU inference server for independent scaling
- Flexible Server-Client Architecture: Scalable agent task execution with FastAPI-based orchestration (a sketch follows this list); refer to the trajectory sampler design document for more details
- Lightweight Sandboxing: Uses Bubblewrap for secure, low-overhead, and parallel agent environments
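To make the server-client split concrete, the following is a hypothetical, minimal FastAPI task server; the endpoint name and payload fields are illustrative and do not reflect the actual trajectory sampler API.

```python
# Hypothetical minimal task server; the endpoint and payload fields are
# illustrative only and do not reflect L0's actual trajectory sampler API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutRequest(BaseModel):
    task_id: str
    prompt: str
    max_turns: int = 8

@app.post("/run_task")
def run_task(req: RolloutRequest) -> dict:
    # A real worker would launch a sandboxed NB-Agent rollout (e.g. inside
    # Bubblewrap) and query the remote GPU inference server for completions.
    trajectory = [{"turn": 0, "observation": req.prompt}]
    return {"task_id": req.task_id, "trajectory": trajectory}

# The trainer then POSTs tasks to any number of such CPU-only servers,
# e.g. the URLs configured in remote_exec_server_url.
```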
NB-Agent is designed to be a general-purpose agent following the "Code-as-Action" paradigm. It works in a REPL, allowing the agent to interact with its environment by generating code snippets that are executed in a Jupyter Notebook environment.
- You can refer to the NB-Agent Documentation for more details on the design and architecture of NB-Agent.
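As a rough picture of the code-as-action loop (not NB-Agent's actual implementation), the sketch below executes a generated code snippet in a Jupyter kernel via jupyter_client and returns its textual output as the next observation.

```python
# Illustrative only: NB-Agent's real interface may differ.
from jupyter_client.manager import start_new_kernel

def run_code_action(kc, code: str, timeout: float = 30.0) -> str:
    """Execute one generated code snippet and collect its text output."""
    kc.execute(code)
    outputs = []
    while True:
        msg = kc.get_iopub_msg(timeout=timeout)
        msg_type, content = msg["header"]["msg_type"], msg["content"]
        if msg_type == "stream":
            outputs.append(content["text"])
        elif msg_type == "execute_result":
            outputs.append(content["data"].get("text/plain", ""))
        elif msg_type == "error":
            outputs.append("\n".join(content["traceback"]))
        elif msg_type == "status" and content["execution_state"] == "idle":
            break  # the kernel has finished this cell
    return "".join(outputs)

km, kc = start_new_kernel()  # one persistent kernel per agent episode
observation = run_code_action(kc, "print(2 + 2)")  # fed back into the next prompt
km.shutdown_kernel()
```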
# Clone this repository
git clone --recurse-submodules https://github.com/cmriat/l0.git
cd l0
We use Pixi for package management.
- Pixi is a fast, reliable, and cross-platform package manager for Python and other languages. Visit pixi.sh to learn more and install it.
# Install Pixi if you haven't already
# curl -fsSL https://pixi.sh/install.sh | bash
# Bypass the default CUDA version check
# export CONDA_OVERRIDE_CUDA=12.9
# Enter the environment
pixi shell
This example demonstrates training an NB-Agent using the REINFORCE++ algorithm on QA datasets.
1. Prepare dataset
python examples/data_preprocess/l0_qa.py --local_dir ./dataset
2. Start Agent Execution Manager Server
On the remote machines (one or more; they only consume CPUs):
bash examples/start_task_server.sh
3. Configure Remote Server URLs
Edit the training script to specify remote server URLs:
# File: examples/nb_agent_training/train_qa_reinforcepp*.sh
# Line: actor_rollout_ref.traj_sampler.remote_exec_server_url=['http://IP1:8000', 'http://IP2:8000', 'http://IP3:8000']
4. API Keys Configuration
Some NB-Agent tools require API keys for external services. The following services are required for QA training:
- Content Processing: Jina (required)
- Search Services: at least one of Exa, Firecrawl, or Serper (Serper is the recommended one; if you use a search tool other than Serper, please modify `TOOL_FACTORY_MAP` in the tool specs factory accordingly).
Create a `.env` file in the root directory with the configurations of the dependent services. You can use the `.env.example` file as a template:
cp .env.example .env
Then, edit the `.env` file to add your API keys.
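For reference, a minimal `.env` might look like the snippet below; the variable names here are only illustrative, so check `.env.example` for the exact names the tools expect.

```bash
# Illustrative only; see .env.example for the exact variable names.
# Content processing (required):
JINA_API_KEY=your-jina-key
# Search (at least one; Serper is recommended):
SERPER_API_KEY=your-serper-key
# EXA_API_KEY=...
# FIRECRAWL_API_KEY=...
```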
5. Running in a Container
Since L0 uses Bubblewrap to isolate the environments of agent rollouts, if you want to run it in a container, you need to grant the container the following capabilities:
- `security-opt`: `apparmor=unconfined`
- `CAPABILITY`: `ALL`
Alternatively, you can use `--privileged` to give the container all capabilities, but this is not recommended for security reasons.
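For example, with Docker the flags above could be passed as follows (illustrative; the image name is a placeholder, and the flags should be adapted to your container runtime):

```bash
# Illustrative Docker invocation; replace the image name with your own.
docker run -it \
  --security-opt apparmor=unconfined \
  --cap-add=ALL \
  your-l0-image:latest bash
```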
Choose Your Training Configuration
Select the appropriate training script based on your hardware setup and model size requirements.
1. Single Node Training
For single-node setups with limited GPU resources:
- 0.6B Model (Qwen3-0.6B)
  - Hardware Requirements: 1 node, 1 GPU
  - Command:
    bash examples/nb_agent_training/train_qa_reinforcepp_dynamic_0_6b.sh
- 4B Model (Qwen3-4B)
  - Hardware Requirements: 1 node, 8 GPUs
  - Command:
    bash examples/nb_agent_training/train_qa_reinforcepp_dynamic_4b.sh
2. Multi-Node Training
For larger models requiring distributed training, you need to set up a Ray cluster first:
Step 2.1: Launch Ray Cluster
# On the head node:
ray start --head --dashboard-host=0.0.0.0
# On worker nodes:
ray start --address=YOUR_HEAD_NODE_IP:6379
Step 2.2: Submit Training Jobs
- 7B Model (Qwen2.5-7B-Instruct)
  - Hardware Requirements: 2 nodes, 16 GPUs
  - Command:
    RAY_ADDRESS=YOUR_HEAD_NODE_IP:8265 ray job submit examples/nb_agent_training/train_qa_reinforcepp_dynamic_7b.sh
- 32B Model (Qwen2.5-32B-Instruct)
  - Hardware Requirements: 8 nodes, 64 GPUs
  - Command:
    RAY_ADDRESS=YOUR_HEAD_NODE_IP:8265 ray job submit examples/nb_agent_training/train_qa_reinforcepp_dynamic_32b.sh
- For ease of use, we have packaged NB-Agent. You can install and use it separately via `pixi install nbagent`.
- In our tests, existing frontier models like Gemini and Claude have demonstrated powerful capabilities under NB-Agent without training.
- You can refer to the NB-Agent Example for a deep searcher example using NB-Agent.
We directly adapt the model conversion script from verl; it is located at `examples/model_converter.py`. Please refer to the verl model converter document for usage after training.
Since NB-Agent needs to use the model's tokenizer, we patch SGLang to provide extra endpoints. You can refer to the patched SGLang server document for instructions on launching a patched SGLang server.
We provide an evaluation suite for QA datasets with an agent worker pool to parallelize sampling. You can refer to the evaluation document for more details.
We provide a data preprocessing pipeline for QA datasets, which includes downloading, merging, quality assessment, and filtering. You can refer to the data preprocessing document for more details.
# Install development dependencies
pixi install --env dev
# Enter the development environment
pixi shell -e dev
# Install pre-commit hooks
pre-commit install
# Running Tests
pytest ./tests
- If you encounter Out of Memory (OOM) errors while the SGLang server captures the CUDA graph, try launching the Ray cluster first and then submitting your training script, as described in the multi-node training section. This also works for single-node training.
- If training hangs at the update_weight_from_tensors step in the SGLang server, please try restarting the process or adjusting the tensor parallel size. We will work with the SGLang and verl teams to locate and resolve this issue.
- This project adapts code from verl and SGLang. We are grateful for their contributions to the open-source community.
- Thanks to Open-Reasoner-Zero and DAPO for sharing their training techniques and insights.
- Special thanks to the Pixi team for their excellent package management tool, which greatly simplifies our development process.