Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call Reinforcement Learning from Internal Feedback (RLIF).
Reinforcement Learning from Internal Feedback (RLIF) is a training framework where language models learn without any external rewards, gold labels, or verifiers. Instead, models improve by optimizing intrinsic signals—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.
Intuitor instantiates RLIF by using self-certainty—a model's confidence measured via KL divergence to uniform—as an intrinsic reward in the GRPO policy optimization algorithm.
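To make the reward concrete, here is a minimal sketch (not the repository's exact code) of how sequence-level self-certainty can be computed from next-token logits, taking the average KL divergence between the uniform distribution over the vocabulary and the model's predicted distribution:

```python
import math
import torch
import torch.nn.functional as F

def self_certainty_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Sequence-level self-certainty: the average KL(U || p) over response
    tokens, where U is uniform over the vocabulary and p is the model's
    next-token distribution. Higher values indicate higher confidence.

    logits: tensor of shape (seq_len, vocab_size) for the generated tokens.
    """
    log_probs = F.log_softmax(logits, dim=-1)        # log p_j at each position
    vocab_size = logits.size(-1)
    # KL(U || p) = sum_j (1/V) * [log(1/V) - log p_j] = -log V - mean_j log p_j
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_token.mean()                       # averaged over the sequence
```

In training, this scalar stands in for the external reward in GRPO: within each group of sampled responses, responses with higher self-certainty receive larger group-relative advantages.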
We have released model checkpoints trained on the MATH dataset for one epoch, for both Intuitor and the GRPO baseline. You're welcome to try out the models and evaluate their performance!
| Model Name | Size | Method | Hugging Face Link |
|---|---|---|---|
| sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH | 1.5B | Intuitor | View Model |
| sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH | 3B | Intuitor | View Model |
| sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH | 7B | Intuitor | View Model |
| sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH | 14B | Intuitor | View Model |
| sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH | 1.5B | GRPO | View Model |
| sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH | 3B | GRPO | View Model |
| sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH | 7B | GRPO | View Model |
| sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH | 14B | GRPO | View Model |
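To quickly try a checkpoint, the snippet below shows one way to load a released model with Hugging Face `transformers` and sample a response; the repo id is taken from the table above, and the generation settings are only illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH"  # any repo id from the table
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Solve: if 3x + 5 = 20, what is x?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```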
This repository contains two self-contained implementations of Intuitor:
- `open-r1-intuitor`: Based on Hugging Face's Open-R1, which reproduces DeepSeek-R1 in a fully open-source fashion. Built on commit `ebd5913`.
- `verl-intuitor`: Based on VERL, a high-performance RL training library designed for LLMs. Built on commit `40dcabe`.

Both variants are licensed under Apache 2.0 and include their respective `LICENSE` and `NOTICE` files.
First, `cd` into the desired variant folder and set up the environment as specified in that variant's `README.md` file. Then follow the instructions below to run the example training script.
Modify the `WANDB_KEY` in the `run_intuitor.sh` script to your own WANDB key, then run the following command:
bash run_intuitor.sh
To facilitate future research, we have enabled combining self-certainty with other reward signals. If reward weights are not set to 0, self-certainty and other rewards will first be normalized separately, then added together.
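As a schematic of that combination (the function and variable names are illustrative, and z-score normalization across the batch is an assumption, not necessarily the exact normalization used in the code):

```python
import numpy as np

def combine_rewards(self_certainty, other_reward, w_self=1.0, w_other=1.0, eps=1e-8):
    """Normalize each reward signal separately across the batch, then add the
    weighted signals together. Setting a weight to 0 drops that signal."""
    sc = np.asarray(self_certainty, dtype=np.float64)
    other = np.asarray(other_reward, dtype=np.float64)
    sc_norm = (sc - sc.mean()) / (sc.std() + eps)
    other_norm = (other - other.mean()) / (other.std() + eps)
    return w_self * sc_norm + w_other * other_norm
```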
First, download the code corpora and prepare the dataset using the following Python script:
python scripts/code_process.py
Modify the `WANDB_KEY` in the `run_intuitor_code.sh` script to your own WANDB key, then run the following command:
bash run_intuitor_code.sh
First, download the MATH dataset and prepare it using the following Python script:
python examples/data_preprocess/math_dataset_ours.py --model Qwen2.5-3B
Then, run the following command to start the training (modify the `WANDB_KEY` in the `math_intuitor.sh` script to your own WANDB key):
bash math_intuitor.sh
Note: The only heuristic in Intuitor is the prompt used to query the model. As a result, performance can sometimes be sensitive to prompt design. If the model does not appear to learn effectively, we recommend trying alternative prompts or using the original prompt provided in our setup.
Intuitor achieves:
- Comparable performance to GRPO on in-domain math reasoning tasks (GSM8K, MATH500)
- Superior generalization to code generation (LiveCodeBench, CRUXEval)
- Improved instruction following, without needing any gold labels or verifiable test suites
For detailed results, see Table 1 in the paper.
This project builds upon the following open-source repositories:
- open-r1 (License: Apache License 2.0)
- verl (License: Apache License 2.0)
If you use Intuitor in your research, please cite our paper:
@article{zhao2025learning,
title={Learning to Reason without External Rewards},
author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
journal={arXiv preprint arXiv:2505.19590},
year={2025}
}