Jupyter Agent is an open-source data science agent that lives inside your Jupyter notebook. It can:
- Read notebook + dataset context
- Execute Python code (
pandas
,numpy
,matplotlib
, …) - Produce step-by-step reasoning traces with intermediate computations
👉 Think of it as Cursor, but built natively for data analysis workflows.
📖 Learn more in our blog post or try the live demo.
We release:
-
Dataset: Jupyter Agent Dataset (51k synthetic notebooks, ~0.2B tokens)
-
Models:
-
Pipeline: Code to generate training data from Kaggle notebooks + fine-tuning scripts
- Jupyter notebooks are the de facto environment for scientists and analysts.
- We built a dataset + training pipeline that helps small models become strong data agents.
- On the DABStep benchmark, our tuned 4B model reaches SOTA performance for its size on realistic data science tasks.
Our pipeline processes the Meta Kaggle Notebooks dataset (2TB) into training-ready data:
- Deduplicate notebooks (~90% duplicates)
- Fetch linked datasets for executability
- Score notebooks for educational quality
- Filter irrelevant content
- Generate dataset-grounded QA pairs
- Produce reasoning + execution traces
- Curate final dataset (~2B tokens)
Clone the repo:
git clone https://github.com/huggingface/jupyter-agent.git
cd jupyter-agent
- To generate the dataset, check the
data/
folder. - To fine-tune the model, check the
finetuning/
folder.
from datasets import load_dataset
ds = load_dataset("data-agents/jupyter-agent-dataset", split="non-thinking")
from transformers import AutoModelForCausalLM, AutoTokenizer
model = "data-agents/jupyter-agent-qwen3-4b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)
model = AutoModelForCausalLM.from_pretrained(model, torch_dtype="auto", device_map="auto")
- Base Qwen3-4B-Instruct (easy split): 38.7%
- With scaffolding: 52.8%
- After fine-tuning on our dataset: 75%
Our fine-tuned model is the current SOTA small-model agent on DABStep.
- Blog post – full story + insights
- Dataset on Hub
- Models on Hub
- DABStep Benchmark
@misc{jupyteragentdataset,
title={Jupyter Agent Dataset},
author={Colle, Baptiste and Yukhymenko, Hanna and von Werra, Leandro},
year={2025}
}