📄 arXiv | 🤗 Hugging Face
| Resource | Link |
|---|---|
| 🤗 MTSQL-R1 (4B) | MTSQL-R1 (4B) (to be released after internal review) |
| 🤗 MTSQL-R1 (1.7B) | MTSQL-R1 (1.7B) (to be released after internal review) |
| 🤗 Dataset | CoSQL-Long-Horizon-SFT-RL-Data (to be released after internal review) |
| 🤗 Dataset | SParC-Long-Horizon-SFT-RL-Data (to be released after internal review) |
| Code for SFT | To be released after internal review |
| Code for RL | To be released after internal review |
- 🌟 Highlights
- 📖 Introduction
- ⚙️ Configuration
- 🔄 Training Framework
- 📈 Training Dynamics
- 📊 Experiment Results
- 🙏 Acknowledgements
- 📫 Contact
 
Our approach enables:
- Environment-based verification: The model interacts dynamically with two components, (i) a database for execution feedback and (ii) a long-term dialogue memory for explicit coherence checking, to verify intermediate SQL outputs.
- Self-correction: Based on verification feedback, the model iteratively refines its generated SQL queries to achieve consistent, executable outputs across multiple turns.
- Autonomous end-to-end learning: The model learns the full action loop (Propose, Execute, Verify, Self-Correct) to generate better SQL; a minimal sketch of this loop follows the list.
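
The interaction pattern can be pictured with the minimal sketch below. It assumes a SQLite database and treats the model's proposal step and the memory coherence check as injected callables (`propose_sql`, `verify_coherence`); these names are illustrative placeholders, not the released implementation.

```python
# Minimal, illustrative sketch of the Propose -> Execute -> Verify -> Self-Correct loop.
# All names (propose_sql, verify_coherence, agent_turn) are hypothetical placeholders;
# the released code may differ.
import sqlite3
from typing import Optional

def execute_sql(db_path: str, sql: str) -> tuple[bool, str]:
    """Run the candidate SQL against the database and return (success, feedback)."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
        if not rows:
            return False, "execution returned no rows"  # the null-return case
        return True, f"returned {len(rows)} rows"
    except sqlite3.Error as err:
        return False, f"execution error: {err}"

def agent_turn(question: str, db_path: str, memory: list[str],
               propose_sql, verify_coherence, max_retries: int = 3) -> Optional[str]:
    """One dialogue turn: propose, execute, verify against dialogue memory, self-correct."""
    feedback = ""
    for _ in range(max_retries):
        sql = propose_sql(question, memory, feedback)      # Propose
        ok_exec, exec_msg = execute_sql(db_path, sql)      # Execute (database feedback)
        ok_mem, mem_msg = verify_coherence(sql, memory)    # Verify (dialogue-memory check)
        if ok_exec and ok_mem:
            memory.append(sql)                             # keep for later turns
            return sql
        feedback = f"{exec_msg}; {mem_msg}"                # Self-Correct on the next attempt
    return None
```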
 
- verl == 0.4.1
- LLaMA-Factory == 0.9.3
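
As a quick sanity check, the pinned versions can be verified at runtime. The distribution names `verl` and `llamafactory` below are assumed pip package names; adjust them if your environment installs these projects under different names.

```python
# Sanity-check the pinned dependency versions.
# The distribution names ("verl", "llamafactory") are assumptions about the pip
# package names; change them if your installation differs.
from importlib.metadata import version, PackageNotFoundError

EXPECTED = {"verl": "0.4.1", "llamafactory": "0.9.3"}

for pkg, want in EXPECTED.items():
    try:
        got = version(pkg)
        status = "OK" if got == want else f"expected {want}"
        print(f"{pkg}: {got} ({status})")
    except PackageNotFoundError:
        print(f"{pkg}: not installed (expected {want})")
```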
- Step 1: Random sampling with high temperature to generate natural reasoning trajectories
- Step 2: Difficulty-aware rejection sampling
- Step 3: SFT on tool-integrated multi-turn trajectories with loss masking (see the sketch after this list)
- Step 4: Update the dataset and model, then repeat
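
The key detail in Step 3 is loss masking over tool-integrated trajectories: only tokens the assistant generates (reasoning, function calls, SQL) contribute to the loss, while user turns and tool/database observations are masked with the usual ignore index. The sketch below is a generic illustration under these assumptions, not the released preprocessing code.

```python
# Illustrative loss masking for tool-integrated multi-turn trajectories.
# Assumption: a trajectory is a list of (role, token_ids) segments; only assistant
# tokens are trained on, everything else gets the ignore index (-100) so
# cross-entropy skips those positions.
IGNORE_INDEX = -100
TRAINABLE_ROLES = {"assistant"}  # user turns and tool/database observations are masked

def build_inputs_and_labels(segments: list[tuple[str, list[int]]]):
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role in TRAINABLE_ROLES:
            labels.extend(token_ids)                        # learn to generate these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # context only, no loss
    return input_ids, labels

# Example: user question, assistant SQL proposal, tool (execution) feedback, assistant fix.
trajectory = [
    ("user", [101, 102, 103]),
    ("assistant", [201, 202, 203, 204]),
    ("tool", [301, 302]),
    ("assistant", [205, 206]),
]
ids, lbls = build_inputs_and_labels(trajectory)
assert len(ids) == len(lbls)
```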
 
- Step 1: Curriculum data partitioning by difficulty
- Step 2: Outcome and process reward design (see the sketch after this list)
- Step 3: Multi-turn RL with loss masking
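
One plausible way to combine the two signals from Step 2 is a weighted sum of an execution-match outcome reward and small process bonuses for well-formed tool calls and successful self-correction. The specific terms and weights below are illustrative assumptions, not the paper's exact reward formulation.

```python
# Illustrative outcome + process reward for one multi-turn rollout.
# The bonus terms and weights are assumptions; the actual reward shaping may differ.
from dataclasses import dataclass

@dataclass
class RolloutStats:
    execution_match: bool    # final SQL matches the gold execution result
    valid_tool_calls: int    # well-formed Propose/Execute/Verify calls
    total_tool_calls: int
    self_corrected: bool     # recovered from an execution or coherence failure

def compute_reward(stats: RolloutStats,
                   w_outcome: float = 1.0,
                   w_format: float = 0.2,
                   w_correct: float = 0.2) -> float:
    outcome = 1.0 if stats.execution_match else 0.0
    fmt = (stats.valid_tool_calls / stats.total_tool_calls
           if stats.total_tool_calls else 0.0)
    correction = 1.0 if stats.self_corrected else 0.0
    return w_outcome * outcome + w_format * fmt + w_correct * correction

# Example: a rollout that fixed a failed first attempt and matched execution.
print(compute_reward(RolloutStats(True, 5, 6, True)))
```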
 
The dynamics of reward score and response length during training:
The dynamics of test score across different training checkpoints:
Key Findings and Takeaways:
- Warm-start SFT and RL both provide performance gains.
- Small LLMs (1.7B/4B) struggle to follow long-horizon function-calling instructions.
- Conventional SFT attains good Exact Match but weaker logical consistency (Execution Match), while long-horizon training achieves better Execution Match.
- Long-horizon reasoning yields larger gains on multi-turn dialogues and complex questions.
- Long-horizon RL substantially improves out-of-domain performance.
- Dense process rewards help the model learn from harder examples, further boosting performance compared with sparse, outcome-only rewards.
- Stronger function calling, verification, and self-correction correlate with better SQL performance.
- With long-horizon actions and training, the agent learns to resolve execution failures (even null-return cases, which we call the aha moment of Text-to-SQL) and coherence errors.
 
The evolution of different long-horizon abilities and the related Execution Match performance for the 4B and 1.7B models:
We would like to express our gratitude to the open-source community for their valuable contributions:
- verl: https://github.com/volcengine/verl
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory
- SGLang: https://github.com/sgl-project/sglang
- vLLM: https://github.com/vllm-project/vllm
- DB-GPT-Hub: https://github.com/eosphoros-ai/DB-GPT-Hub
- CoSQL: https://github.com/taoyds/cosql
- SParC: https://github.com/taoyds/sparc
- Search-R1: https://github.com/PeterGriffinJin/Search-R1
 
...and many other open-source projects.
For any issues or discussions, please contact [email protected]. Thanks!