📄 arXiv | 🤗 Hugging Face
| Resource | Link |
|---|---|
| 🤗 MTSQL-R1 (4B) | MTSQL-R1 (4B) (will be released after internal review) |
| 🤗 MTSQL-R1 (1.7B) | MTSQL-R1 (1.7B) (will be released after internal review) |
| 🤗 Dataset | CoSQL-Long-Horizon-SFT-RL-Data (will be released after internal review) |
| 🤗 Dataset | SParC-Long-Horizon-SFT-RL-Data (will be released after internal review) |
| Code for SFT | Will be released after internal review |
| Code for RL | Will be released after internal review |
- 🌟 Highlights
- 📖 Introduction
- ⚙️ Configuration
- 🔄 Training Framework
- 📈 Training Dynamics
- 📊 Experiment Results
- 🙏 Acknowledgements
- 📫 Contact
Our approach enables:
- Environment-based verification: The model interacts dynamically with two components: (i) a database for execution feedback and (ii) a long-term dialogue memory for explicit coherence checking to verify intermediate SQL outputs.
- Self-correction: Based on verification feedback, the model iteratively refines its generated SQL queries to achieve consistent, executable outputs across multiple turns.
- Autonomous end-to-end learning: The model learns the full action loop (Propose, Execute, Verify, and Self-Correct) end to end to generate better SQL (a minimal loop sketch follows below).
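The loop below is a minimal sketch of how these pieces fit together. The helper names (`llm_propose`, `db_execute`, `memory_check`) are hypothetical placeholders for the model call, the database execution feedback, and the dialogue-memory coherence check; this is an illustration, not the released implementation.

```python
# Hypothetical sketch of the Propose -> Execute -> Verify -> Self-Correct loop.
from typing import Callable, List, Tuple

def solve_turn(
    question: str,
    history: List[str],  # long-term dialogue memory (SQL from earlier turns)
    llm_propose: Callable[[str, List[str], str], str],           # (question, history, feedback) -> SQL
    db_execute: Callable[[str], Tuple[bool, str]],               # SQL -> (executable?, feedback)
    memory_check: Callable[[str, List[str]], Tuple[bool, str]],  # SQL vs. history -> (coherent?, feedback)
    max_rounds: int = 4,
) -> str:
    feedback, sql = "", ""
    for _ in range(max_rounds):
        sql = llm_propose(question, history, feedback)   # Propose a candidate SQL query
        ok_exec, exec_fb = db_execute(sql)               # Execute against the database
        ok_mem, mem_fb = memory_check(sql, history)      # Verify coherence with dialogue memory
        if ok_exec and ok_mem:
            break                                        # verified: stop refining
        feedback = exec_fb + "\n" + mem_fb               # Self-Correct using the combined feedback
    history.append(sql)                                  # update the dialogue memory
    return sql
```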
- verl == 0.4.1
- LLaMA-Factory == 0.9.3
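A quick way to confirm the pinned versions, assuming the packages are installed under the distribution names `verl` and `llamafactory` (the names may differ depending on how you installed them, e.g. from source):

```python
# Sanity-check the pinned dependency versions (distribution names are assumed).
from importlib.metadata import PackageNotFoundError, version

for pkg, expected in [("verl", "0.4.1"), ("llamafactory", "0.9.3")]:
    try:
        installed = version(pkg)
        status = "OK" if installed == expected else f"mismatch (found {installed})"
    except PackageNotFoundError:
        status = "not installed"
    print(f"{pkg} == {expected}: {status}")
```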
- Step 1: Random sampling with high temperature to generate natural reasoning trajectories
- Step 2: Difficulty-aware rejection sampling (see the sketch after this list)
- Step 3: SFT of the model on tool-integrated multi-turn trajectories with loss masking
- Step 4: Update the dataset and the model, then repeat
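The snippet below is a hedged sketch of Steps 1-2: sample several high-temperature rollouts per question, keep only trajectories whose final SQL reproduces the gold execution result, and cap how many survive per difficulty level so easy questions do not dominate the SFT mix. The `rollout` and `execute` callables and the cap values are assumptions for illustration, not the released pipeline.

```python
# Hedged sketch of Steps 1-2: high-temperature sampling + difficulty-aware rejection.
import random
from typing import Callable, Dict, List

def reject_sample(
    question: str,
    gold_sql: str,
    rollout: Callable[[str, float], Dict],  # hypothetical: returns {"final_sql": ..., "messages": [...]}
    execute: Callable[[str], List],         # hypothetical: runs SQL on the dialogue's database
    difficulty: str = "medium",             # "easy" | "medium" | "hard"
    n_samples: int = 8,
    temperature: float = 1.0,
) -> List[Dict]:
    gold_rows = execute(gold_sql)
    # Step 1: sample diverse reasoning trajectories with a high temperature.
    candidates = [rollout(question, temperature) for _ in range(n_samples)]
    # Step 2: reject trajectories whose final SQL does not match the gold execution result.
    kept = [t for t in candidates if execute(t["final_sql"]) == gold_rows]
    # Difficulty-aware cap (assumed values): keep more surviving samples for harder questions.
    cap = {"easy": 1, "medium": 2, "hard": 4}[difficulty]
    return random.sample(kept, min(cap, len(kept)))
```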
- Step 1: Curriculum data partitioning by difficulty
- Step 2: Outcome and process reward design (a hedged reward sketch follows after this list)
- Step 3: Multi-turn RL with loss masking
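As a rough illustration of Step 2, the function below combines a sparse outcome reward (execution match of the final SQL) with a small dense process reward for well-formed tool calls that pass verification. The trajectory field names and the 1.0 / 0.1 weights are assumptions for illustration, not the exact reward used in training.

```python
# Hedged sketch of an outcome + process reward over a multi-turn trajectory.
from typing import Callable, Dict, List

def trajectory_reward(
    traj: Dict,                      # {"final_sql": str, "steps": [{"tool_call_ok": bool, "verified": bool}, ...]}
    gold_rows: List,                 # execution result of the gold SQL
    execute: Callable[[str], List],  # hypothetical: SQL -> result rows
    outcome_weight: float = 1.0,     # assumed weight for the sparse outcome term
    process_weight: float = 0.1,     # assumed weight for the dense process term
) -> float:
    # Outcome reward: 1 if the final SQL reproduces the gold execution result, else 0.
    try:
        outcome = float(execute(traj["final_sql"]) == gold_rows)
    except Exception:
        outcome = 0.0  # unexecutable SQL earns no outcome reward
    # Process reward: fraction of steps whose tool call is well-formed and passes verification.
    steps = traj["steps"]
    process = sum(bool(s["tool_call_ok"]) and bool(s["verified"]) for s in steps) / max(len(steps), 1)
    return outcome_weight * outcome + process_weight * process
```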
The dynamics of reward score and response length during training:
The dynamics of test scores across different training checkpoints:
Key Findings and Takeaways:
- Warm-start SFT and RL both provide gains in performance.
- Small LLMs (1.7B/4B) struggle to follow long-horizon function-calling instructions.
- Conventional SFT attains good Exact Match but exhibits weaker logical consistency (Execution Match), while long-horizon training achieves better Execution Match.
- Long-horizon reasoning yields larger gains on multi-turn dialogues and complex questions.
- Long-horizon RL substantially improves out-of-domain performance.
- A dense process reward helps the model learn from harder examples, further boosting performance compared with sparse, outcome-only rewards.
- Stronger function calling, verification, and self-correction correlate with better SQL performance.
- With long-horizon actions and training, the agent learns to resolve execution failures (even null-return cases, which we call the aha moment in Text-to-SQL) and coherence errors.
The evolution of different long-horizon abilities and the related Execution Match performance for the 4B and 1.7B models:
We would like to express our gratitude to the open-source community for their valuable contributions:
- Verl: https://github.com/volcengine/verl
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory
- SGLang: https://github.com/sgl-project/sglang
- vLLM: https://github.com/vllm-project/vllm
- DB-GPT-Hub: https://github.com/eosphoros-ai/DB-GPT-Hub
- CoSQL: https://github.com/taoyds/cosql
- SParC: https://github.com/taoyds/sparc
- Search-R1: https://github.com/PeterGriffinJin/Search-R1
- ... and more
For any issues or discussions, please contact [email protected]. Thanks!