📄 arXiv | 🤗 Hugging Face
| Resource | Link |
|---|---|
| 🤗 MTSQL-R1 (4B) | MTSQL-R1 (4B) (to be released after internal review) |
| 🤗 MTSQL-R1 (1.7B) | MTSQL-R1 (1.7B) (to be released after internal review) |
| 🤗 Dataset | CoSQL-Long-Horizon-SFT-RL-Data (to be released after internal review) |
| 🤗 Dataset | SParC-Long-Horizon-SFT-RL-Data (to be released after internal review) |
| Code for SFT | To be released after internal review |
| Code for RL | To be released after internal review |
- 🌟 Highlights
- 📖 Introduction
- ⚙️ Configuration
- 🔄 Training Framework
- 📈 Training Dynamics
- 📊 Experiment Results
- 🙏 Acknowledgements
- 📫 Contact
 
Our approach enables:
- Environment-based verification: The model interacts dynamically with two components, (i) a database for execution feedback and (ii) a long-term dialogue memory for explicit coherence checking, to verify intermediate SQL outputs.
- Self-correction: Based on verification feedback, the model iteratively refines its generated SQL queries to achieve consistent, executable outputs across multiple turns.
- Autonomous end-to-end learning: The model learns the full action loop (Propose, Execute, Verify, Self-Correct) to generate better SQL; a minimal sketch of this loop follows the list.
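
The interaction pattern can be pictured with the minimal sketch below. It assumes a SQLite database and treats the model's proposal step and the memory coherence check as injected callables (`propose_sql`, `verify_coherence`); these names are illustrative placeholders, not the released implementation.

```python
# Minimal, illustrative sketch of the Propose -> Execute -> Verify -> Self-Correct loop.
# All names (propose_sql, verify_coherence, agent_turn) are hypothetical placeholders;
# the released code may differ.
import sqlite3
from typing import Optional

def execute_sql(db_path: str, sql: str) -> tuple[bool, str]:
    """Run the candidate SQL against the database and return (success, feedback)."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
        if not rows:
            return False, "execution returned no rows"  # the null-return case
        return True, f"returned {len(rows)} rows"
    except sqlite3.Error as err:
        return False, f"execution error: {err}"

def agent_turn(question: str, db_path: str, memory: list[str],
               propose_sql, verify_coherence, max_retries: int = 3) -> Optional[str]:
    """One dialogue turn: propose, execute, verify against dialogue memory, self-correct."""
    feedback = ""
    for _ in range(max_retries):
        sql = propose_sql(question, memory, feedback)      # Propose
        ok_exec, exec_msg = execute_sql(db_path, sql)      # Execute (database feedback)
        ok_mem, mem_msg = verify_coherence(sql, memory)    # Verify (dialogue-memory check)
        if ok_exec and ok_mem:
            memory.append(sql)                             # keep for later turns
            return sql
        feedback = f"{exec_msg}; {mem_msg}"                # Self-Correct on the next attempt
    return None
```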
 
- verl == 0.4.1
- LLaMA-Factory == 0.9.3
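
As a quick sanity check, the pinned versions can be verified at runtime. The distribution names `verl` and `llamafactory` below are assumed pip package names; adjust them if your environment installs these projects under different names.

```python
# Sanity-check the pinned dependency versions.
# The distribution names ("verl", "llamafactory") are assumptions about the pip
# package names; change them if your installation differs.
from importlib.metadata import version, PackageNotFoundError

EXPECTED = {"verl": "0.4.1", "llamafactory": "0.9.3"}

for pkg, want in EXPECTED.items():
    try:
        got = version(pkg)
        status = "OK" if got == want else f"expected {want}"
        print(f"{pkg}: {got} ({status})")
    except PackageNotFoundError:
        print(f"{pkg}: not installed (expected {want})")
```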
- Step 1: Random sampling with high temperature to generate natural reasoning trajectories
- Step 2: Difficulty-aware rejection sampling
- Step 3: SFT on tool-integrated multi-turn trajectories with loss masking (see the sketch after this list)
- Step 4: Update the dataset and model, then repeat
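
The key detail in Step 3 is loss masking over tool-integrated trajectories: only tokens the assistant generates (reasoning, function calls, SQL) contribute to the loss, while user turns and tool/database observations are masked with the usual ignore index. The sketch below is a generic illustration under these assumptions, not the released preprocessing code.

```python
# Illustrative loss masking for tool-integrated multi-turn trajectories.
# Assumption: a trajectory is a list of (role, token_ids) segments; only assistant
# tokens are trained on, everything else gets the ignore index (-100) so
# cross-entropy skips those positions.
IGNORE_INDEX = -100
TRAINABLE_ROLES = {"assistant"}  # user turns and tool/database observations are masked

def build_inputs_and_labels(segments: list[tuple[str, list[int]]]):
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role in TRAINABLE_ROLES:
            labels.extend(token_ids)                        # learn to generate these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # context only, no loss
    return input_ids, labels

# Example: user question, assistant SQL proposal, tool (execution) feedback, assistant fix.
trajectory = [
    ("user", [101, 102, 103]),
    ("assistant", [201, 202, 203, 204]),
    ("tool", [301, 302]),
    ("assistant", [205, 206]),
]
ids, lbls = build_inputs_and_labels(trajectory)
assert len(ids) == len(lbls)
```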
 
- Step 1: Curriculum data partitioning by difficulty
- Step 2: Outcome and process reward design (see the sketch after this list)
- Step 3: Multi-turn RL with loss masking
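
One plausible way to combine the two signals from Step 2 is a weighted sum of an execution-match outcome reward and small process bonuses for well-formed tool calls and successful self-correction. The specific terms and weights below are illustrative assumptions, not the paper's exact reward formulation.

```python
# Illustrative outcome + process reward for one multi-turn rollout.
# The bonus terms and weights are assumptions; the actual reward shaping may differ.
from dataclasses import dataclass

@dataclass
class RolloutStats:
    execution_match: bool    # final SQL matches the gold execution result
    valid_tool_calls: int    # well-formed Propose/Execute/Verify calls
    total_tool_calls: int
    self_corrected: bool     # recovered from an execution or coherence failure

def compute_reward(stats: RolloutStats,
                   w_outcome: float = 1.0,
                   w_format: float = 0.2,
                   w_correct: float = 0.2) -> float:
    outcome = 1.0 if stats.execution_match else 0.0
    fmt = (stats.valid_tool_calls / stats.total_tool_calls
           if stats.total_tool_calls else 0.0)
    correction = 1.0 if stats.self_corrected else 0.0
    return w_outcome * outcome + w_format * fmt + w_correct * correction

# Example: a rollout that fixed a failed first attempt and matched execution.
print(compute_reward(RolloutStats(True, 5, 6, True)))
```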
 
The dynamics of reward score and response length during training:
The dynamics of test score across different training checkpoints:
Key Findings and Takeaways:
- Warm-start SFT and RL both provide performance gains.
- Small LLMs (1.7B/4B) struggle to follow long-horizon function-calling instructions.
- Conventional SFT attains good Exact Match but weaker logical consistency (Execution Match), while long-horizon training achieves better Execution Match.
- Long-horizon reasoning yields larger gains on multi-turn dialogues and complex questions.
- Long-horizon RL substantially improves out-of-domain performance.
- Dense process rewards help the model learn from harder examples, further boosting performance compared with sparse, outcome-only rewards.
- Stronger function calling, verification, and self-correction correlate with better SQL performance.
- With long-horizon actions and training, the agent learns to resolve execution failures (even null-return cases, which we call the aha moment of Text-to-SQL) and coherence errors.
 
The evolution of different long-horizon abilities and the related Execution Match performance for the 4B and 1.7B models:
We would like to express our gratitude to the open-source community for their valuable contributions:
- verl: https://github.com/volcengine/verl
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory
- SGLang: https://github.com/sgl-project/sglang
- vLLM: https://github.com/vllm-project/vllm
- DB-GPT-Hub: https://github.com/eosphoros-ai/DB-GPT-Hub
- CoSQL: https://github.com/taoyds/cosql
- SParC: https://github.com/taoyds/sparc
- Search-R1: https://github.com/PeterGriffinJin/Search-R1
 
...and many other open-source projects.
For any issues or discussions, please contact [email protected]. Thanks!