- 🔥🔥🔥 [2025-09-04] We are pleased to release LiveSQLBench-Base-Full v1, a new release with 600 NEW tasks over 22 NEW real, complex databases with KB docs. New features: more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values. See the dataset and leaderboard for details.
- 🚀 [2025-08-26] We're excited to announce the release of the BIRD-Interact-Full set, which is built on top of LiveSQLBench-Base-Full-v1. It focuses on evaluating the interactive ability of LLMs with both Database and User environments to finish text-to-SQL tasks. Please visit our project website for more details.
- 📢 [2025-07-28] We are pleased to release LiveSQLBench-Base-Lite-SQLite, extending from the PostgreSQL to the SQLite dialect to improve accessibility. Please check the BIRD-Mini-Dev v2 repository and dataset for more details.
- 📢 [2025-05-30] We are pleased to release LiveSQLBench-Base-Lite, featuring 18 end-user level databases and 270 tasks (180 SELECT-only, 90 Management tasks). Each task involves unambiguous and straightforward user queries grounded in external knowledge, with medium to hard complexity SQL statements.
LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, continuously evolving benchmark designed to evaluate LLMs on complex, real-world text-to-SQL tasks, featuring diverse real-world user queries, including Business Intelligence (BI), CRUD operations, and more. Each release will include 50 new, fully open-source DBs curated by the BIRD team through expert collaboration and continuous improvement. It will cover a wide range of database sizes, from end-user level (around 127 columns) to industrial level (1340+ columns).
Here are the features of the LiveSQLBench benchmark:
- 🗄️ Live Databases: Constructed dynamically from extensive and regularly updated CSV datasets, with both base (end-user level) and large (industrial level, 1340+ columns per DB) versions to test scalability.
- 💬 Live User Queries and SQL: Each task pairs an unambiguous user query with an annotated, gold-standard SQL statement. The user queries are grounded in an external knowledge base, with medium to hard complexity solution SQL statements.
- 🧠 Contextual Reasoning (HKB): Every DB includes a hierarchical knowledge base (HKB) in which each knowledge item may depend on others, requiring multi-hop reasoning. Two HKB formats are provided: (1) structured JSON and (2) unstructured document; a hypothetical example follows this list.
- 🔍 The First Full SQL Spectrum: Supports not just SELECT (Business Intelligence) queries, but also CRUD (e.g., UPDATE, CREATE, and other database management operations) queries.
- ⚡ Automated Evaluation: Supports fast evaluation via PostgreSQL templates and Docker. Each question includes verifiable test cases for accurate, reproducible scoring. A soft EX metric is used to evaluate SELECT-only tasks; customized test cases are designed for DBA tasks such as CRUD (CREATE, READ, UPDATE, DELETE).
- 🔄 Truly Live & Hidden Test: New databases and tasks are added over time. Each release features both open development and hidden test phases. The hidden test set from each release becomes the open development set for the next release, ensuring continuous evolution and fair evaluation.
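For illustration of the HKB idea, a hypothetical HKB-JSON entry might look like the sketch below. The field names and the DPQ formula are our assumptions for illustration, not the benchmark's actual schema; the `dependencies` field is what forces multi-hop reasoning.

```python
# Hypothetical HKB-JSON entry, written as a Python dict for illustration.
# Field names and the formula are assumptions, not the actual HKB schema.
hkb_entry = {
    "id": 12,
    "knowledge": "Digital Preservation Quality (DPQ)",
    "description": "DPQ = 0.6 * scan_coverage + 0.4 * metadata_completeness",
    # Answering a task that uses DPQ first requires resolving the knowledge
    # items it depends on -- hence multi-hop reasoning over the HKB.
    "dependencies": [3, 7],
}
```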
We have released LiveSQLBench-Base-Lite and LiveSQLBench-Base-Full, featuring:
- New end-user level databases with complex relationships and noisy schemas and data.
- New tasks:
  - covering both SELECT-only and Management tasks;
  - Lite version has straightforward queries, while Full version has more natural, reasoning-intensive queries;
  - both versions are free of ambiguity;
  - grounded in external knowledge, with medium to hard complexity SQL statements.
- HKB for multi-hop retrieval and reasoning.
- Database: The databases can be downloaded from livesqlbench-base-lite and livesqlbench-base-full.
- Data Fields:
  - `instance_id`: Unique task identifier
  - `selected_database`: Associated database name
  - `query`: User query
  - `sol_sql`: Ground truth SQL solution
  - `external_knowledge`: IDs of required external knowledge
  - `preprocess_sql`: SQL setup queries
  - `clean_up_sql`: SQL queries to reset database state
  - `test_cases`: Test cases to validate the predicted SQL
  - `category`: "Query" (SELECT-only) or "Management" (CRUD)
  - `high_level`: Boolean for high-level description
  - `conditions`: Indicates decimal/distinct conditions
  - `difficulty_tier`: Task difficulty (Simple, Moderate, Challenging)
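As a quick sanity check, the public task file can be loaded line by line with standard JSON tooling; a minimal sketch (file path assumed relative to the repo root):

```python
import json

# Load the public task file and inspect the fields listed above.
# Gated fields (sol_sql, test_cases, external_knowledge) are absent
# until the ground-truth file from the email is integrated.
with open("livesqlbench_data.jsonl") as f:
    tasks = [json.loads(line) for line in f]

task = tasks[0]
print(task["instance_id"], task["selected_database"], task["category"])
print(task["query"])
```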
Data viewer: Explore our data through the data viewer on our website, livesqlbench.ai.
🔐 To avoid data leakage by auto-crawling, certain fields (e.g., `sol_sql`, `test_cases`, `external_knowledge`) are excluded from the public dataset `livesqlbench_data.jsonl`. For the full dataset, please email 📧 [email protected] with the subject tag `[livesqlbench-base-lite GT&Test Cases]` for the Lite version or `[livesqlbench-base-full-v1 GT&Test Cases]` for the Full version; it will be sent automatically.
Download the dataset containing each DB's HKB, column meanings, schema, and the `livesqlbench_data.jsonl` file:

```bash
cd livesqlbench
git clone https://huggingface.co/datasets/birdsql/livesqlbench-base-lite
```
Integrate the data file containing the annotated fields (obtained from the email) into the `livesqlbench_data.jsonl` file by running the following command:

```bash
python integrate_gt_data.py --gt_file <path_to_gt_file>
```
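Conceptually, this step merges the gated fields into each public record. A rough sketch of that idea, assuming records in both files are matched on `instance_id` (the actual `integrate_gt_data.py` may differ in details such as file paths):

```python
import json

def merge_gt(public_path: str, gt_path: str, out_path: str) -> None:
    # Index the ground-truth records by instance_id.
    with open(gt_path) as f:
        gt = {rec["instance_id"]: rec for rec in map(json.loads, f)}
    # Add the gated fields (sol_sql, test_cases, ...) to each public record.
    with open(public_path) as f, open(out_path, "w") as out:
        for rec in map(json.loads, f):
            rec.update(gt.get(rec["instance_id"], {}))
            out.write(json.dumps(rec) + "\n")

merge_gt("livesqlbench_data.jsonl", "gt_file.jsonl", "livesqlbench_data_full.jsonl")
```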
To run the baseline code, install the following dependencies:

```bash
conda create -n livesqlbench python=3.10 -y
conda activate livesqlbench
pip install -r requirements.txt
```
You also need to set the model name (e.g., gpt-4o-2024-11-20) and the corresponding API key in the `config.py` file.
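A minimal sketch of what `config.py` might contain (the variable names here are assumptions; check the actual file shipped with the repo):

```python
# Hypothetical config.py sketch -- the real file's variable names may differ.
model_name = "gpt-4o-2024-11-20"  # model used for inference
api_key = "sk-..."                # your provider's API key
```

Then run the following commands to generate the output: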
```bash
# Generate the prompt
cd baseline/run
bash generate_prompt.sh

# LLM inference; requires the API key set in config.py
bash run_baseline.sh
```
The output will be saved in `./evaluation/outputs/final_output/`.
We use Docker to provide a consistent environment for running the benchmark. To set up the environment, follow these steps:

- First, download the PostgreSQL database dumps from Google Drive.
- Unzip the folder and save it in `./evaluation`, named `postgre_table_dumps`.
- Build the Docker compose:

```bash
cd evaluation
docker compose up --build
```
- Interact with the PostgreSQL database (optional): Use the `perform_query_on_postgresql_databases()` function in the `evaluation/src/db_utils.py` file to interact with the PostgreSQL database. `query` is the SQL query you want to run, and `db_name` is the name of the database to run it on; the function returns the result of the query. A usage sketch is shown below.
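For example (a sketch only: the import path, table, and database names below are made up; adjust them to your setup):

```python
# Assumes this runs where evaluation/src/db_utils.py is importable.
from src.db_utils import perform_query_on_postgresql_databases

result = perform_query_on_postgresql_databases(
    "SELECT COUNT(*) FROM archaeological_sites;",  # hypothetical table
    "archaeology_db",                              # hypothetical database name
)
print(result)
```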
- Run the evaluation script inside the `so_eval_env` container:

```bash
docker compose exec so_eval_env bash
cd run
bash run_eval.sh
```
The output will be saved in `./evaluation/outputs/final_output/`. If you want a log file for each instance, set `--logging` to `true` in the `run_eval.sh` script.
Similar to the above, but with the following changes:

- (Optional) If you have already built the Docker environment for the Lite version, you can remove it or build a new one for the Full version.
- Change the database dumps (`postgre_table_dumps`) and the build script to the Full version from Google Drive.
- Change the database metafiles and `livesqlbench_data.jsonl` to the Full version from Hugging Face.
- Use the newly generated prompts for the Full version to get LLM outputs.
- Use the new evaluation environment to evaluate the LLM outputs.
LiveSQLBench-Base-Lite (2025-05-28)
| Rank | Model | Success Rate (%) | Avg. Cost (USD) / Task |
|---|---|---|---|
| 🥇 1 | o3-mini | 47.78 | 0.0233 |
| 🥈 2 | GPT-4.1 | 44.10 | 0.0336 |
| 🥉 3 | Claude Sonnet 4 | 42.59 | 0.0623 |
More results can be found here.
- 🚀 LiveSQLBench-Base-Lite: 18 NEW databases and 270 NEW tasks with straightforward, direct queries.
- 🚀 LiveSQLBench-Base-Full: 22 NEW databases and 600 NEW tasks with more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values.
- 🚀 LiveSQLBench-Large-Lite: Industrial-scale databases with 1340+ columns.
- 🚀 LiveSQLBench-Large-Full: Comprehensive large-scale datasets.
| Feature | LiveSQLBench-Base-Lite | LiveSQLBench-Base-Full | LiveSQLBench-Large-Full |
|---|---|---|---|
| User Tasks | • 270 tasks<br>• Clear, direct queries with explicit DB/HKB connections<br>• Example1* | • 600 tasks<br>• Natural, colloquial queries with implicit DB/HKB connections<br>• Example2* | • 600 tasks<br>• Natural, colloquial queries with implicit DB/HKB connections<br>• Example2*, but on large (industrial-scale) DBs |
| Database | • 18 base databases<br>• ~127 columns per DB<br>• Simple 1:1 relationships<br>• Clean data (no nulls, consistent formats) | • 22 base databases<br>• ~127 columns per DB<br>• Complex relationships (1:1, 1:N, N:1, N:N)<br>• Real-world data quality (e.g., nulls, duplicates, inconsistent formats) | • 22 large databases<br>• ~1,340 columns per DB<br>• Complex relationships (1:1, 1:N, N:1, N:N)<br>• Real-world data quality (e.g., nulls, duplicates, inconsistent formats) |
| Hierarchical Knowledge Base (HKB) | • Structured HKB-JSON format only | • Dual format support:<br>1. Structured HKB-JSON<br>2. Unstructured HKB-Document (coming soon) | • Dual format support:<br>1. Structured HKB-JSON<br>2. Unstructured HKB-Document (coming soon) |
* Example1 (more formal): "For our archaeological site evaluation, I need to quantify the Digital Preservation Quality metrics across our collection. Please compute a comprehensive DPQ index for each archaeological location. Present the results in descending order of DPQ values, displaying only the site identification code, site designation, and calculated DPQ value (rounded to two decimal places) to facilitate prioritization of our digital preservation resources."
* Example2 (more colloquial): "To better allocate our digital preservation resources, I need to identify which archaeological sites have the most robust digital records. Can you generate a ranked list showing each site's primary identifier, its designation, and a metric quantifying its digital preservation quality? Please sort the list to show the highest-quality sites first, with the quality metric rounded to the second decimal."
Want new dialects? Vote for new SQL dialects 🗳️ here!
BIRD Team & Google Cloud
If you find this work useful, please cite:
```bibtex
@misc{livesqlbench2025,
  author       = {BIRD Team},
  title        = {LiveSQLBench: A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks},
  year         = {2024},
  howpublished = {https://github.com/bird-bench/livesqlbench},
  note         = {Accessed: 2025-05-22}
}
```
Changelog
All notable changes to this project will be documented in this file.
- Released LiveSQLBench-Base-Full v1 with 600 NEW tasks over 22 NEW real, complex databases with KB docs.
- New features: more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values.
- Updated the leaderboard with new LLM results (e.g., GPT-5, gemini-2.5-pro).
- SQL evaluation postprocessing improvements:
  - Removed rounding in the SQL postprocessing step.
  - Added rounding of execution results to 2 decimal places.
  - Modified comparison logic for ordered conditions (a sketch follows below):
    - using list comparison when "order" is True in conditions;
    - using set comparison when "order" is False in conditions.
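A minimal sketch of this comparison logic (not the benchmark's actual code), assuming query results arrive as lists of row tuples:

```python
# Floats are rounded to 2 decimal places before comparison, matching
# the postprocessing change described above.
def rows_match(pred_rows, gold_rows, ordered):
    def norm(rows):
        return [tuple(round(v, 2) if isinstance(v, float) else v for v in row)
                for row in rows]
    if ordered:  # "order": True -> row order matters -> list comparison
        return norm(pred_rows) == norm(gold_rows)
    return set(norm(pred_rows)) == set(norm(gold_rows))  # order-insensitive
```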