- [2025-07-09] We release the human performance scores on our website! The scores displayed across all three leaderboards reflect human evaluators (database experts) who were allowed to use standard tools (database textbooks, official documentation, or IDEs) but not AI assistants. When another group with the same expertise was permitted to use AI tools (ChatGPT, Claude, or Gemini), performance increased to 83.33 on Open, 87.90 on PG, and 90.00 on Flash, demonstrating the significant potential of human-AI collaboration in SQL problem-solving.
- [2025-06-28] We release our paper SWE-SQL (a.k.a. BIRD-CRITIC) on arXiv.
- [2025-06-09] We release bird-interact-lite, featuring multi-turn conversational and agentic interaction for real-world, ambiguous, and challenging text-to-SQL tasks.
- [2025-06-08] We release bird-critic-1.0-postgresql, a single-dialect SQL issue-debugging set with 530 complex tasks.
- [2025-05-30] We are pleased to release LiveSQLBench-Base-Lite, featuring 18 end-user-level databases and 270 tasks (180 SELECT-only, 90 management tasks). Each task involves unambiguous and straightforward user queries grounded in external knowledge, with medium-to-hard complexity SQL statements.
BIRD-Critic 1.0 introduces a novel SQL benchmark designed to evaluate a key capability: Can large language models (LLMs) diagnose and solve user issues within real-world database environments?
The benchmark comprises 600 tasks for development and 200 held-out out-of-distribution (OOD) tests. BIRD-CRITIC 1.0 is built on realistic user issues across 4 prominent SQL dialects: MySQL, PostgreSQL, SQL Server, and Oracle. It expands beyond simple SELECT queries to cover a wider range of SQL operations, reflecting actual application scenarios. Finally, an optimized execution-based evaluation environment is included for rigorous and efficient validation.
Each task in BIRD-CRITIC has been verified by human experts on the following dimensions:
- Reproduction of errors on the BIRD environment to prevent data leakage.
- Carefully curated test case functions for each task specifically:
  - Soft EX: This metric can evaluate SELECT-only tasks.
  - Soft EX + Parsing: This metric can evaluate tasks with user-specific requirements or refinements.
  - Test Case: For DBA tasks, such as CRUD (CREATE, READ, UPDATE, DELETE), test cases are designed to evaluate the correctness of the logic. This is also effective for user issues that require multiple sequential SQL queries to resolve (a minimal sketch follows this list).
  - Query Execution Plan: For user tasks involving efficiency improvements or runtime errors, the QEP (Query Execution Plan) can be used to evaluate solution SQL queries at the algorithm level.
- Fast Eval Sandbox via PostgreSQL templates & Docker.
- Created new RDBs at different scales and in professional domains.
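For illustration, a test case function conceptually receives a live database connection after the predicted SQL has run and asserts on the resulting database state. The sketch below is hypothetical (the table, column, and expected values are invented, and a psycopg2-style connection is assumed); the actual test harness lives in `./evaluation`.

```python
# Hypothetical test case sketch; real BIRD-CRITIC test cases are task-specific.
def test_case_update_balance(conn) -> bool:
    """Verify the database state after the predicted (corrected) SQL has executed."""
    with conn.cursor() as cur:  # assumes a psycopg2-style connection
        # Suppose the user's issue was a failed credit of 100.00 to account 42,
        # starting from a balance of 50.00 set up by preprocess_sql.
        cur.execute("SELECT balance FROM accounts WHERE account_id = %s", (42,))
        row = cur.fetchone()
        # The fix is accepted only if the expected post-update state is reached.
        return row is not None and float(row[0]) == 150.00
```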
We are releasing a lite version of BIRD-Critic, `bird-critic-1.0-flash-exp`, which includes 200 high-quality user issues encountered while developing real-world applications with PostgreSQL. We curate tasks by:
- Collecting and understanding realistic user issues.
- Distilling problem definitions and SQL knowledge.
- Reproducing bugs and solutions in the BIRD environment.
- Designing test cases for evaluation.
The open version of BIRD-CRITIC 1.0, `bird-critic-1.0-open`, is a comprehensive benchmark that includes 600 tasks across 4 SQL dialects: MySQL, PostgreSQL, SQL Server, and Oracle. It covers a wide range of SQL operations and user issues.
| Rank | Model Name | Score | Level |
|---|---|---|---|
| 1 | o3-mini-2025-01-31 | 34.50 | Leading |
| 2 | deepseek-reasoner (r1) | 33.67 | Elite |
| 3 | o1-preview-2024-09-12 | 33.33 | Elite |
| 4 | claude-3-7-sonnet-20250219 (thinking) | 30.67 | Elite |
| 5 | gemini-2.0-flash-thinking-exp-01-21 | 30.17 | Elite |
| 6 | grok-3-beta | 29.83 | Superior |
Complete results for the Open version can be found here. BIRD-CRITIC 1.0 Flash results can be found here.
The BIRD-CRITIC 1.0 benchmark is available in the following configurations:
- `bird-critic-1.0-flash-exp`: A lite version consisting of 200 instances on PostgreSQL.
- `bird-critic-1.0-open`: The full version containing 600 instances across MySQL, PostgreSQL, SQL Server, and Oracle.
- `bird-critic-1.0-postgresql`: A 600-instance version specifically for PostgreSQL.
- `bird-critic-1.0-bigquery`: A lite version containing between 100 and 200 instances for BigQuery.
- Database: The complete databases can be downloaded from Google Drive. Check the Quick Eval section for more details.
- data: Each data instance contains the following main fields (an illustrative example follows this list):
  - `db_id`: The name of the database.
  - `query`: The user query, rewritten in the BIRD environment.
  - `issue_sql`: The buggy SQL query written by the user.
  - `sol_sql`: The ground-truth SQL solution.
  - `preprocess_sql`: SQL queries to run before executing the solution or prediction.
  - `clean_up_sql`: SQL queries to run after the test cases to revert any changes made to the database.
  - `test_cases`: A set of test cases to validate the predicted (corrected) SQL.
  - `efficiency`: True if the task requires optimization; the cost is measured by the Query Execution Plan (QEP).
  - `external_data`: External JSON data, if present.
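For orientation, a single instance might look roughly like the following. All values are invented for illustration, and the exact value types (e.g., string vs. list of strings for the SQL fields) may differ from the released files.

```python
# Hypothetical data instance; values are invented for illustration only.
example_instance = {
    "db_id": "financial",
    "query": "My query double-counts transfers between the same pair of accounts.",
    "issue_sql": "SELECT account_id, SUM(amount) FROM transfers GROUP BY account_id;",
    "sol_sql": "",         # ground-truth solution; withheld in the public release
    "preprocess_sql": "",  # setup statements run before the solution/prediction
    "clean_up_sql": "",    # statements that revert changes after the test cases
    "test_cases": [],      # validation functions; withheld in the public release
    "efficiency": False,   # True only for optimization tasks evaluated via QEP
    "external_data": None,
}
```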
- baseline: The baseline code is available in the `./baseline` directory.
- evaluation: The evaluation code is available in the `./evaluation` directory.
- Curated by: BIRD Team & Google Cloud
- License: cc-by-sa-4.0
- HuggingFace Dataset Card: bird-critic-1.0-flash-exp
To avoid data leakage through auto-crawling, we do not include the ground-truth solution SQLs and test cases with the data. Please email bird.bench23@gmail.com or bird.bench25@gmail.com for the full set, which will be sent automatically.
You can download the dataset from HuggingFace using the following command:
from datasets import load_dataset
# Load the flash version of the dataset
dataset = load_dataset("birdsql/bird-critic-1.0-flash-exp")
print(dataset["flash"][0])
# Load the open version of the dataset
dataset = load_dataset("birdsql/bird-critic-1.0-open")
print(dataset["open"][0])
Or you can use the provided script to download the open version of the dataset and split it into different dialects.
cd baseline/data
# --input_path: path to the input JSONL file (may be left empty to download the dataset from HuggingFace)
# --output_folder: output folder for the split files
python pull_data.py \
    --schema_path path/to/open_schema.jsonl \
    --input_path path/to/input.jsonl \
    --output_folder path/to/output_dir
.
├── LICENSE
├── README.md
├── baseline
│   ├── data
│   ├── outputs
│   ├── run
│   └── src
├── evaluation
│   ├── docker-compose.yml
│   ├── env
│   ├── mssql_table_dumps
│   ├── mysql_table_dumps
│   ├── oracle_table_dumps
│   ├── postgre_table_dumps
│   ├── run
│   └── src
├── materials
│   └── ...
└── requirements.txt
To run the baseline code you need to install the following dependencies:
conda create -n bird_critic python=3.10 -y
conda activate bird_critic
pip install -r requirements.txt
You also need to set up the model name (e.g., gpt-4o-2024-08-06) and the API key in the `config.py` file. Then you can run the following commands to generate the output:
# Generate the prompt
cd baseline/run
bash generate_prompt.sh
# LLM Inference, need to set the API key in config.py
bash run_baseline.sh
The output will be saved in the `./baseline/outputs/final_output/` directory.
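For reference, `config.py` only needs to carry the model identifier and the corresponding API key. The layout below is hypothetical (the variable names are illustrative, not the repository's actual ones); check the file shipped in `./baseline` for the real structure.

```python
# config.py -- hypothetical layout; variable names are illustrative only.
MODEL_NAME = "gpt-4o-2024-08-06"  # model used for LLM inference
API_KEY = "sk-..."                # key for the chosen model provider
TEMPERATURE = 0.0                 # decoding temperature, if the runner supports it
```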
We use Docker to provide a consistent environment for running the benchmark. To set up the environment, follow these steps:
- First, download the PostgreSQL, MySQL, SQL Server, and Oracle databases from Google Drive.
- Unzip the folders and save them in `./evaluation`, named `postgre_table_dumps`, `mssql_table_dumps`, `mysql_table_dumps`, and `oracle_table_dumps`.
- Build the Docker Compose stack:
cd evaluation
docker compose up --build
- Interact with the database:
  You can use the `perform_query_on_{dialect}_databases()` function in the `evaluation/src/{dialect}_utils.py` file to interact with each database. The function returns the result of the query (a hedged usage sketch is given after the evaluation instructions below).
- Run the evaluation script inside the `so_eval_env` container:
docker compose exec so_eval_env bash
cd run
bash run_eval.sh
You have to specify the dialect you want to evaluate in the `run_eval.sh` script. The options are:
- `postgresql`
- `mysql`
- `sqlserver`
- `oracle`

The output report file will be saved in the same folder as your input file. If you want a log file for each instance, set `--logging` to `true` in the `run_eval.sh` script.
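As mentioned above, each dialect module in `evaluation/src/` exposes a `perform_query_on_{dialect}_databases()` helper. The snippet below is a hedged usage sketch only: the parameter names (`query`, `db_name`) and the import path are assumptions, so check `evaluation/src/postgresql_utils.py` for the real signature.

```python
# Hypothetical usage of the per-dialect query helper (run inside the
# so_eval_env container, e.g. from evaluation/src/); the real signature
# may differ from what is assumed here.
from postgresql_utils import perform_query_on_postgresql_databases

result = perform_query_on_postgresql_databases(
    query="SELECT COUNT(*) FROM information_schema.tables;",
    db_name="financial",  # "financial" is an invented example database name
)
print(result)
```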
If you find our work helpful, please cite as:
@article{li2025swe,
title={SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications},
author={Li, Jinyang and Li, Xiaolong and Qu, Ge and Jacobsson, Per and Qin, Bowen and Hui, Binyuan and Si, Shuzheng and Huo, Nan and Xu, Xiaohan and Zhang, Yue and others},
journal={arXiv preprint arXiv:2506.18951},
year={2025}
}
- Release lite version, bird-critic-1.0-flash (200).
- Open source code, leaderboard page.
- Release Full bird-critic-1.0-open (570 w/ 4 dialects).
- Release Full bird-critic-1.0-postgresql (530 pg tasks).
- Release SIX-GYM (Sql-fIX), with 2,000+ gym-like training environments.
- Release trained agentic baseline BIRD-Fixer.
- Update Agentic (SQL-Act) Baseline.
BIRD Team & Google Cloud