
SQLDRILLER: Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency

SQLDRILLER is a correctness-checking tool for Text-to-SQL translation. It introduces and applies Execution Consistency to automatically discover and fix errors both in Text-to-SQL datasets and in Text-to-SQL model inference results, improving model accuracy. It is presented in the paper "Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency" (SIGMOD '25, https://doi.org/10.1145/3725271).

Contact: yangyicun@sjtu.edu.cn

Repaired Text-to-SQL Datasets Available

Preview ./data/bird/opt/{train/dev}.json and ./data/spider/opt/{train/dev/test}.json for the repaired Text-to-SQL datasets.
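
For a quick peek at a repaired case, a minimal sketch (assuming these files are JSON lists of case objects carrying the same "fixed" key as the sampled files described in the Error Study section below; this is an assumption, not a documented schema):

# Peek at one repaired case; field names other than "fixed" may differ.
import json

with open("data/spider/opt/train.json") as f:
    cases = json.load(f)

fixed_case = next((c for c in cases if c.get("fixed") is not None), None)
print(json.dumps(fixed_case, indent=2))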

Prerequisites

1. Environment Setup

Basic requirements:

  • Ubuntu 20.04 is preferred; other Linux distributions should also work.
  • Anaconda, Python 3.10 (any version > 3.9 works)
  • Docker, Docker Compose (to run VeriEQL for SQL counterexample generation)
  • Java JDK 17, Gradle 7.3.3 (optional, to run SQLSolver for SQL equivalence checking)

Then, run the following commands to set up:

# Create and activate Python environment
conda create -n SQLDriller python=3.10
source activate SQLDriller
# If this fails, try this command instead:
# `conda activate SQLDriller`
python -m pip install --upgrade pip
pip install -r requirements.txt

# Download the required NLTK data
python nltk_downloader.py

# Set up the Docker container for the VeriEQL solver
# (applies small modifications to two files of the original code)
export CURRENT_PATH=$(pwd)
cd ./third_party/ce_gen/solver/
git clone git@github.com:VeriEQL/VeriEQL.git verieql
cp ./modification/main.py ./verieql/main.py
cp ./modification/cli_within_timeout.py ./verieql/parallel/cli_within_timeout.py
sh verieql.sh
cd $CURRENT_PATH

# Clone the repo for model accuracy evaluation
cd ./third_party/
git clone git@github.com:yicun0720/SD-test-suite-accuracy.git test-suite-accuracy
cd $CURRENT_PATH

# Grant execute permission to the shell scripts
chmod 755 ./click_to_run/*.sh

2. OpenAI Key Configuration

Create a .env file in the root of this repo and replace the placeholders below to add your own API key information to the file:

touch .env
echo 'OPENAI_API_KEY = "{your-api-key}"' >> .env
echo 'OPENAI_API_BASE = "{your-api-url}"' >> .env
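
To sanity-check the configuration, a minimal sketch using python-dotenv (whether SQLDriller itself loads the file this way is an assumption):

# Verify that the .env file is readable and both keys are set.
# Assumes python-dotenv is installed (pip install python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "OPENAI_API_BASE"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")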

3. Resource Downloading

We have prepared the following resources for the experiments; download them into the project root directory ./:

  • Download the prepared resource files for SQLDriller quick-start and result preview: link.
  • Download the prepared generated SQLite files for counterexample generation, accuracy evaluation, and the NL execution microbenchmark: link.
  • Download the SQLite files of each schema of each benchmark:

Unzip them with:

unzip prepared.zip -d .
unzip dbs.zip -d .
unzip schema_for_sample_values_spider.zip -d ./data/spider/
unzip schema_for_sample_values_bird.zip -d ./data/bird/
unzip schema_for_sample_values_beaver.zip -d ./data/beaver/

Result Previewing

Since running SQLDriller on the full datasets and re-training and re-evaluating the models is time-consuming and costly, we provide our evaluated results for quick previewing. The results are in the downloaded ./prepared/ directory:

./prepared/
|-- spider/
  |-- accuracy/
    |-- eval_res/  # Accuracy results of each model
      |-- test/
        |-- pred_refined_train/  # Accuracy results of the model trained on the fixed training set
          |-- original_test/        # Accuracy on original test set
            |-- ...
          |-- refined_test/         # Accuracy on fixed test set
            |-- {model_name}.tsv   
            |-- {model_name}_SQLDRILLER.tsv # Accuracy of using SQLDriller to further optimize model inference
        |-- pred_original_train/ # Accuracy results of the model trained on the original training set
          |-- ...
    |-- pred/  # SQL predictions of each model 
  |-- dataset_refine/
    |-- issues/ # Logs of checking each Text-to-SQL case in the dataset
    |-- logs/   # Logs of LLM output when checking each Text-to-SQL case
    |-- sqls/   # Prepared SQL candidate set pre-generated by base model, used for counterexample generation and candidate selection
  |-- microbench/
    |-- spider_test_{LLM_name}/    # NL execution accuracy and logs
    |-- ...
|-- bird/
  |-- ...
|-- stat.xlsx    # The summary of dataset refining results

For detailed guides to reproducing these results, see the Scripts to Run section below.

Scripts to Run

1. Text-to-SQL Error Study

For the sampled cases in the training datasets, view the files data/spider/opt/train_sampled_all.json and data/bird/opt/train_sampled_all.json to compare the original SQLs with our fixes. If the value of the key "fixed" is None, the original SQL is correct.
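
For a quick tally of how many sampled cases contain errors, a minimal sketch (assuming the file is a JSON list of case objects; only the "fixed" key and its None-means-correct semantics come from this repo):

# Count sampled cases whose gold SQL was erroneous, i.e., whose "fixed"
# value is not None (JSON null).
import json

with open("data/spider/opt/train_sampled_all.json") as f:
    cases = json.load(f)

errors = sum(1 for case in cases if case.get("fixed") is not None)
print(f"{errors}/{len(cases)} sampled cases had an erroneous gold SQL")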

For the error-study statistics, run the following script to reproduce the results.

./click_to_run/err_study.sh

The results are written to ./results/study/ (refer to Figures 2 and 3 and Table 1 in the paper):

./results/study/error/
|-- Table1_error_rate_by_difficulty.txt
|-- Figure2_error_rate_per_schema.txt
|-- Figure2_error_rate_per_schema_cdf.pdf
|-- Figure3_Jaccard_similarity.pdf  # Jaccard similarity of original and fixed SQLs in sampled error cases

2. Natural Language (NL) Execution

Run the following scripts to perform Spider and BIRD's NL execution tasks on generated database instances. (The database instances for NL execution have been prepared in ./dbs/microbench/{benchmark}/.)

./click_to_run/nl_exec.sh spider gpt-4o
./click_to_run/nl_exec.sh bird gpt-4o

# You can also try other LLMs; they show similar accuracy, as reported in Table 3 of the paper.
./click_to_run/nl_exec.sh spider gpt-4-turbo
./click_to_run/nl_exec.sh bird gpt-4-turbo

./click_to_run/nl_exec.sh spider o1-preview
./click_to_run/nl_exec.sh bird o1-preview

  • The 1st parameter represents the benchmark name (Options: spider, bird).
  • The 2nd parameter represents the LLM to use (Options: gpt-4o, gpt-4-turbo, o1-preview). gpt-4o is the default due to its comparable performance and low cost and latency.

All related results are in ./results/nl_exec/{benchmark}_{LLM_name}/. Check nl_exec_accuracy.txt in each subdirectory and compare with Table 3 in the paper:

./results/nl_exec/{benchmark}_{LLM_name}/
|-- exec_res/                   # Execution logs of each case
|-- nl_exec_accuracy.txt        # Results of NL execution accuracy for this benchmark and LLM

Note that by default the script only evaluates gpt-4o, to save time and cost: other LLMs show comparable NL execution accuracy (Table 3 in the paper), while gpt-4o has relatively low cost and latency. You can experiment by changing the 2nd parameter to the other LLMs listed above.
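
Conceptually, NL execution asks the LLM to act as the query engine: given a database instance's contents and the natural-language question, the LLM returns the answer directly. A minimal sketch with the OpenAI Python client (the prompt wording and table serialization here are illustrative assumptions, not the repo's actual prompt):

# Illustrative NL execution over a small SQLite instance.
import sqlite3
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def nl_execute(db_path: str, question: str, model: str = "gpt-4o") -> str:
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    dump = "\n".join(
        f"{t}: {conn.execute(f'SELECT * FROM {t}').fetchall()}" for t in tables)
    prompt = (f"Database contents:\n{dump}\n\n"
              f"Question: {question}\nAnswer with the result rows only.")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content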

3. Dataset Refine and Model Accuracy Evaluation

3.1 Dataset Refine

3.1.1 SQLDriller Itself

Run the following command to detect and fix errors in the datasets.

./click_to_run/dataset_refine.sh spider train
./click_to_run/dataset_refine.sh bird train

  • The 1st parameter represents the benchmark name (Options: spider, bird).
  • The 2nd parameter represents the dataset split (Options: train, dev, test). Here we only fix the train set; the dev and test sets have already been manually checked for accuracy evaluation, as mentioned in the paper.

This script launches 8 processes in parallel (taking about 10–15 hours to finish). You can check the progress of each process at any time with:

./click_to_run/dataset_refine_view_progress.sh {spider/bird} train

If the script reports that some processes were aborted and need to be resumed, run the command it recommends, for example:

./click_to_run/dataset_refine.sh {spider/bird} train --resume {partition_id}

3.1.2 LLM Consistency Baseline

Run the following command to do the same using the LLM consistency-based baseline:

./click_to_run/dataset_refine_baseline.sh spider train
./click_to_run/dataset_refine_baseline.sh bird train

3.1.3 Evaluate Them All

Then run the following command to evaluate the effectiveness of SQLDriller and the baseline on error detection and fixing:

./click_to_run/dataset_refine_eval.sh spider
./click_to_run/dataset_refine_eval.sh bird

The results are shown in ./results/dataset_refine/{benchmark}_train/:

./results/dataset_refine/{benchmark}_train/
|-- SQLDriller/     # SQLDriller's results
    |-- exec_res/           # Logs of checking each Text-to-SQL case in the dataset
      |-- 0/                    # A subdir for each case id
        |-- ce.txt                 # Generated counterexample of the case (qtext form)
        |-- gold_pred_ce_res.json  # Gold and predicted SQLs' exec. results on counterexamples
        |-- pred_ce_res.json       # Predicted SQLs' exec. results on counterexamples generated among the candidates, used when their scores tie (see paper for details)
      |-- 1/
      |-- ...
    |-- modified_gold.tsv   # The fixed gold SQLs in fixed cases
    |-- modified_gold_tagged.tsv   # The fixed gold SQLs in fixed cases w/ tagged fix results
    |-- train.json          # The fixed dataset file
|-- LLMConsis_baseline/     # Baseline's results
    |-- exec_res/           # Logs of checking each Text-to-SQL case w/ baseline
      |-- ...
    |-- modified_gold.tsv   # The fixed gold SQLs in fixed cases
    |-- modified_gold_tagged.tsv   # The fixed gold SQLs in fixed cases w/ tagged fix results
|-- statistics.txt    # The statistics of SQLDriller and LLM consistency baseline's effectiveness on error detection and fixing

In ./results/dataset_refine/{benchmark}_train/, view statistics.txt for a summary of the effectiveness evaluation (refer to Tables 5 and 6 in the paper).

Then, ./results/dataset_refine/{benchmark}_train/SQLDriller/train.json can be used to fine-tune existing Text-to-SQL models.
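
To see how a single case is judged, a minimal sketch of the consistency check over gold_pred_ce_res.json (only the file's role, holding both SQLs' execution results on the counterexamples, comes from the listing above; the field names below are illustrative assumptions):

# Flag a case as erroneous if the gold and predicted SQLs disagree on any
# counterexample database. Field names are assumptions for illustration.
import json

case_dir = "results/dataset_refine/spider_train/SQLDriller/exec_res/0"
with open(f"{case_dir}/gold_pred_ce_res.json") as f:
    res = json.load(f)

if res["gold_results"] != res["pred_results"]:
    print("gold and predicted SQLs are inconsistent: case flagged")
else:
    print("execution results agree on all counterexamples")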

3.2 Model Accuracy Evaluation

Run the following command to use SQLDriller to improve model inference accuracy directly.

./click_to_run/inference_optimize.sh spider test dail
./click_to_run/inference_optimize.sh spider test din
./click_to_run/inference_optimize.sh spider test resd
./click_to_run/inference_optimize.sh spider test graphix-T5

./click_to_run/inference_optimize.sh bird dev sftcodes
./click_to_run/inference_optimize.sh bird dev codes

  • The 1st parameter represents the benchmark name (Options: spider, bird).
  • The 2nd parameter represents the dataset split: test for Spider and dev for BIRD.
  • The 3rd parameter represents the model. Each evaluated model has a file with multiple prediction candidates for each test case; a simplified sketch of candidate selection follows this list.
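
At a high level, inference optimization executes each candidate SQL on the counterexample databases and keeps the candidate whose results best match the NL-executed answers. A simplified sketch (the real scoring and tie-breaking are more involved, see the paper; all names here are illustrative):

# Simplified execution-consistency candidate selection.
import sqlite3

def select_candidate(candidates, ce_dbs, nl_answers):
    """Pick the candidate agreeing with the NL-executed answers most often."""
    def score(sql):
        hits = 0
        for db_path, expected in zip(ce_dbs, nl_answers):
            try:
                rows = sqlite3.connect(db_path).execute(sql).fetchall()
            except sqlite3.Error:
                continue  # candidates that fail to run earn no points
            hits += rows == expected
        return hits
    return max(candidates, key=score)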

Run the following command to do the same using the LLM consistency-based baseline:

./click_to_run/inference_optimize_baseline.sh spider test dail
./click_to_run/inference_optimize_baseline.sh spider test din
./click_to_run/inference_optimize_baseline.sh spider test resd
./click_to_run/inference_optimize_baseline.sh spider test graphix-T5

./click_to_run/inference_optimize_baseline.sh bird dev sftcodes
./click_to_run/inference_optimize_baseline.sh bird dev codes

Then run the following command to summarize the accuracy improvements of all the evaluated models:

./click_to_run/inference_accuracy_eval.sh

In ./results/inference/, view accuracy_improvement.tsv for a summary of accuracy improvements (refer to Table 4 in the paper) and Figure11_accuracy_improvement_breakdown.pdf for the accuracy improvement breakdown (refer to Figure 11 in the paper). View ./results/inference/{model_name}/*.tsv for the accuracy evaluation results of each model (the bottom line of each .tsv file shows the overall accuracy):

./results/inference/
|-- {model_name}/   # accuracy results of each model
  |-- {model_name}_original_train.tsv
  |-- {model_name}_refined_train.tsv
  |-- {model_name}_opt_SQLDriller.tsv
  |-- {model_name}_opt_llmconsis.tsv
|-- accuracy_improvement.tsv
|-- Figure11_accuracy_improvement_breakdown.pdf

Note: for each model, the evaluation outputs 4 accuracy results:

  • {model_name}_original_train.tsv: accuracy w/ original train set
  • {model_name}_refined_train.tsv: accuracy w/ refined train set
  • {model_name}_opt_SQLDriller.tsv: accuracy w/ refined train set and SQLDriller's optimization of model inference
  • {model_name}_opt_llmconsis.tsv: accuracy w/ refined train set and LLM consistency baseline's optimization of model inference
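
To pull out just the overall numbers, a small sketch that prints the bottom line of each .tsv (per the note above that the last line holds the overall accuracy; the column layout itself is an assumption):

# Print the overall accuracy (last line) of each result .tsv for one model.
import glob

for path in sorted(glob.glob("results/inference/dail/*.tsv")):
    with open(path) as f:
        last_line = f.read().rstrip("\n").splitlines()[-1]
    print(f"{path}: {last_line}")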

Reference

@article{10.1145/3725271,
  author = {Yang, Yicun and Wang, Zhaoguo and Xia, Yu and Wei, Zhuoran and Ding, Haoran and Piskac, Ruzica and Chen, Haibo and Li, Jinyang},
  title = {Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency},
  year = {2025},
  issue_date = {June 2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {3},
  number = {3},
  url = {https://doi.org/10.1145/3725271},
  doi = {10.1145/3725271},
  journal = {Proc. ACM Manag. Data},
  month = jun,
  articleno = {134},
  numpages = {28},
  keywords = {SQL query equivalence, execution consistency, language model, natural language interface for databases, text-to-SQL}
}
