A Streamlit application for side-by-side evaluation of natural language to SQL (NL2SQL) model performance.
This tool allows you to:
- Compare gold-standard SQL queries with LLM-generated SQL queries
- Execute queries via AWS Glue to view and compare results
- Evaluate model performance across multiple criteria
- Visualize query results with customizable charts
- Extract images from the attached EML file
- Edit and save evaluation results to a CSV file
To get started:

1. Install uv for Python environment and package management (e.g., pip install uv).
2. Prepare the eval data:
   - Put the CSV file in the root folder.
   - Put the EML file in the root folder if you want to extract images from it; otherwise, put all the images in a folder named images in the root folder.
3. Install the required dependencies:

   uv sync
4. Configure AWS credentials:
   - Ensure you have AWS credentials set up with appropriate permissions for Glue operations.
   - Copy .env.example to .env and fill in the required fields (the AWS database name); see the sketch below.
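A sketch of how such configuration can be read at startup; the use of python-dotenv and the key name GLUE_DATABASE are assumptions, so check .env.example for the actual fields:

```python
import os

from dotenv import load_dotenv  # python-dotenv

# Read .env from the project root. GLUE_DATABASE is a hypothetical key;
# see .env.example for the real field names.
load_dotenv()
database = os.environ["GLUE_DATABASE"]
```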
5. Run the application:

   uv run streamlit run app.py
6. Remember to click "Save Evaluation" in each form to save the results to the working copy.
7. Click "Save All Evaluations" at the end to write all evaluations back to the original CSV file.
In the app, you can:
- Load and view NL2SQL evaluation data
- Select questions and view their natural language form
- Compare gold SQL with generated SQL from different models
- Execute SQL queries against AWS Glue tables (see the sketch after this list)
- Visualize query results with appropriate charts
- Rate SQL correctness, result matching, user experience, and chart quality
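This README does not show how app.py executes those queries; a common pattern for running SQL against AWS Glue Data Catalog tables is Amazon Athena via boto3. The sketch below assumes that pattern, and the S3 output location and function name are illustrative:

```python
import time

import boto3

# Hypothetical helper: run SQL against a Glue Data Catalog database through
# Athena and return the raw result rows (the first row holds column headers).
def run_query(sql: str, database: str, output_s3: str) -> list[list[str]]:
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    while True:  # simplified polling; real code should bound the wait
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {qid} finished in state {state}")
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    return [[c.get("VarCharValue", "") for c in r["Data"]] for r in rows]
```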
The tool can extract images from the attached EML file:
- Extracts and saves all image attachments
- Displays images for visual reference
- Allows downloading of extracted images
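A minimal sketch of such extraction with Python's standard email module (the file and folder names are illustrative, not necessarily what app.py uses):

```python
import email
from email import policy
from pathlib import Path

def extract_images(eml_path: str, out_dir: str = "images") -> list[Path]:
    """Save every image attachment in the EML file to out_dir."""
    msg = email.message_from_bytes(Path(eml_path).read_bytes(),
                                   policy=policy.default)
    Path(out_dir).mkdir(exist_ok=True)
    saved = []
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            name = part.get_filename() or f"image_{len(saved)}.png"
            path = Path(out_dir) / name
            path.write_bytes(part.get_payload(decode=True))
            saved.append(path)
    return saved
```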
To protect the original data, the application:
- Creates a working copy of the CSV file
- Applies all edits to the working copy
- Offers the option to save changes back to the original file when finished
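The working-copy flow can be pictured as below; the file names are assumptions, since the actual CSV name comes from your eval data:

```python
import shutil

import pandas as pd

ORIGINAL = "evaluations.csv"         # assumed name of your eval CSV
WORKING = "evaluations.working.csv"  # assumed name of the working copy

# Create the working copy once; every edit targets it, not the original.
shutil.copyfile(ORIGINAL, WORKING)
df = pd.read_csv(WORKING)
df.loc[0, "UserRating"] = 5          # example edit from an evaluation form
df.to_csv(WORKING, index=False)

# "Save All Evaluations": copy the working file back over the original.
shutil.copyfile(WORKING, ORIGINAL)
```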
Evaluations are based on the following criteria:
- Correctness (0 or 1): Syntactic and logical correctness of the SQL
- ResultMatch (0 or 1): Whether outputs match expected results
- UserRating (1-5): Subjective rating of SQL quality
- VoiceUsed (0): Always set to 0 as per requirements
- ChartRating (1-5): Quality of chart representation
- AnalystChartChoice: Preferred chart type for data visualization
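For reference, one evaluation row could be modeled like this (an illustrative schema; the real CSV columns may use different names):

```python
from dataclasses import dataclass

# Illustrative schema for one evaluation row; the real CSV may differ.
@dataclass
class Evaluation:
    question_id: str
    model: str
    correctness: int           # 0 or 1: syntactic and logical correctness
    result_match: int          # 0 or 1: outputs match the expected results
    user_rating: int           # 1-5: subjective rating of SQL quality
    chart_rating: int          # 1-5: quality of the chart representation
    analyst_chart_choice: str  # preferred chart type, e.g. "bar" or "line"
    voice_used: int = 0        # always 0 per requirements

    def __post_init__(self) -> None:
        assert self.correctness in (0, 1) and self.result_match in (0, 1)
        assert 1 <= self.user_rating <= 5 and 1 <= self.chart_rating <= 5
        assert self.voice_used == 0
```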
To evaluate a question:
1. Navigate to the Evaluation page.
2. Select a question to evaluate.
3. Choose a model to review its generated SQL.
4. Enter the AWS Glue database name to execute queries.
5. Compare the gold-standard results with the model's results.
6. Fill in the evaluation criteria.
7. Save the evaluation and continue to the next question.
8. When finished, save all changes to the original CSV.
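Since chart quality is part of the rating, here is a minimal sketch of rendering results with Streamlit's built-in charts (the dataframe and chart menu are made up for illustration, not taken from app.py):

```python
import pandas as pd
import streamlit as st

# Show the query result as a table, then as a user-selected chart.
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "revenue": [120, 150, 90]})
st.dataframe(df)

chart = st.selectbox("Chart type", ["bar", "line"])
data = df.set_index("month")
if chart == "bar":
    st.bar_chart(data)
else:
    st.line_chart(data)
```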