A reproducible benchmark to compare:
- NL → Swan code generation
- Swan Inline (single-line infill)
across:
- Azure OpenAI models (e.g., GPT-4o, o1/o3 via Azure Chat Completions)
- Open-source HF local models (e.g., Qwen3), with optional Grammar-Constrained Decoding (GCD) and Qwen “thinking mode.”
The runner:
- Loads prompt variants from disk (`prompts/<task>/*.txt`)
- Lets you select prompts by name/glob or use all
- Creates timestamped run folders with raw results and a single unified `summary.json`
- Records per-call metadata (e.g., GCD used, thinking mode)
```bash
# 1) Create venv + install
python -m venv .venv
source .venv/bin/activate        # Windows: .\.venv\Scripts\Activate.ps1
pip install -U pip
pip install -r requirements.txt  # transformers, openai, etc.

# 2) Edit your models (Azure + HF local)
#    configs/models.yaml

# 3) Run with all prompts
python -m bench.benchmark_swan \
  --data data/example/examples.json \
  --models configs/models.yaml \
  --tasks nl2swan,inline \
  --prompts-dir ./prompts \
  --prompt-select-nl2swan all \
  --prompt-select-inline all \
  --output results \
  --num-samples -1
```
Open the printed run folder and check:
- `results.csv` / `results.jsonl` (per-generation rows)
- `summary.json` (unified report grouped by model × task × prompt, with GCD/Thinking usage)
```text
swan-generation/
├─ bench/
│  ├─ benchmark_swan.py   # main CLI
│  ├─ adapters/
│  │  ├─ azure.py         # Azure OpenAI (Chat Completions)
│  │  └─ hf_local.py      # HF local (Qwen, GCD, thinking mode)
│  ├─ tasks/
│  │  ├─ nl2swan.py       # NL → Swan task
│  │  └─ inline.py        # Swan inline (single missing line)
│  ├─ eval.py             # exact, token-F1, edit-sim metrics
│  ├─ prompts_loader.py   # loads prompts by name/glob
│  ├─ utils.py            # helpers (code-fence extraction, timestamps)
│  └─ __init__.py
├─ configs/
│  └─ models.yaml         # model endpoints & options
├─ data/
│  ├─ processed/
│  │  └─ train_all_in_one__swan_inline_with_descriptions.json
│  └─ example/
│     └─ examples.json
├─ prompts/
│  ├─ nl2swan/            # one .txt per prompt variant
│  └─ inline/
├─ grammars/
│  └─ grammar_function_node.ebnf   # optional GCD grammar
├─ results/               # per-run folders (gitignored)
├─ requirements.txt
└─ README.md
```
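`bench/eval.py` is listed as providing exact-match, token-F1, and edit-similarity metrics. As a rough illustration of what these typically compute (an assumption for orientation, not necessarily the repository's exact implementation):

```python
# Illustrative metric definitions (assumed; see bench/eval.py for the real ones).
from difflib import SequenceMatcher


def exact_match(pred: str, ref: str) -> float:
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(pred.strip() == ref.strip())


def token_f1(pred: str, ref: str) -> float:
    """F1 over whitespace tokens, counting multiset overlap."""
    pred_toks, ref_toks = pred.split(), ref.split()
    pool, common = list(ref_toks), 0
    for tok in pred_toks:
        if tok in pool:
            pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(ref_toks)
    return 2 * precision * recall / (precision + recall)


def edit_sim(pred: str, ref: str) -> float:
    """Character-level similarity in [0, 1] via difflib's ratio()."""
    return SequenceMatcher(None, pred, ref).ratio()
```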
Edit `configs/models.yaml` to add your Azure OpenAI endpoints and/or local HF models:
```yaml
- name: azure-gpt4o
  provider: azure-openai
  model: GPT-4o                        # Azure deployment name
  api_base: https://<your-endpoint>.openai.azure.com/
  api_version: 2024-05-01-preview
  api_key: <YOUR_AZURE_OPENAI_KEY>

# Qwen3 without thinking
- name: qwen3-4b
  provider: hf-local
  model: Qwen/Qwen3-4B-Instruct
  temperature: 0.0
  top_p: 1.0
  do_sample: false
  max_new_tokens: 256
  use_gcd: false
  enable_thinking: false               # Qwen thinking mode off

# Qwen3 with thinking
- name: qwen3-8b-thinking
  provider: hf-local
  model: Qwen/Qwen3-8B-Instruct
  temperature: 0.0
  top_p: 1.0
  do_sample: false
  max_new_tokens: 256
  use_gcd: true
  grammar_path: ./grammars/grammar_function_node.ebnf
  start_rule: root
  gcd_tasks: [nl2swan]
  enable_thinking: true                # Qwen thinking mode on
```
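For orientation, the Azure fields above map onto the official `openai` Python SDK roughly as follows. This is a minimal sketch with placeholder values, not necessarily what `bench/adapters/azure.py` does:

```python
# Minimal sketch of an Azure OpenAI Chat Completions call using the fields from
# configs/models.yaml (assumes the openai>=1.x Python SDK; messages are illustrative).
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-endpoint>.openai.azure.com/",  # api_base
    api_version="2024-05-01-preview",                            # api_version
    api_key="<YOUR_AZURE_OPENAI_KEY>",                           # api_key
)

response = client.chat.completions.create(
    model="GPT-4o",  # Azure deployment name (the `model` field above)
    messages=[
        {"role": "system", "content": "You generate Swan code."},
        {"role": "user", "content": "<prompt variant + task input>"},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```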
Place prompt variants as simple text files:
```text
prompts/
├─ nl2swan/
│  ├─ base.txt
│  ├─ swan_definitions.md
│  ├─ swan_diagrams.md
│  └─ swan_operator.md
└─ inline/
   ├─ 01_minimal.txt
   ├─ 02_exact_one_line.txt
   └─ 03_no_fences.txt
```
- You can select all, a list by name, or globs per task at run time.
- Empty files and dot/underscore-prefixed files are skipped.
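A loader implementing these rules might look roughly like the sketch below; it mirrors the documented behavior but is not `bench/prompts_loader.py` itself, and the function name and signature are hypothetical:

```python
# Sketch of the documented selection rules: read prompts/<task>/*, skip empty and
# dot/underscore-prefixed files, and resolve "all", name lists, or globs.
from fnmatch import fnmatch
from pathlib import Path


def load_prompts(prompts_dir: str, task: str, select: str = "all") -> dict:
    task_dir = Path(prompts_dir) / task
    candidates = [
        p for p in sorted(task_dir.iterdir())
        if p.is_file()
        and not p.name.startswith((".", "_"))
        and p.read_text(encoding="utf-8").strip()  # skip empty files
    ]
    if select.strip().lower() == "all":
        chosen = candidates
    else:
        patterns = [s.strip() for s in select.split(",") if s.strip()]
        chosen = [
            p for p in candidates
            if any(fnmatch(p.name, pat) or p.stem == pat for pat in patterns)
        ]
    return {p.stem: p.read_text(encoding="utf-8") for p in chosen}
```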
Common flags:

- `--data`: path to the JSON dataset
- `--models`: path to the models config YAML
- `--tasks`: comma-separated tasks to run (`nl2swan`, `inline`)
- `--prompts-dir`: path to the prompts folder
- `--prompt-select-<task>`:
  - `all` to use all prompts for the task
  - a comma-separated list of prompt names (without `.txt`)
  - globs (e.g., `*_def*.txt`)
- `--num-samples`: number of samples to run per model × task × prompt (default: `-1` = all)
- `--qwen-thinking`: `keep | only | off` (default: `keep`) to override `enable_thinking` per model
All prompts for both tasks:
```bash
python -m bench.benchmark_swan \
  --data data/processed/train_all_in_one__swan_inline_with_descriptions.json \
  --models configs/models.yaml \
  --tasks nl2swan,inline \
  --prompts-dir ./prompts \
  --prompt-select-nl2swan all \
  --prompt-select-inline all \
  --output results \
  --num-samples -1
```
Pick specific prompt files (order preserved):
```bash
python -m bench.benchmark_swan \
  --data data/example/examples.json \
  --models configs/models.yaml \
  --tasks nl2swan \
  --prompts-dir ./prompts \
  --prompt-select-nl2swan "swan_operator.md, base.txt"
```
Enable GCD in an HF local model entry:
```yaml
use_gcd: true
grammar_path: ./grammars/grammar_function_node.ebnf
start_rule: root
gcd_tasks: [nl2swan]
```
- The adapter passes the HF tokenizer to `IncrementalGrammarConstraint`.
- We rebuild the logits processor per generation (avoids state carry-over); see the sketch below.
- Ensure your prompt style matches the grammar’s language (e.g., avoid forcing markdown fences if the grammar doesn’t allow backticks).
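Assuming the adapter builds on the `transformers-cfg` package (which provides `IncrementalGrammarConstraint`), the per-generation setup looks roughly like this; function and variable names are illustrative:

```python
# Sketch of grammar-constrained decoding with the transformers-cfg package
# (assumed); the constraint and logits processor are rebuilt for every call
# so no parser state carries over between generations.
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

model_id = "Qwen/Qwen3-4B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

with open("grammars/grammar_function_node.ebnf", encoding="utf-8") as f:
    grammar_text = f.read()


def generate_constrained(prompt: str) -> str:
    # Fresh constraint + processor per generation (avoids state carry-over).
    constraint = IncrementalGrammarConstraint(grammar_text, "root", tokenizer)
    processor = GrammarConstrainedLogitsProcessor(constraint)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        logits_processor=[processor],
    )
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```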
How do you know GCD was used?

- Check the `results.csv` columns `gcd_used`, `grammar_path`, `start_rule`.
- Check the aggregated counts in `summary.json`.
Each run writes a timestamped folder:
```text
results/2025-08-25_14-31-09__tasks-nl2swan-inline__n6/
├─ run_info.json   # manifest: prompts used, models (sanitized), paths…
├─ results.csv     # per (model × task × prompt × sample) row
├─ results.jsonl   # same, as JSONL
└─ summary.json    # ONE unified report (metrics + run info + GCD/thinking usage)
```
Each row in `results.csv` / `results.jsonl` contains:

- `model`, `task`, `prompt_name`, `sample_id`, `name`
- `pred`, `ref`
- metrics: `exact_match`, `token_f1` (NL → Swan only), `edit_sim`
- perf: `latency_ms`
- meta: `gcd_used`, `thinking_enabled` (if applicable)
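For a quick look at a run, the per-row file can be aggregated with pandas. This is a usage sketch based on the columns listed above; the run-folder path is a placeholder:

```python
# Sketch: aggregate one run's results.csv by model × task × prompt.
# Column names come from the list above; token_f1 may be absent for inline-only
# runs, and gcd_used is assumed to be stored as a boolean/0-1 value.
import pandas as pd

df = pd.read_csv("results/<run-folder>/results.csv")

report = (
    df.groupby(["model", "task", "prompt_name"])
      .agg(
          n=("sample_id", "count"),
          exact=("exact_match", "mean"),
          edit_sim=("edit_sim", "mean"),
          latency_ms=("latency_ms", "mean"),
          gcd_ratio=("gcd_used", "mean"),
      )
      .reset_index()
)
print(report.to_string(index=False))
```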
`summary.json` contains:

- `run` (paths, tasks, prompts used)
- `gcd_models` (HF local model GCD settings)
- `counts` (rows, models, tasks, prompts, samples)
- `grouped`:
  - `by_model_task_prompt`
  - `by_task_prompt_model`
  - `by_model_task`
  - `by_prompt`
Each grouped row includes means for `exact`, `token_f1` (if present), `edit_sim`, `latency_ms`, plus:

- `gcd_used@any`, `gcd_used@ratio`
- `thinking_enabled@any`, `thinking_enabled@ratio`
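The unified report is plain JSON, so it can also be inspected directly. A small sketch; the exact layout inside `grouped` is assumed from the description above:

```python
# Sketch: load summary.json and print the per-(model, task, prompt) aggregates.
# The exact structure inside "grouped" is an assumption, so both dict and list
# layouts are handled.
import json

with open("results/<run-folder>/summary.json", encoding="utf-8") as f:
    summary = json.load(f)

print("counts:", summary["counts"])

grouped = summary["grouped"]["by_model_task_prompt"]
rows = grouped.values() if isinstance(grouped, dict) else grouped
for row in rows:
    print(row)
```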