Split clickbench query set into one file per query #16476

pepijnve · 2025-06-20T12:01:16Z

Which issue does this PR close?

None

Rationale for this change

Clickbench query IDs are zero-based while most editors are one-based wrt line numbers. This causes a little bit of friction every time you want to check the query for a particular clickbench run.

There's precedent in the benchmark suite already for having one file per query rather than one file with a query per line. By having distinct files, the query id can be reflected in the filename making lookup trivial.

What changes are included in this PR?

Add a script to download the upstream queries.sql file from the clickbench repo and split it into one file per query
Adapt the clickbench benchmark code to read queries from individual files
Adjust parameters in bench.sh
Adapt the sql_planner benchmark code to read queries from individual files

Are these changes tested?

Manually tested

Are there any user-facing changes?

No

alamb

Thank you @pepijnve -- this looks great and I think will make a significant improvement to ease of use / rerunning these queries

I also tried running these queries with a local copy of hits and they work great

datafusion-cli -f ~/Software/datafusion2/benchmarks/queries/clickbench/queries/q23.sql

alamb · 2025-06-21T10:45:32Z

benchmarks/queries/clickbench/update_queries.sh

+set -e  # Exit on any error
+
+# URL for the raw file (not the GitHub page)
+URL="https://raw.githubusercontent.com/ClickHouse/ClickBench/main/datafusion/queries.sql"


I also ran this script locally and it works great for me

Thank you to Claude for that one. He's my buddy for quick and dirty scripts like this.

alamb · 2025-06-21T11:02:09Z

Actually, I tried running some tests manually and I got an error that seems to imply something is still looking for clickbench.queries:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ./benchmarks/bench.sh run clickbench_partitioned
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: clickbench_partitioned
DATAFUSION_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/..
BRANCH_NAME: clickbench_split
DATA_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/data
RESULTS_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE: /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json
Running clickbench (partitioned, 100 files) benchmark...
+ cargo run --release --bin dfbench -- clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion2/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql -o /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json
    Finished `release` profile [optimized] target(s) in 0.28s
     Running `/Users/andrewlamb/Software/datafusion2/target/release/dfbench clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion2/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql -o /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion2/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql", output_path: Some("/Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json") }
Error: Execution("Query path '/Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql' does not exist.")

Need to test a bit more

alamb · 2025-06-21T11:14:00Z

The same command ./benchmarks/bench.sh run clickbench_partitioned passes on main 🤔

pepijnve · 2025-06-21T13:32:09Z

I'll have a look. I ran this locally with a couple of the benchmarks and it worked fine for me, but I must have missed something.

pepijnve · 2025-06-21T13:34:42Z

I had run using a run config in my IDE. Forgot to update the paths in bench.sh. Last commit should fix that.

alamb

Thanks @pepijnve -- I tested it again locally and it worked great!

./benchmarks/bench.sh run clickbench_extended
./benchmarks/bench.sh run clickbench_partitioned
./benchmarks/bench.sh run clickbench_1

alamb · 2025-06-23T21:30:36Z

Thanks again @pepijnve

github-actions bot added the core Core DataFusion crate label Jun 20, 2025

pepijnve mentioned this pull request Jun 20, 2025

Make clickbench query IDs 1-based #16455

Closed

pepijnve force-pushed the clickbench_split branch 2 times, most recently from 672e22b to 3312d94 Compare June 20, 2025 12:32

Split clickbench query set into one file per query

9dc2ef2

pepijnve force-pushed the clickbench_split branch from 3312d94 to 9dc2ef2 Compare June 20, 2025 12:54

pepijnve mentioned this pull request Jun 20, 2025

Use Tokio's task budget consistently, better APIs to support task cancellation #16398

Merged

alamb previously approved these changes Jun 21, 2025

View reviewed changes

alamb reviewed Jun 21, 2025

View reviewed changes

Fix queries-path value in bench.sh

065227e

pepijnve mentioned this pull request Jun 21, 2025

Support query filter on all benchmarks #16477

Merged

alamb approved these changes Jun 22, 2025

View reviewed changes

alamb merged commit a862f7a into apache:main Jun 23, 2025
27 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Split clickbench query set into one file per query #16476

Split clickbench query set into one file per query #16476

Uh oh!

pepijnve commented Jun 20, 2025

Uh oh!

alamb left a comment

Uh oh!

alamb Jun 21, 2025

Uh oh!

pepijnve Jun 21, 2025

Uh oh!

alamb Jun 21, 2025

Uh oh!

alamb commented Jun 21, 2025

Uh oh!

alamb commented Jun 21, 2025

Uh oh!

pepijnve commented Jun 21, 2025

Uh oh!

pepijnve commented Jun 21, 2025

Uh oh!

alamb left a comment

Uh oh!

Uh oh!

alamb commented Jun 23, 2025

Uh oh!

Uh oh!

Split clickbench query set into one file per query #16476

Split clickbench query set into one file per query #16476

Uh oh!

Conversation

pepijnve commented Jun 20, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Jun 21, 2025

Choose a reason for hiding this comment

Uh oh!

pepijnve Jun 21, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Jun 21, 2025

Choose a reason for hiding this comment

Uh oh!

alamb commented Jun 21, 2025

Uh oh!

alamb commented Jun 21, 2025

Uh oh!

pepijnve commented Jun 21, 2025

Uh oh!

pepijnve commented Jun 21, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alamb commented Jun 23, 2025

Uh oh!

Uh oh!