Skip to content

Split clickbench query set into one file per query #16476

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 23, 2025

Conversation

pepijnve
Copy link
Contributor

Which issue does this PR close?

None

Rationale for this change

Clickbench query IDs are zero-based while most editors are one-based wrt line numbers. This causes a little bit of friction every time you want to check the query for a particular clickbench run.

There's precedent in the benchmark suite already for having one file per query rather than one file with a query per line. By having distinct files, the query id can be reflected in the filename making lookup trivial.

What changes are included in this PR?

  • Add a script to download the upstream queries.sql file from the clickbench repo and split it into one file per query
  • Adapt the clickbench benchmark code to read queries from individual files
  • Adjust parameters in bench.sh
  • Adapt the sql_planner benchmark code to read queries from individual files

Are these changes tested?

Manually tested

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Jun 20, 2025
@pepijnve pepijnve force-pushed the clickbench_split branch 2 times, most recently from 672e22b to 3312d94 Compare June 20, 2025 12:32
alamb
alamb previously approved these changes Jun 21, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @pepijnve -- this looks great and I think will make a significant improvement to ease of use / rerunning these queries

I also tried running these queries with a local copy of hits and they work great

datafusion-cli -f ~/Software/datafusion2/benchmarks/queries/clickbench/queries/q23.sql

set -e # Exit on any error

# URL for the raw file (not the GitHub page)
URL="https://raw.githubusercontent.com/ClickHouse/ClickBench/main/datafusion/queries.sql"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also ran this script locally and it works great for me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you to Claude for that one. He's my buddy for quick and dirty scripts like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖

@alamb
Copy link
Contributor

alamb commented Jun 21, 2025

Actually, I tried running some tests manually and I got an error that seems to imply something is still looking for clickbench.queries:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ./benchmarks/bench.sh run clickbench_partitioned
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: clickbench_partitioned
DATAFUSION_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/..
BRANCH_NAME: clickbench_split
DATA_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/data
RESULTS_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE: /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json
Running clickbench (partitioned, 100 files) benchmark...
+ cargo run --release --bin dfbench -- clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion2/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql -o /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json
    Finished `release` profile [optimized] target(s) in 0.28s
     Running `/Users/andrewlamb/Software/datafusion2/target/release/dfbench clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion2/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql -o /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion2/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql", output_path: Some("/Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json") }
Error: Execution("Query path '/Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql' does not exist.")

@alamb alamb dismissed their stale review June 21, 2025 11:02

Need to test a bit more

@alamb
Copy link
Contributor

alamb commented Jun 21, 2025

The same command ./benchmarks/bench.sh run clickbench_partitioned passes on main 🤔

@pepijnve
Copy link
Contributor Author

I'll have a look. I ran this locally with a couple of the benchmarks and it worked fine for me, but I must have missed something.

@pepijnve
Copy link
Contributor Author

I had run using a run config in my IDE. Forgot to update the paths in bench.sh. Last commit should fix that.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pepijnve -- I tested it again locally and it worked great!

./benchmarks/bench.sh run clickbench_extended
./benchmarks/bench.sh run clickbench_partitioned
./benchmarks/bench.sh run clickbench_1

@alamb alamb merged commit a862f7a into apache:main Jun 23, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 23, 2025

Thanks again @pepijnve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants