-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Split clickbench query set into one file per query #16476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
672e22b
to
3312d94
Compare
3312d94
to
9dc2ef2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @pepijnve -- this looks great and I think will make a significant improvement to ease of use / rerunning these queries
I also tried running these queries with a local copy of hits and they work great
datafusion-cli -f ~/Software/datafusion2/benchmarks/queries/clickbench/queries/q23.sql
set -e # Exit on any error | ||
|
||
# URL for the raw file (not the GitHub page) | ||
URL="https://raw.githubusercontent.com/ClickHouse/ClickBench/main/datafusion/queries.sql" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also ran this script locally and it works great for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you to Claude for that one. He's my buddy for quick and dirty scripts like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤖
Actually, I tried running some tests manually and I got an error that seems to imply something is still looking for clickbench.queries: (venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ./benchmarks/bench.sh run clickbench_partitioned
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: clickbench_partitioned
DATAFUSION_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/..
BRANCH_NAME: clickbench_split
DATA_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/data
RESULTS_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE: /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json
Running clickbench (partitioned, 100 files) benchmark...
+ cargo run --release --bin dfbench -- clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion2/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql -o /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json
Finished `release` profile [optimized] target(s) in 0.28s
Running `/Users/andrewlamb/Software/datafusion2/target/release/dfbench clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion2/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql -o /Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion2/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql", output_path: Some("/Users/andrewlamb/Software/datafusion2/benchmarks/results/clickbench_split/clickbench_partitioned.json") }
Error: Execution("Query path '/Users/andrewlamb/Software/datafusion2/benchmarks/queries/clickbench/queries.sql' does not exist.") |
The same command |
I'll have a look. I ran this locally with a couple of the benchmarks and it worked fine for me, but I must have missed something. |
I had run using a run config in my IDE. Forgot to update the paths in bench.sh. Last commit should fix that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pepijnve -- I tested it again locally and it worked great!
./benchmarks/bench.sh run clickbench_extended
./benchmarks/bench.sh run clickbench_partitioned
./benchmarks/bench.sh run clickbench_1
Thanks again @pepijnve |
Which issue does this PR close?
None
Rationale for this change
Clickbench query IDs are zero-based while most editors are one-based wrt line numbers. This causes a little bit of friction every time you want to check the query for a particular clickbench run.
There's precedent in the benchmark suite already for having one file per query rather than one file with a query per line. By having distinct files, the query id can be reflected in the filename making lookup trivial.
What changes are included in this PR?
Are these changes tested?
Manually tested
Are there any user-facing changes?
No