TPC-H localhost benchmarks for different scale factors

I ran the TPC-H queries locally (MacBook M3 with 16GB of RAM) at Scale Factor 40, and Sail's performance looks great!

<img width="551" height="435" alt="Image" src="https://github.com/user-attachments/assets/ef2fa8fd-9820-4f78-aba9-5b6ae9fce85e" />

Sail works well for all scale factors up to 40. At SF80 it fails on my machine while running q4, with this error:

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/querybench/run_tpch.py", line 70, in <module>
    sail_res = querybench.sail.tpch_queries.run_benchmarks(spark).rename(columns={"duration": "sail"})
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/querybench/sail/tpch_queries.py", line 105, in run_benchmarks
    benchmark(q4, spark, benchmarks=benchmarks, name="q4")
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/querybench/helpers.py", line 24, in benchmark
    ret = f(dfs, **kwargs)
          ^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/querybench/sail/tpch_queries.py", line 18, in q4
    return spark.sql(querybench.queries.tpch.q4()).collect()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/dataframe.py", line 1778, in collect
    table, schema = self._to_table()
                    ^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/dataframe.py", line 1791, in _to_table
    table, schema, self._execution_info = self._session.client.to_table(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 925, in to_table
    table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(req, observations)
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1560, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1537, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1811, in _handle_error
    self._handle_rpc_error(error)
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1882, in _handle_rpc_error
    raise convert_exception(
pyspark.errors.exceptions.connect.IllegalArgumentException: invalid argument: operation not found: 6f6f15ff-e59d-42d8-8236-09230cc2462b
```
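To see whether only q4 fails at SF80 or other queries do too, it might help to record each query's error instead of letting the first exception abort the run. Below is a minimal sketch of that idea. `timed_run`, `failing_query`, and the `results` layout are hypothetical names for illustration, not part of querybench, and the stand-in queries don't touch Spark or Sail.

```python
import time
from typing import Any, Callable, Dict


def timed_run(name: str, query: Callable[[], Any], results: Dict[str, dict]) -> None:
    """Run one query, recording either its duration or the error it raised."""
    start = time.perf_counter()
    try:
        query()
        results[name] = {"duration": time.perf_counter() - start, "error": None}
    except Exception as exc:
        # Record the failure and let the caller move on to the next query.
        results[name] = {"duration": None, "error": f"{type(exc).__name__}: {exc}"}


def failing_query() -> None:
    # Stand-in for a query that raises, like q4 at SF80 in the traceback above.
    raise RuntimeError("operation not found")


results: Dict[str, dict] = {}
timed_run("q1", lambda: sum(range(1000)), results)  # stand-in for a passing query
timed_run("q4", failing_query, results)

for name, outcome in results.items():
    print(name, outcome)
```

In the real benchmark the callable would presumably wrap something like `spark.sql(querybench.queries.tpch.q4()).collect()`, as seen in the traceback.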

Here's the benchmark code in case you'd like to reproduce on your end: https://github.com/MrPowers/querybench
