TPC-H localhost benchmarks for different scale factors #816

@MrPowers

I ran the TPC-H queries locally (MacBook M3 with 16 GB of RAM) at Scale Factor 40, and Sail's performance looks great!

Image

Sail works well for all scale factors up to 40. At SF80 it fails on my machine, on q4, with this error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/querybench/run_tpch.py", line 70, in <module>
    sail_res = querybench.sail.tpch_queries.run_benchmarks(spark).rename(columns={"duration": "sail"})
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/querybench/sail/tpch_queries.py", line 105, in run_benchmarks
    benchmark(q4, spark, benchmarks=benchmarks, name="q4")
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/querybench/helpers.py", line 24, in benchmark
    ret = f(dfs, **kwargs)
          ^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/querybench/sail/tpch_queries.py", line 18, in q4
    return spark.sql(querybench.queries.tpch.q4()).collect()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/dataframe.py", line 1778, in collect
    table, schema = self._to_table()
                    ^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/dataframe.py", line 1791, in _to_table
    table, schema, self._execution_info = self._session.client.to_table(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 925, in to_table
    table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(req, observations)
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1560, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1537, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1811, in _handle_error
    self._handle_rpc_error(error)
  File "/Users/matthewpowers/Documents/code/my_apps/querybench/.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1882, in _handle_rpc_error
    raise convert_exception(
pyspark.errors.exceptions.connect.IllegalArgumentException: invalid argument: operation not found: 6f6f15ff-e59d-42d8-8236-09230cc2462b

Here's the benchmark code in case you'd like to reproduce on your end: https://github.com/MrPowers/querybench
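
If it helps with reproducing, here is a minimal sketch of a timing wrapper that records a failure instead of aborting the whole run, so the queries after q4 still execute at SF80. It uses only the standard library. The names `timed_run`, `results`, and `failing_query` are hypothetical and are not taken from querybench; the failing function is just a stand-in for a query that raises.

```python
import time


def timed_run(fn, name, results):
    """Run fn() and store its wall-clock duration in seconds, or the error text, under name."""
    start = time.perf_counter()
    try:
        fn()
    except Exception as exc:  # e.g. an exception raised while collecting results
        results[name] = {"duration": None, "error": f"{type(exc).__name__}: {exc}"}
    else:
        results[name] = {"duration": time.perf_counter() - start, "error": None}
    return results[name]


def failing_query():
    # Hypothetical stand-in for a query that fails on the server side.
    raise ValueError("operation not found")


results = {}
timed_run(lambda: sum(range(1000)), "q1", results)  # succeeds, duration recorded
timed_run(failing_query, "q4", results)             # fails, error recorded, run continues
```

In a real run, the callable would presumably wrap something like `spark.sql(...).collect()`, as in the traceback above, so a failed query shows up as a recorded error next to the timings of the queries that did finish.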
