-
Notifications
You must be signed in to change notification settings - Fork 118
Open
Description
Describe the bug
Hi team, I have encountered a performance issue when I run same query on a big table with datafusion comparing with DuckDB.
I will try to simplify my case and replicate the issue in my following codes.
To Reproduce
import timeit
import numpy as np
import pyarrow as pa
import datafusion
from datafusion import SessionContext
import duckdb
print(duckdb.__version__)
print(datafusion.__version__)
# prepare data
batches = 100000
names = list("abcdefghijklmnopqrstuvwxyz")
names = [n + m for n in names for m in names]
names_array = pa.concat_arrays([pa.array(names)] * batches)
values_array = pa.concat_arrays([pa.array(np.random.randint(1, 100, len(names))) for _ in range(batches)])
pa_table = pa.Table.from_arrays([names_array, values_array], names=["name", "value"])
# prepare query
sql = "select name, sum(value) as value FROM pa_table group by name;"
n_round = 10
# duckb
elapsed = timeit.timeit('duckdb.sql(sql).to_arrow_table()', number=n_round , globals=globals())
duckdb_per_round = elapsed / n_round
# datafusion
ctx = SessionContext()
_ = ctx.from_arrow(pa_table, "pa_table")
elapsed = timeit.timeit('ctx.sql(sql).to_arrow_table()', number=n_round , globals=globals())
datafusion_per_round= elapsed / n_round
# result
print(f"{'duckdb':<12}: {duckdb_per_round * 1000:.2f}ms")
print(f"{'datafusion':<12}: {datafusion_per_round * 1000:.2f}ms")
the output will look like:
1.3.1
47.0.0
duckdb : 152.15ms
datafusion : 1002.04ms
Expected behavior
No response
Additional context
No response
Metadata
Metadata
Assignees
Labels
No labels