Skip to content

Some group by query is 6~7x slower than DuckDB #1186

@wegamekinglc

Description

@wegamekinglc

Describe the bug

Hi team, I have encountered a performance issue when I run same query on a big table with datafusion comparing with DuckDB.

I will try to simplify my case and replicate the issue in my following codes.

To Reproduce

import timeit
import numpy as np
import pyarrow as pa
import datafusion
from datafusion import SessionContext
import duckdb

print(duckdb.__version__)
print(datafusion.__version__)

# prepare data

batches = 100000

names = list("abcdefghijklmnopqrstuvwxyz")
names = [n + m for n in names for m in names]

names_array = pa.concat_arrays([pa.array(names)] * batches)
values_array = pa.concat_arrays([pa.array(np.random.randint(1, 100, len(names))) for _ in range(batches)])

pa_table = pa.Table.from_arrays([names_array, values_array], names=["name", "value"])

# prepare query
sql = "select name, sum(value) as value FROM pa_table group by name;"
n_round = 10

# duckb
elapsed = timeit.timeit('duckdb.sql(sql).to_arrow_table()', number=n_round , globals=globals())
duckdb_per_round = elapsed  / n_round

# datafusion
ctx = SessionContext()
_ = ctx.from_arrow(pa_table, "pa_table")
elapsed  = timeit.timeit('ctx.sql(sql).to_arrow_table()', number=n_round , globals=globals())
datafusion_per_round= elapsed  / n_round

# result
print(f"{'duckdb':<12}: {duckdb_per_round * 1000:.2f}ms")
print(f"{'datafusion':<12}: {datafusion_per_round * 1000:.2f}ms")

the output will look like:

1.3.1
47.0.0
duckdb      : 152.15ms
datafusion  : 1002.04ms

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions