Hi,
I'm running a job that does a union with a count on parquet tables.
Running this count with blaze takes 14 min, while it takes 21 s without blaze.
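Roughly, the job looks like this (table names are placeholders for the real inputs; `spark` is the active SparkSession):

```scala
// Placeholder tables standing in for the real inputs
val df = spark.table("table_a")
  .union(spark.table("table_b")) // DataFrame union == UNION ALL
println(df.count())
```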
Here is the config for blaze:
From the SQL plan I can see that we spend most of the time in Native.io_time total.
Also, the input bytes for this stage are around 130 GB with blaze and only 25 MB without blaze.
Here is the SQL plan with blaze:

And without:

It looks like we are scanning the whole parquet file to count the rows instead of relying on the footer metadata, which would explain the 130 GB vs 25 MB input difference.
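For comparison, a metadata-only count can be served from the parquet footers without touching any data pages. A minimal sketch with the parquet-hadoop API (the path is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.jdk.CollectionConverters._

// Placeholder path; this reads only the footer, not the data pages
val input  = HadoopInputFile.fromPath(new Path("/path/to/file.parquet"), new Configuration())
val reader = ParquetFileReader.open(input)
// Each row group's row count is stored in the footer metadata
val rowCount = reader.getFooter.getBlocks.asScala.map(_.getRowCount).sum
reader.close()
println(s"rows from footer metadata: $rowCount")
```

Summing the footers like this across all files should be on the order of the 25 MB of input we see without blaze.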
Am I understanding this correctly, or am I missing some config on the job? Also, it looks quite slow to read >100 MB in >40 s (in our Spark jobs, the same kind of action reading >100 MB from a parquet file takes around 20 s).