
Add microbenchmark for spilling with compression #16512


Draft
wants to merge 2 commits into
base: main


ding-young (Contributor) commented Jun 23, 2025

Which issue does this PR close?

What changes are included in this PR?

This PR adds microbenchmarks that compare the performance characteristics of different compression codecs. Each case generates 50 RecordBatches and runs both a write and a read. After each run, it prints the compression ratio (mem_bytes / disk_bytes).
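As a minimal illustration of the ratio described above, the sketch below computes mem_bytes / disk_bytes and formats one report line. It is not the code in this PR: the function names `compression_ratio` and `report_line`, and the output format, are hypothetical and chosen only for this example.

```rust
/// Ratio of in-memory size to on-disk size (mem_bytes / disk_bytes).
/// Values above 1.0 mean the spill file is smaller than the batches
/// it was written from. Returns `None` when nothing was written, to
/// avoid dividing by zero.
/// NOTE: hypothetical helper, not part of the benchmark in this PR.
fn compression_ratio(mem_bytes: u64, disk_bytes: u64) -> Option<f64> {
    if disk_bytes == 0 {
        return None;
    }
    Some(mem_bytes as f64 / disk_bytes as f64)
}

/// Formats one report line for a benchmark case and codec.
/// `case` and `codec` are free-form labels (assumed, for illustration).
fn report_line(case: &str, codec: &str, mem_bytes: u64, disk_bytes: u64) -> String {
    match compression_ratio(mem_bytes, disk_bytes) {
        Some(ratio) => format!(
            "{} [{}]: mem={} B, disk={} B, ratio={:.3}",
            case, codec, mem_bytes, disk_bytes, ratio
        ),
        None => format!("{} [{}]: nothing written", case, codec),
    }
}
```

Returning `Option` rather than a bare `f64` keeps an empty spill file from showing up as `inf` or `NaN` in the printed output.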

To make the benchmark more realistic, this PR generates RecordBatches that resemble the data spilled by AggregateExec (tpc-h) and SortExec (sort-tpch). It covers both thin batches consisting of primitive arrays and wide batches with complex data types.

Rationale for this change

Benchmark Case Overview

Below are the RecordBatch schema and the original query for each benchmark case.

  • Q2 [Int64(partkey), Decimal128(min(ps_supplycost))]
select
...
    p_partkey,
...
where
        p_partkey = ps_partkey
...
  and ps_supplycost = (
    select
        min(ps_supplycost)
  • Q16 [Utf8(p_brand), Utf8(p_type), Int32(p_size), Int64(supplier_cnt)]
select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
...
group by
    p_brand, p_type, p_size
... 
  • Q20 [Int64(suppkey), Int64(partkey), Decimal128(sum(l_quantity))]
... select
            ps_suppkey
        from
            partsupp
        where
                ps_partkey in (
                select
                    p_partkey
                    ...
            )
          and ps_availqty > (
            select
                    0.5 * sum(l_quantity)
  • Sort-tpch Q10 wide [Int32, Int64 * 3, Decimal128 * 4, Date * 3, Utf8 * 4]
SELECT l_orderkey, l_suppkey, l_linenumber, l_comment,
       l_partkey, l_quantity, l_extendedprice, l_discount, l_tax,
       l_returnflag, l_linestatus, l_shipdate, l_commitdate,
       l_receiptdate, l_shipinstruct, l_shipmode
FROM lineitem
ORDER BY l_orderkey, l_suppkey, l_linenumber, l_comment

Are these changes tested?

Are there any user-facing changes?

github-actions bot added the physical-plan (Changes to the physical-plan crate) label Jun 23, 2025
ding-young (Contributor, Author) commented Jun 23, 2025

To run the benchmark: cargo bench --bench spill_io

Results

| Case | Compression | Time (ms) | Memory (MB) | Disk (MB) | Compression Ratio |
|---|---|---|---|---|---|
| Q2 | Uncompressed | 51.521 | 9.4 | 9.5 | 0.990 |
| Q2 | Zstd | 147.360 | 9.4 | 1.5461 | 6.215 |
| Q2 | Lz4Frame | 97.942 | 9.4 | 3.2 | 2.922 |
| Q16 | Uncompressed | 78.053 | 23.5 | 19.5 | 1.209 |
| Q16 | Zstd | 236.480 | 23.5 | 4.4 | 5.373 |
| Q16 | Lz4Frame | 145.180 | 23.5 | 7.8 | 3.007 |
| Q20 | Uncompressed | 64.233 | 12.5 | 12.7 | 0.989 |
| Q20 | Zstd | 190.570 | 12.5 | 2.4 | 5.282 |
| Q20 | Lz4Frame | 123.430 | 12.5 | 4.8 | 2.629 |
| Wide (Q10) | Uncompressed | 215.220 | 56.4 | 54.2 | 1.041 |
| Wide (Q10) | Zstd | 443.190 | 56.4 | 11.6 | 4.857 |
| Wide (Q10) | Lz4Frame | 255.530 | 56.4 | 19.9 | 2.834 |
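
One way to read these numbers is to compare each codec against the uncompressed run of the same case: how much longer the write and read took, and what fraction of disk space was saved. The sketch below does that arithmetic for the Q16 rows only, using the values reported above. The `Row` struct and the helper names are hypothetical, and the figures come from a single reported run, so the derived ratios are indicative rather than precise.

```rust
/// One row of the results, copied from the reported Q16 numbers.
/// NOTE: illustrative type, not part of the benchmark in this PR.
#[allow(dead_code)]
struct Row {
    codec: &'static str,
    time_ms: f64,
    disk_mb: f64,
}

/// The three Q16 rows: uncompressed baseline first, then the codecs.
fn q16_rows() -> [Row; 3] {
    [
        Row { codec: "Uncompressed", time_ms: 78.053, disk_mb: 19.5 },
        Row { codec: "Zstd", time_ms: 236.480, disk_mb: 4.4 },
        Row { codec: "Lz4Frame", time_ms: 145.180, disk_mb: 7.8 },
    ]
}

/// How many times longer `other` took than `baseline`.
fn slowdown(baseline: &Row, other: &Row) -> f64 {
    other.time_ms / baseline.time_ms
}

/// Fraction of the baseline's disk usage that `other` avoids writing.
fn disk_saved_fraction(baseline: &Row, other: &Row) -> f64 {
    1.0 - other.disk_mb / baseline.disk_mb
}
```

For Q16 this works out to roughly 3.0x the uncompressed time for Zstd with about 77% less disk, and roughly 1.9x the time for Lz4Frame with about 60% less disk.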
