[Epic] Parquet Reader Improvement Plan / Proposal - July 2025 #8000

@alamb

Description

This is my attempt to summarize my proposal / plan for improving the Parquet reader in arrow-rs.

Summary

Background

The current arrow-rs reader is great, but has two major areas that I think should be improved:

  1. Speed / efficiency when evaluating pushdown predicates (RowFilter, aka late materialization, see Enable parquet filter pushdown (filter_pushdown) by default datafusion#3463 in DataFusion): [EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456
  2. Flexibility in how the reader interleaves IO and CPU work, so that users can better control when each happens during decode.

To make pushdown predicate evaluation fast, we need two things:

  1. A better representation for when predicate results are not clustered together (I think this is mostly a software engineering exercise; see Adaptive Parquet Predicate Pushdown Evaluation #5523 and the sketch after this list)
  2. Avoiding decoding the same column twice when it is needed for both a row filter and the output (this is more complex, as it trades memory for CPU)
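
To make the representation issue concrete, here is a minimal, self-contained sketch (the `SelectionRepr` type and `from_mask` function are hypothetical names, not arrow-rs APIs). Run-based selections, in the spirit of parquet's `RowSelection`, are compact when the selected rows are clustered, but degenerate into many tiny runs when matches are scattered, where a simple bitmap is cheaper:

```rust
/// Which rows of a row group passed the predicate.
#[derive(Debug)]
enum SelectionRepr {
    /// Alternating runs of (selected?, run length); compact when
    /// selected rows are clustered together.
    Runs(Vec<(bool, usize)>),
    /// One entry per row; better when results are scattered.
    Bitmap(Vec<bool>),
}

impl SelectionRepr {
    /// Build from a per-row filter result, choosing the smaller encoding.
    fn from_mask(mask: &[bool]) -> Self {
        let mut runs: Vec<(bool, usize)> = Vec::new();
        for &selected in mask {
            match runs.last_mut() {
                Some((last, len)) if *last == selected => *len += 1,
                _ => runs.push((selected, 1)),
            }
        }
        // Crude size heuristic: each run costs ~16 bytes vs ~1 byte per
        // row for this (unpacked) bitmap.
        if runs.len() * 16 <= mask.len() {
            SelectionRepr::Runs(runs)
        } else {
            SelectionRepr::Bitmap(mask.to_vec())
        }
    }
}

fn main() {
    // Clustered matches collapse into 2 runs; scattered matches would
    // need 1000 runs, so the bitmap wins.
    let clustered: Vec<bool> = (0..1000).map(|i| i < 500).collect();
    let scattered: Vec<bool> = (0..1000).map(|i| i % 2 == 0).collect();
    assert!(matches!(SelectionRepr::from_mask(&clustered), SelectionRepr::Runs(_)));
    assert!(matches!(SelectionRepr::from_mask(&scattered), SelectionRepr::Bitmap(_)));
}
```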

Hard-Coded Tradeoffs between IO, CPU and Memory

Avoiding double decoding requires buffering the results of predicate evaluation,
which is exactly what @XiangpengHao has done in #7850. However, the approach in that PR requires buffering all values for predicate columns within a row group.
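
As a rough illustration of what such buffering entails (all names here are hypothetical placeholders, not the implementation in #7850), a cache of decoded predicate columns might look like the following: decode once, reuse for the output projection, and pay for it in memory held for the row group:

```rust
use std::collections::HashMap;

/// Placeholder for a decoded column chunk (an Arrow array in practice).
type DecodedColumn = Vec<i64>;

#[derive(Default)]
struct PredicateColumnCache {
    /// Keyed by (row group index, column index).
    cache: HashMap<(usize, usize), DecodedColumn>,
}

impl PredicateColumnCache {
    /// Decode a column at most once; reuse the decoded values later.
    fn get_or_decode(
        &mut self,
        row_group: usize,
        column: usize,
        decode: impl FnOnce() -> DecodedColumn,
    ) -> &DecodedColumn {
        self.cache.entry((row_group, column)).or_insert_with(decode)
    }

    /// Memory held by the cache: the quantity this proposal tries to bound.
    fn buffered_values(&self) -> usize {
        self.cache.values().map(|c| c.len()).sum()
    }
}

fn main() {
    let mut cache = PredicateColumnCache::default();
    let decode_count = std::cell::Cell::new(0);
    // The filter pass and the output pass both ask for (row group 0, column 3).
    for _ in 0..2 {
        cache.get_or_decode(0, 3, || {
            decode_count.set(decode_count.get() + 1);
            vec![1, 2, 3]
        });
    }
    assert_eq!(decode_count.get(), 1); // decoded once, reused once...
    assert_eq!(cache.buffered_values(), 3); // ...at the cost of buffered memory
}
```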

Furthermore, we only added caching for the async reader (ParquetRecordBatchStream) because the sync reader (ParquetRecordBatchReader) evaluates filters across all row groups before decoding any data, which means that the cache would likely require too much memory to be practical in most cases.

After study, I don't think there is any way to reduce the cache memory required in either the async or the sync reader without changing the IO patterns. Specifically, the current async reader evaluates a filter across all rows in a row group before using a single subsequent IO to read the data for all rows which will be output. Optimizing for a single subsequent IO means selective filters can significantly reduce IO for later columns, but it also means that the cache must hold the filter results for all the rows in the row group, or the decoder must decode the column twice (once for the filter and once for the output), as it does today.
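
Schematically, the per-row-group flow described above looks something like this sketch (deliberately simplified and synchronous; `fetch` is a stand-in for one IO per column chunk, and all names are hypothetical):

```rust
/// Phase 1: one IO + decode of the predicate column for *all* rows in
/// the row group. Phase 2: a single subsequent IO for the output
/// column, decoding only the selected rows.
fn read_row_group(
    fetch: &mut dyn FnMut(&str) -> Vec<i64>, // one simulated IO per column chunk
    predicate: &dyn Fn(i64) -> bool,
) -> Vec<i64> {
    // Phase 1: evaluate the filter over the whole row group.
    let filter_col = fetch("predicate_column");
    let selection: Vec<usize> = filter_col
        .iter()
        .enumerate()
        .filter_map(|(i, v)| predicate(*v).then_some(i))
        .collect();

    // Phase 2: a single subsequent IO for the output column. Note that
    // if the predicate column is also in the output, it is decoded
    // *again* here unless its decoded values were cached, which is the
    // memory cost discussed above.
    let output_col = fetch("output_column");
    selection.iter().map(|&i| output_col[i]).collect()
}

fn main() {
    let mut fetch = |col: &str| match col {
        "predicate_column" => vec![1, 5, 2, 9],
        _ => vec![10, 50, 20, 90],
    };
    let out = read_row_group(&mut fetch, &|v| v > 2);
    assert_eq!(out, vec![50, 90]); // rows 1 and 3 survive the filter
}
```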

In addition to the new caching, I think users may wish to make different tradeoffs between IO and memory usage (a hypothetical configuration knob is sketched after this list):

  1. More IOs and less internal buffering (e.g when reading from a local SSD or an in memory cache)
  2. Fewer IOs and more internal buffering (e.g. when reading directly from remote object store)
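
One way to expose this choice would be a configuration option along these lines (purely hypothetical; no such option exists in arrow-rs today):

```rust
/// Hypothetical knob expressing the IO vs. buffering tradeoff above.
#[derive(Debug, Clone, Copy)]
enum IoTradeoff {
    /// Many small reads, minimal buffering: good when each IO is cheap
    /// (local SSD, in-memory cache).
    ManySmallReads { max_buffer_bytes: usize },
    /// Few coalesced reads, more buffering: good when each IO is
    /// expensive (remote object store).
    FewLargeReads,
}

fn main() {
    // Chosen per storage backend by the application.
    let local = IoTradeoff::ManySmallReads { max_buffer_bytes: 8 * 1024 * 1024 };
    let remote = IoTradeoff::FewLargeReads;
    println!("{local:?} vs {remote:?}");
}
```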

To reduce memory requirements for the cached reader, we will need to change the IO patterns. For the sync reader it may also be desirable to avoid decoding a column for the entire row group, and instead decode it in smaller chunks. However, in some cases this would result in more individual IO operations, which could be slower depending on the storage system.

To that end, I am more convinced than ever that it is time to create a "push decoder" that permits users to control the IO and CPU work separately (see #7983).
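
To give a flavor of what a push decoder's API could look like, here is a minimal sketch (all names hypothetical; #7983 tracks the actual design). The key property is that the decoder never performs IO itself: it tells the caller which byte ranges it needs, and the caller decides how and when to fetch them:

```rust
use std::ops::Range;

/// What the caller should do next (placeholder payloads; a real
/// decoder would yield Arrow RecordBatches).
enum DecodeResult {
    /// Fetch these byte ranges and push them in; the decoder does no IO.
    NeedsData(Vec<Range<u64>>),
    /// A decoded batch of rows is ready.
    Batch(Vec<u8>),
    /// All requested rows have been produced.
    Finished,
}

struct PushDecoder {
    /// Ranges the decoder still needs before it can make progress.
    pending: Vec<Range<u64>>,
    /// Bytes the caller has pushed so far.
    buffered: Vec<(Range<u64>, Vec<u8>)>,
    done: bool,
}

impl PushDecoder {
    fn new() -> Self {
        // In a real decoder the initial ranges come from the footer/metadata.
        PushDecoder { pending: vec![0..1024], buffered: Vec::new(), done: false }
    }

    /// CPU-only: record bytes the caller fetched.
    fn push_data(&mut self, range: Range<u64>, data: Vec<u8>) {
        self.pending.retain(|r| *r != range);
        self.buffered.push((range, data));
    }

    /// Advance decoding as far as the buffered bytes allow.
    fn try_decode(&mut self) -> DecodeResult {
        if !self.pending.is_empty() {
            return DecodeResult::NeedsData(self.pending.clone());
        }
        if !self.done {
            self.done = true;
            // ... real decoding of buffered pages would happen here ...
            return DecodeResult::Batch(Vec::new());
        }
        DecodeResult::Finished
    }
}

fn main() {
    // The caller owns the IO loop, so it can drive it however it likes.
    let mut decoder = PushDecoder::new();
    loop {
        match decoder.try_decode() {
            DecodeResult::NeedsData(ranges) => {
                for r in ranges {
                    let bytes = vec![0u8; (r.end - r.start) as usize]; // do IO here
                    decoder.push_data(r, bytes);
                }
            }
            DecodeResult::Batch(_batch) => { /* process the batch */ }
            DecodeResult::Finished => break,
        }
    }
}
```

Because the caller owns the IO loop, the same decoder could be driven synchronously, from an async runtime, or via io_uring, which is exactly the separation of IO and CPU work described above.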

Requests for reviewers:

  1. Do you think the proposal makes sense?
  2. Can you think of additional issues that I have missed?
  3. Are you willing to help implement it?
