[Epic] Parquet Reader Improvement Plan / Proposal - July 2025 #8000

@alamb

Description

This is my attempt to summarize my proposal / plan for improving the Parquet reader in arrow-rs.

Summary

Background

The current arrow-rs reader is great, but has two major areas that I think should be improved:

  1. Speed / efficiency when evaluating pushdown predicates (RowFilter, aka late materialization, see Enable parquet filter pushdown (filter_pushdown) by default datafusion#3463 in DataFusion): [EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456
  2. Flexibility in how the reader interleaves IO and CPU work, so that users can better control when each happens during decode.

To make pushdown predicate evaluation fast, we need two things:

  1. A better representation for when predicate results are not clustered together (I think this is mostly a software engineering exercise; see Adaptive Parquet Predicate Pushdown Evaluation #5523 and the sketch after this list)
  2. Avoiding decoding the same column twice when it is needed for both a row filter and the output (this is more complex, as it trades memory for CPU)
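
To make the representation issue concrete, here is a minimal, self-contained sketch (the `SelectionRepr` type and `from_mask` function are hypothetical names, not arrow-rs APIs). Run-based selections, in the spirit of parquet's `RowSelection`, are compact when the selected rows are clustered, but degenerate into many tiny runs when matches are scattered, where a simple bitmap is cheaper:

```rust
/// Which rows of a row group passed the predicate.
#[derive(Debug)]
enum SelectionRepr {
    /// Alternating runs of (selected?, run length); compact when
    /// selected rows are clustered together.
    Runs(Vec<(bool, usize)>),
    /// One entry per row; better when results are scattered.
    Bitmap(Vec<bool>),
}

impl SelectionRepr {
    /// Build from a per-row filter result, choosing the smaller encoding.
    fn from_mask(mask: &[bool]) -> Self {
        let mut runs: Vec<(bool, usize)> = Vec::new();
        for &selected in mask {
            match runs.last_mut() {
                Some((last, len)) if *last == selected => *len += 1,
                _ => runs.push((selected, 1)),
            }
        }
        // Crude size heuristic: each run costs ~16 bytes vs ~1 byte per
        // row for this (unpacked) bitmap.
        if runs.len() * 16 <= mask.len() {
            SelectionRepr::Runs(runs)
        } else {
            SelectionRepr::Bitmap(mask.to_vec())
        }
    }
}

fn main() {
    // Clustered matches collapse into 2 runs; scattered matches would
    // need 1000 runs, so the bitmap wins.
    let clustered: Vec<bool> = (0..1000).map(|i| i < 500).collect();
    let scattered: Vec<bool> = (0..1000).map(|i| i % 2 == 0).collect();
    assert!(matches!(SelectionRepr::from_mask(&clustered), SelectionRepr::Runs(_)));
    assert!(matches!(SelectionRepr::from_mask(&scattered), SelectionRepr::Bitmap(_)));
}
```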

Hard-Coded Tradeoffs between IO, CPU and Memory

Avoiding double decoding requires buffering the results of predicate evaluation,
which is exactly what @XiangpengHao has done in #7850. However, the approach in that PR requires buffering all values for predicate columns within a row group.
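
As a rough illustration of what such buffering entails (all names here are hypothetical placeholders, not the implementation in #7850), a cache of decoded predicate columns might look like the following: decode once, reuse for the output projection, and pay for it in memory held for the row group:

```rust
use std::collections::HashMap;

/// Placeholder for a decoded column chunk (an Arrow array in practice).
type DecodedColumn = Vec<i64>;

#[derive(Default)]
struct PredicateColumnCache {
    /// Keyed by (row group index, column index).
    cache: HashMap<(usize, usize), DecodedColumn>,
}

impl PredicateColumnCache {
    /// Decode a column at most once; reuse the decoded values later.
    fn get_or_decode(
        &mut self,
        row_group: usize,
        column: usize,
        decode: impl FnOnce() -> DecodedColumn,
    ) -> &DecodedColumn {
        self.cache.entry((row_group, column)).or_insert_with(decode)
    }

    /// Memory held by the cache: the quantity this proposal tries to bound.
    fn buffered_values(&self) -> usize {
        self.cache.values().map(|c| c.len()).sum()
    }
}

fn main() {
    let mut cache = PredicateColumnCache::default();
    let decode_count = std::cell::Cell::new(0);
    // The filter pass and the output pass both ask for (row group 0, column 3).
    for _ in 0..2 {
        cache.get_or_decode(0, 3, || {
            decode_count.set(decode_count.get() + 1);
            vec![1, 2, 3]
        });
    }
    assert_eq!(decode_count.get(), 1); // decoded once, reused once...
    assert_eq!(cache.buffered_values(), 3); // ...at the cost of buffered memory
}
```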

Furthermore, we only added caching for the async reader (ParquetRecordBatchStream) because the sync reader (ParquetRecordBatchReader) evaluates filters across all row groups before decoding any data, which means that the cache would likely require too much memory to be practical in most cases.

After study, I don't think there is any way to reduce the cache memory required in either the async or the sync reader without changing the IO patterns. Specifically, the current async reader evaluates a filter across all rows in a row group before using a single subsequent IO to read the data for all rows which will be output. Optimizing for a single subsequent IO means selective filters can significantly reduce IO for later columns, but it also means that the cache must hold the filter results for all the rows in the row group, or the decoder must decode the column twice (once for the filter and once for the output), as it does today.
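
Schematically, the per-row-group flow described above looks something like this sketch (deliberately simplified and synchronous; `fetch` is a stand-in for one IO per column chunk, and all names are hypothetical):

```rust
/// Phase 1: one IO + decode of the predicate column for *all* rows in
/// the row group. Phase 2: a single subsequent IO for the output
/// column, decoding only the selected rows.
fn read_row_group(
    fetch: &mut dyn FnMut(&str) -> Vec<i64>, // one simulated IO per column chunk
    predicate: &dyn Fn(i64) -> bool,
) -> Vec<i64> {
    // Phase 1: evaluate the filter over the whole row group.
    let filter_col = fetch("predicate_column");
    let selection: Vec<usize> = filter_col
        .iter()
        .enumerate()
        .filter_map(|(i, v)| predicate(*v).then_some(i))
        .collect();

    // Phase 2: a single subsequent IO for the output column. Note that
    // if the predicate column is also in the output, it is decoded
    // *again* here unless its decoded values were cached, which is the
    // memory cost discussed above.
    let output_col = fetch("output_column");
    selection.iter().map(|&i| output_col[i]).collect()
}

fn main() {
    let mut fetch = |col: &str| match col {
        "predicate_column" => vec![1, 5, 2, 9],
        _ => vec![10, 50, 20, 90],
    };
    let out = read_row_group(&mut fetch, &|v| v > 2);
    assert_eq!(out, vec![50, 90]); // rows 1 and 3 survive the filter
}
```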

In addition to the new caching, I think users may wish to make different tradeoffs between IO and memory usage (a hypothetical configuration knob is sketched after this list):

  1. More IOs and less internal buffering (e.g when reading from a local SSD or an in memory cache)
  2. Fewer IOs and more internal buffering (e.g. when reading directly from remote object store)
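
One way to expose this choice would be a configuration option along these lines (purely hypothetical; no such option exists in arrow-rs today):

```rust
/// Hypothetical knob expressing the IO vs. buffering tradeoff above.
#[derive(Debug, Clone, Copy)]
enum IoTradeoff {
    /// Many small reads, minimal buffering: good when each IO is cheap
    /// (local SSD, in-memory cache).
    ManySmallReads { max_buffer_bytes: usize },
    /// Few coalesced reads, more buffering: good when each IO is
    /// expensive (remote object store).
    FewLargeReads,
}

fn main() {
    // Chosen per storage backend by the application.
    let local = IoTradeoff::ManySmallReads { max_buffer_bytes: 8 * 1024 * 1024 };
    let remote = IoTradeoff::FewLargeReads;
    println!("{local:?} vs {remote:?}");
}
```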

To reduce memory requirements for the cached reader, we will need to change the IO patterns. For the sync reader it may also be desirable to avoid decoding a column for the entire row group, and instead decode it in smaller chunks. However, in some cases this would result in more individual IO operations, which could be slower depending on the storage system.

To that end, I am more convinced than ever that it is time to create a "push decoder" that permits users to control the IO and CPU work separately (see #7983).
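
To give a flavor of what a push decoder's API could look like, here is a minimal sketch (all names hypothetical; #7983 tracks the actual design). The key property is that the decoder never performs IO itself: it tells the caller which byte ranges it needs, and the caller decides how and when to fetch them:

```rust
use std::ops::Range;

/// What the caller should do next (placeholder payloads; a real
/// decoder would yield Arrow RecordBatches).
enum DecodeResult {
    /// Fetch these byte ranges and push them in; the decoder does no IO.
    NeedsData(Vec<Range<u64>>),
    /// A decoded batch of rows is ready.
    Batch(Vec<u8>),
    /// All requested rows have been produced.
    Finished,
}

struct PushDecoder {
    /// Ranges the decoder still needs before it can make progress.
    pending: Vec<Range<u64>>,
    /// Bytes the caller has pushed so far.
    buffered: Vec<(Range<u64>, Vec<u8>)>,
    done: bool,
}

impl PushDecoder {
    fn new() -> Self {
        // In a real decoder the initial ranges come from the footer/metadata.
        PushDecoder { pending: vec![0..1024], buffered: Vec::new(), done: false }
    }

    /// CPU-only: record bytes the caller fetched.
    fn push_data(&mut self, range: Range<u64>, data: Vec<u8>) {
        self.pending.retain(|r| *r != range);
        self.buffered.push((range, data));
    }

    /// Advance decoding as far as the buffered bytes allow.
    fn try_decode(&mut self) -> DecodeResult {
        if !self.pending.is_empty() {
            return DecodeResult::NeedsData(self.pending.clone());
        }
        if !self.done {
            self.done = true;
            // ... real decoding of buffered pages would happen here ...
            return DecodeResult::Batch(Vec::new());
        }
        DecodeResult::Finished
    }
}

fn main() {
    // The caller owns the IO loop, so it can drive it however it likes.
    let mut decoder = PushDecoder::new();
    loop {
        match decoder.try_decode() {
            DecodeResult::NeedsData(ranges) => {
                for r in ranges {
                    let bytes = vec![0u8; (r.end - r.start) as usize]; // do IO here
                    decoder.push_data(r, bytes);
                }
            }
            DecodeResult::Batch(_batch) => { /* process the batch */ }
            DecodeResult::Finished => break,
        }
    }
}
```

Because the caller owns the IO loop, the same decoder could be driven synchronously, from an async runtime, or via io_uring, which is exactly the separation of IO and CPU work described above.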

Requests for reviewers:

  1. Do you think the proposal makes sense?
  2. Can you think of additional issues that I have missed?
  3. Are you willing to help implement it?
