Suggestions and problems about `ArrowReaderBuilder` (or `ParquetRecordBatchStreamBuilder`)
#4674
Replies: 3 comments 4 replies
-
ArrowReaderBuilder reads and provides access to the ParquetMetaData, including the page index if you enable it. I would recommend checking out DataFusion's ParquetExec, which shows how these APIs can be used.

I'm not sure why you got this impression, but it is not true. If you provide a RowSelection, derived from the page index or otherwise, it will use this to elide IO and decode.

Note: I do hope to provide better APIs for interacting with the parquet statistics in the future (#4328), but I've not had sufficient bandwidth lately.
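To make "elide IO and decode" a bit more concrete, here is a minimal std-only sketch of the idea, not the parquet crate's implementation. `Run` and `Page` are hypothetical stand-ins (the crate's own selection type is `RowSelection`), and the page boundaries are assumed to be known already, for example from a page index. A selection is a list of skip/select runs, and only the pages that overlap a selected run need to be fetched and decoded.

```rust
/// Hypothetical stand-in for one run of a row selection:
/// skip or select `n` consecutive rows.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Hypothetical page descriptor: first row and row count within a row group.
#[derive(Debug, Clone, Copy)]
struct Page {
    first_row: usize,
    num_rows: usize,
}

/// Returns the indices of the pages that contain at least one selected row.
/// Pages not returned never need to be read or decoded.
fn pages_to_fetch(selection: &[Run], pages: &[Page]) -> Vec<usize> {
    // Convert the runs into half-open [start, end) ranges of selected rows.
    let mut selected: Vec<(usize, usize)> = Vec::new();
    let mut cursor = 0;
    for run in selection {
        match *run {
            Run::Skip(n) => cursor += n,
            Run::Select(n) => {
                if n > 0 {
                    selected.push((cursor, cursor + n));
                }
                cursor += n;
            }
        }
    }

    // Keep a page if its row range overlaps any selected range.
    pages
        .iter()
        .enumerate()
        .filter(|(_, p)| {
            let (start, end) = (p.first_row, p.first_row + p.num_rows);
            selected.iter().any(|&(s, e)| s < end && start < e)
        })
        .map(|(i, _)| i)
        .collect()
}
```

With three 100-row pages and a selection that skips 150 rows and then selects 20, only the second page overlaps the selected rows, so the other two can be skipped entirely.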
-
For my points 2 and 3, I have no questions now.
-
1. Make `new_builder` public for more flexible operations.

   It would be more flexible to allow users to pass `ParquetMetaData` manually. For example:

   - If we want to analyze `ParquetMetaData` first (for collecting stats, pruning row groups, ...), we can pass this `ParquetMetaData` to build a reader directly and avoid reading it twice.
   - If we want to prune row groups, we need to call `with_row_groups` on `ArrowReaderBuilder`. But we can only know which row groups to prune after we have read the parquet metadata.

2. `ArrowReaderOptions` contains `page_index`, but `ArrowReaderBuilder` doesn't use it.

   After reading the code, I found that neither the sync nor the async `ParquetRecordBatchReader` can use the page index to optimize IO.

3. Sync and async `ArrowReader` have different read options, and the APIs are quite confusing.

   If we create a reader with `ArrowReaderBuilder`, we pass `ArrowReaderOptions` to it. However, if we want to create a sync reader, `ArrowReaderOptions` will be converted to `ReadOptions` (https://github.com/apache/arrow-rs/blob/master/parquet/src/file/serialized_reader.rs#L172).

   I think there are some problems:

   - Only `page_index` from `ArrowReaderOptions` is used to construct `ReadOptions`. Other options like `ReadGroupPredicate` do not exist there, so we cannot prune row groups by passing predicates if we create the reader with `ArrowReaderBuilder`.
   - The async reader doesn't use `ReadOptions`, which may cause the async reader to miss some optimizations.

   I think we should unify them and expose more reasonable APIs.