Suggestions and problems about `ArrowReaderBuilder` (or `ParquetRecordBatchStreamBuilder`) #4674

RinChanNOWWW · 2023-08-10T07:45:09Z

1. Make new_builder public for more flexible operations.

impl<T: AsyncFileReader + Send + 'static> ArrowReaderBuilder<AsyncReader<T>> {
    /// Create a new [`ParquetRecordBatchStreamBuilder`] with the provided parquet file
    pub async fn new(mut input: T) -> Result<Self> {
        let metadata = input.get_metadata().await?;
        Self::new_builder(AsyncReader(input), metadata, Default::default())
    }
}

It would be more flexible to let users pass ParquetMetaData manually. For example:
If we want to analyze ParquetMetaData first (to collect stats, prune row groups, and so on), we can pass the same ParquetMetaData to build a reader directly and avoid reading it twice.

If we want to prune row groups, we need to call with_row_groups on ArrowReaderBuilder. But we only know which row groups to prune after we have read the parquet metadata.
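To make the pruning step concrete, here is a minimal sketch of how row group indices could be chosen from min/max statistics before a reader is built. `RowGroupStats` and `row_groups_to_read` are hypothetical names used only for illustration; they are a simplified stand-in for the statistics held in ParquetMetaData, not the parquet crate's API.

```rust
/// Simplified stand-in for the min/max statistics of one column in one
/// row group. Hypothetical type, not part of the parquet crate.
#[derive(Debug, Clone, Copy)]
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
}

/// Return the indices of the row groups whose [min, max] range may contain
/// a value in `lo..=hi`. Row groups that cannot match are pruned.
pub fn row_groups_to_read(stats: &[RowGroupStats], lo: i64, hi: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        // Keep a row group only if its value range overlaps the query range.
        .filter(|(_, s)| s.max >= lo && s.min <= hi)
        .map(|(index, _)| index)
        .collect()
}
```

The resulting list of indices is the kind of value one would then hand to with_row_groups, which is why having the metadata available before the reader is built matters.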

2. ArrowReaderOptions contains page_index, but ArrowReaderBuilder doesn't use it.

After reading the code, I found that neither the sync nor the async ParquetRecordBatchReader can use the page index to optimize IO.

3. The async and sync ArrowReader have different read options, and the APIs are quite confusing.

When we create a reader through ArrowReaderBuilder, we pass ArrowReaderOptions to it.

However, when we create a sync reader, ArrowReaderOptions is converted to ReadOptions (https://github.com/apache/arrow-rs/blob/master/parquet/src/file/serialized_reader.rs#L172).

I think there are some problems:

Only page_index is carried over from ArrowReaderOptions when constructing ReadOptions. Other options, such as ReadGroupPredicate, are not available, so we cannot prune row groups by passing predicates when the reader is created through ArrowReaderBuilder.
The async reader does not construct ReadOptions at all, which may cause it to miss some optimizations.
We can make the reader load the page index, but how does it use it? There is no API for evaluating the page index to skip IO.

I think we should unify them and expose more reasonable APIs.

RinChanNOWWW · 2023-08-10T07:48:17Z

Author

PTAL @tustvold @alamb

tustvold · 2023-08-10T08:06:24Z

Collaborator

ArrowReaderBuilder reads and provides access to the ParquetMetadata, including the page index if you enable it?

https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.metadata

https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index

I would recommend checking out DataFusion's ParquetExec which shows how these APIs can be used

> can use page index to optimize IO.

I'm not sure why you got this impression, but it is not true. If you provide a RowSelection, derived from the page index or otherwise, it will use this to elide IO and decode.

Note: I do hope to provide better APIs for interacting with the parquet statistics in the future (#4328), but I've not had sufficient bandwidth lately.

4 replies

RinChanNOWWW Aug 10, 2023
Author

> If you provide a RowSelection, derived from the page index or otherwise, it will use this to elide IO and decode

You are right. However, if we want to get row selections, we need to read the metadata first. Then, when building ParquetRecordBatchReader, it will read the metadata again. I think we can avoid this second read.

I mean, ParquetRecordBatchReader will not use the page index by itself during reading. For example, it would be reasonable to pass a PagePredicate or something similar to it to achieve this goal. As ParquetRecordBatchReader will not use page index, why does ArrowReaderOptions contain page_index?

tustvold Aug 10, 2023
Collaborator

> As ParquetRecordBatchReader will not use page index

The offset index, which is part of the page index, is very important for efficiently applying RowSelection (and RowFilter) and is used extensively in both the sync and async variants.
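As a rough illustration of why the offset index helps here, the sketch below maps a set of wanted row ranges to the byte ranges of the pages that cover them, so that untouched pages never need to be fetched. `PageLocation` and `byte_ranges_for_rows` are hypothetical, simplified names; the sketch only assumes that an offset index records each page's byte offset, size, and first row, and it does not reproduce the parquet crate's types or its actual fetch logic.

```rust
use std::ops::Range;

/// Simplified page location, loosely modelled on what an offset index
/// records for each page. Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct PageLocation {
    /// Byte offset of the page within the file.
    pub offset: u64,
    /// Size of the page in bytes.
    pub compressed_size: u64,
    /// Index (within the row group) of the first row stored in the page.
    pub first_row_index: usize,
}

/// Return the byte ranges of every page that holds at least one wanted row.
/// `pages` must be ordered by `first_row_index`; `total_rows` is the row
/// count of the row group and closes the row range of the last page.
pub fn byte_ranges_for_rows(
    pages: &[PageLocation],
    total_rows: usize,
    wanted: &[Range<usize>],
) -> Vec<Range<u64>> {
    let mut ranges = Vec::new();
    for (i, page) in pages.iter().enumerate() {
        // A page ends where the next page begins, or at the end of the row group.
        let end = pages.get(i + 1).map_or(total_rows, |next| next.first_row_index);
        let rows = page.first_row_index..end;
        // Two non-empty half-open ranges overlap if each starts before the other ends.
        let needed = !rows.is_empty()
            && wanted
                .iter()
                .any(|w| !w.is_empty() && w.start < rows.end && rows.start < w.end);
        if needed {
            ranges.push(page.offset..page.offset + page.compressed_size);
        }
    }
    ranges
}
```

Without per-page locations, a reader would have no way to know which bytes belong to the skipped rows and would have to fetch and decode the whole column chunk.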

> However, if we want to get row selections, we need to read the metadata first. Then, when building ParquetRecordBatchReader, it will read the metadata again

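Why not do the following?

// Reads the footer metadata, plus the page index, while constructing the builder.
let options = ArrowReaderOptions::new().with_page_index(true);
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options).unwrap();

// compute_page_index stands for user code that turns the metadata into a RowSelection.
let selection = compute_page_index(builder.metadata())?;
builder.with_row_selection(selection).build()?

This will read the metadata only once.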
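The snippet leaves `compute_page_index` undefined. As a minimal sketch of what such a function might do, the code below turns per-page min/max values into alternating skip/select runs. `Selector`, `PageStats`, and `select_pages` are hypothetical local types and names that only mimic the run-length shape of a row selection; real code would read the page statistics from the metadata and build the crate's own selection type instead.

```rust
/// Simplified run-length selector: either skip or select `row_count`
/// consecutive rows. Hypothetical local type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Selector {
    pub row_count: usize,
    pub skip: bool,
}

/// Per-page summary: row count plus min/max of the filtered column.
/// Hypothetical stand-in for what a page index provides.
#[derive(Debug, Clone, Copy)]
pub struct PageStats {
    pub rows: usize,
    pub min: i64,
    pub max: i64,
}

/// Build a selection that skips every page whose [min, max] range cannot
/// overlap `lo..=hi`, merging adjacent runs of the same kind.
pub fn select_pages(pages: &[PageStats], lo: i64, hi: i64) -> Vec<Selector> {
    let mut selection: Vec<Selector> = Vec::new();
    for page in pages {
        if page.rows == 0 {
            continue; // Empty pages contribute nothing to the selection.
        }
        let skip = page.max < lo || page.min > hi;
        // Extend the previous run when it has the same skip/select kind.
        if let Some(last) = selection.last_mut() {
            if last.skip == skip {
                last.row_count += page.rows;
                continue;
            }
        }
        selection.push(Selector { row_count: page.rows, skip });
    }
    selection
}
```

Merging adjacent runs keeps the selection short, which is convenient when a file has many small pages.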

RinChanNOWWW Aug 10, 2023
Author

Thanks for your advice, I see what you mean. But there are still some difficulties for my implementation.

I want to collect all the parquet stats when building the plan (before executing the full plan), so that our SQL optimizer can use these stats to optimize the plan.
I want to read row groups in parallel. For example, I can build one reader for row group 0 and another reader for row group 3, and run them in different threads.

If the reader can be built from ParquetMetaData, we can read the metadata only once and still achieve these goals.
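The pattern being asked for can be sketched as follows: parse the metadata once, share it behind an `Arc`, and give each thread a reader bound to a single row group. `FileMetadata`, `RowGroupReader`, and `read_in_parallel` are hypothetical names, and the "read" only returns a row count. This models the sharing pattern under those assumptions, not the parquet crate's reader API.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for already-parsed file metadata:
/// the number of rows in each row group.
pub struct FileMetadata {
    pub row_group_rows: Vec<usize>,
}

/// Hypothetical reader bound to one row group. It shares the parsed
/// metadata instead of loading it again.
pub struct RowGroupReader {
    metadata: Arc<FileMetadata>,
    row_group: usize,
}

impl RowGroupReader {
    pub fn new(metadata: Arc<FileMetadata>, row_group: usize) -> Self {
        Self { metadata, row_group }
    }

    /// Pretend to read the row group and return how many rows it holds.
    pub fn read(&self) -> usize {
        self.metadata.row_group_rows[self.row_group]
    }
}

/// Read the given row groups on separate threads, reusing one metadata copy.
/// Results are returned in the same order as `row_groups`.
pub fn read_in_parallel(metadata: &Arc<FileMetadata>, row_groups: &[usize]) -> Vec<usize> {
    let handles: Vec<thread::JoinHandle<usize>> = row_groups
        .iter()
        .map(|&row_group| {
            // Cloning the Arc copies a pointer, not the metadata itself.
            let reader = RowGroupReader::new(Arc::clone(metadata), row_group);
            thread::spawn(move || reader.read())
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("reader thread panicked"))
        .collect()
}
```

The same metadata value can also feed the planner's statistics collection before any reader is created, which covers the first goal as well.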

tustvold Aug 10, 2023
Collaborator

#4676 contains some proposed API changes / docs improvements PTAL

RinChanNOWWW · 2023-08-10T08:53:18Z

Author

For my points 2 and 3, I have no more questions now.

