Skip to content

Support multi-threaded writing of Parquet files with modular encryption #7818

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

rok
Copy link
Member

@rok rok commented Jun 29, 2025

Closes #7359.

Rationale for this change

This is to enable concurrent column writing with encryption downstream (e.g. with datafusion). See #7359 for more.
See https://github.com/apache/arrow-rs/pull/7111/files#r2015196618

What changes are included in this PR?

  • ArrowRowGroupWriterFactory.create_row_group_writer now passes a FileEncryptor to ArrowColumnWriterFactory that can later be used to write columns concurrently.
  • Several methods and properties are now pub.
  • Minor change to how encryption tests read test data.

Are these changes tested?

Yes.

Are there any user-facing changes?

ArrowRowGroupWriterFactory, ArrowRowGroupWriter, ArrowRowGroupWriterFactory::new, ArrowRowGroupWriterFactory.create_row_group_writer , .. are now pub. Shouldn't be breaking but perhaps needs review.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 29, 2025
@rok rok force-pushed the multi-threaded_encrypted_writing branch from ee38fb5 to cd5196f Compare June 29, 2025 00:03
@rok rok requested a review from adamreeve June 29, 2025 00:04
@rok rok force-pushed the multi-threaded_encrypted_writing branch from cd5196f to c2ac2d9 Compare June 29, 2025 10:41
Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I like the approach of exposing the ArrowRowGroupWriterFactory to handle passing through the file encryption properties. We should probably also update the documentation on ArrowColumnWriter to encourage users to use this API, and/or add documentation to get_column_writers to encourage users to use this approach for compatibility with encryption.

There seem to be a lot of extra things that have been exposed publicly in this PR that don't need to be though, which makes it hard to follow what changes are actually needed. We should minimize how much is public so it's easier for users to follow documentation and see which APIs they should be using, and so that internal details can be changed later.

Co-authored-by: Adam Reeve <adreeve@gmail.com>
@rok rok force-pushed the multi-threaded_encrypted_writing branch 2 times, most recently from 86ddf45 to b35d078 Compare June 30, 2025 10:37
@rok rok force-pushed the multi-threaded_encrypted_writing branch from b35d078 to 2d42ca0 Compare June 30, 2025 10:42
@rok
Copy link
Member Author

rok commented Jun 30, 2025

Thanks for the quick review @adamreeve !
I've moved schemas and writer properties to ArrowRowGroupWriterFactory as you suggested.
Test also now uses tokio channels.

@rok rok force-pushed the multi-threaded_encrypted_writing branch from bc293de to 4009f42 Compare June 30, 2025 12:34
@rok rok force-pushed the multi-threaded_encrypted_writing branch from 4009f42 to b57a5d4 Compare June 30, 2025 12:43
Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good thanks Rok, I've added a few more comments

@rok
Copy link
Member Author

rok commented Jul 10, 2025

I've opened a draft PR against datafusion to check if this is usable for multi-threaded writes and it does seem to be. Switching this to ready for review.

@rok rok marked this pull request as ready for review July 10, 2025 13:19
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rok and @adamreeve -- this is very exciting.

I have one question about API design but otherwise the basic idea looks great to me

@@ -755,7 +755,7 @@ impl ArrowColumnWriter {
}

/// Encodes [`RecordBatch`] to a parquet row group
struct ArrowRowGroupWriter {
pub struct ArrowRowGroupWriter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we can't use the existing low level APIs documented here: https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel

In other words, do we really need to make what has previously been an implementation detail part of the public API?

Copy link
Member Author

@rok rok Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick response @alamb ! :)
Looking at the example - we have to pass a single private FileEncryptor object (it has random state) to all ArrowColumnWriters, SerializedRowGroupWriters and SerializedFileWriter. Perhaps we could hide everything into ArrowWriter. I'll have to do a quick check. If you have a cleaner idea we'd be most interested!

Copy link
Contributor

@alamb alamb Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think avoiding exposing ArrowRowGroupWriter and the associated machinery that would be good

The rationale is that @XiangpengHao and @zhuqi-lucas and myself are likely to be reworking it in the next few releases and once it is part of the public API changing it becomes harder as it requires more coordination / backwards compatibility concerns

@@ -457,7 +457,7 @@ pub struct WriterPropertiesBuilder {

impl WriterPropertiesBuilder {
/// Returns default state of the builder.
fn with_defaults() -> Self {
pub fn with_defaults() -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to make this pub, you can call WriterProperties::builder instead. I think it's best to keep this private rather than exposing two ways to do the same thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or impl Default for WriterPropertiesBuilder?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support multi-threaded writing of Parquet files with modular encryption
4 participants