-
Notifications
You must be signed in to change notification settings - Fork 976
Support multi-threaded writing of Parquet files with modular encryption #7818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ee38fb5
to
cd5196f
Compare
cd5196f
to
c2ac2d9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I like the approach of exposing the ArrowRowGroupWriterFactory
to handle passing through the file encryption properties. We should probably also update the documentation on ArrowColumnWriter
to encourage users to use this API, and/or add documentation to get_column_writers
to encourage users to use this approach for compatibility with encryption.
There seem to be a lot of extra things that have been exposed publicly in this PR that don't need to be though, which makes it hard to follow what changes are actually needed. We should minimize how much is public so it's easier for users to follow documentation and see which APIs they should be using, and so that internal details can be changed later.
Co-authored-by: Adam Reeve <adreeve@gmail.com>
86ddf45
to
b35d078
Compare
b35d078
to
2d42ca0
Compare
Thanks for the quick review @adamreeve ! |
bc293de
to
4009f42
Compare
4009f42
to
b57a5d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty good thanks Rok, I've added a few more comments
I've opened a draft PR against datafusion to check if this is usable for multi-threaded writes and it does seem to be. Switching this to ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rok and @adamreeve -- this is very exciting.
I have one question about API design but otherwise the basic idea looks great to me
@@ -755,7 +755,7 @@ impl ArrowColumnWriter { | |||
} | |||
|
|||
/// Encodes [`RecordBatch`] to a parquet row group | |||
struct ArrowRowGroupWriter { | |||
pub struct ArrowRowGroupWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we can't use the existing low level APIs documented here: https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel
In other words, do we really need to make what has previously been an implementation detail part of the public API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick response @alamb ! :)
Looking at the example - we have to pass a single private FileEncryptor
object (it has random state) to all ArrowColumnWriter
s, SerializedRowGroupWriter
s and SerializedFileWriter
. Perhaps we could hide everything into ArrowWriter
. I'll have to do a quick check. If you have a cleaner idea we'd be most interested!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think avoiding exposing ArrowRowGroupWriter and the associated machinery that would be good
The rationale is that @XiangpengHao and @zhuqi-lucas and myself are likely to be reworking it in the next few releases and once it is part of the public API changing it becomes harder as it requires more coordination / backwards compatibility concerns
@@ -457,7 +457,7 @@ pub struct WriterPropertiesBuilder { | |||
|
|||
impl WriterPropertiesBuilder { | |||
/// Returns default state of the builder. | |||
fn with_defaults() -> Self { | |||
pub fn with_defaults() -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to make this pub
, you can call WriterProperties::builder
instead. I think it's best to keep this private rather than exposing two ways to do the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or impl Default for WriterPropertiesBuilder
?
Closes #7359.
Rationale for this change
This is to enable concurrent column writing with encryption downstream (e.g. with datafusion). See #7359 for more.
See https://github.com/apache/arrow-rs/pull/7111/files#r2015196618
What changes are included in this PR?
ArrowRowGroupWriterFactory.create_row_group_writer
now passes aFileEncryptor
toArrowColumnWriterFactory
that can later be used to write columns concurrently.pub
.Are these changes tested?
Yes.
Are there any user-facing changes?
ArrowRowGroupWriterFactory
,ArrowRowGroupWriter
,ArrowRowGroupWriterFactory::new
,ArrowRowGroupWriterFactory.create_row_group_writer
, .. are nowpub
. Shouldn't be breaking but perhaps needs review.