-
Notifications
You must be signed in to change notification settings - Fork 1.6k
feat: support multi-threaded writing of Parquet files with modular encryption #16738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Note: this includes some unrelated changes so as to be able to use changes in apache/arrow-rs#7818. @adamreeve @corwinjoy thought? |
b7e1398
to
d8709fc
Compare
d8709fc
to
bb475bf
Compare
8e872c3
to
5929724
Compare
5929724
to
f1f6d63
Compare
This approach looks good to me! Do the existing tests hit the parallel write code path? We probably want to make sure there are encryption tests for both the parallel and serial code paths. |
Via copilot: Dependency Updates:
Feature Enhancements:
Code Cleanup:
Protobuf Updates:
|
max_buffer_rb, | ||
&pool, | ||
)?; | ||
let mut current_rg_rows = 0; | ||
// TODO: row_group_writer should use the correct row group index. Currently this would fail if | ||
// multiple row groups were written. | ||
// let mut rg_index = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamreeve Definitely something to note. We will want to resolve this before the final PR.
Looks good to me, with the exception of multi-row group writing being missing. When you go to rebase to the latest datafusion the diff should get a lot simpler since they have upgraded to support arrow-55.3 and added the missing statitistics and types. I also like @adamreeve suggestion of having a test for the parallel write path. |
launch_serialization_task | ||
.join_unwind() | ||
.await | ||
.map_err(|e| DataFusionError::ExecutionJoin(Box::new(e)))??; | ||
Ok(file_metadata) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot comments:
- Row Group Indexing: The TODO comment indicates that the row group writer currently assumes a single row group index (always 0). If multiple row groups are written in parallel, this might cause incorrect output or data corruption. Consider addressing this before merging.
- Encryption Parallelism: The removal of the explicit check for parallelism when encryption is enabled relies on arrow-rs’ internal handling. Double-check upstream to ensure there’s a clear error or fallback if parallelism is not supported with encryption.
- Testing: These changes affect core serialization logic. Ensure thorough integration and performance tests, especially for edge cases (encryption, large datasets, multiple row groups).
- Documentation: Consider updating related docs or comments, as the code flow and APIs used have changed significantly.
Which issue does this PR close?
Rationale for this change
#16351 added modular encryption reading and writing. This builds on top of #16351 and uses apache/arrow-rs#7818 to enable multi threaded encrypted writing.
What changes are included in this PR?
This uses a lower level
ArrowRowGroupWriterFactory
API to create encryption-aware column writers that are then run multi threaded.Are these changes tested?
Yes.
Are there any user-facing changes?
Previously
parquet_opts.global.allow_single_file_parallelism == true
would be ignored and encrypted write would always be single threaded. Now it will run multithreaded.