feat: Parquet modular encryption #16351
base: main
Conversation
…uite, column encryption is broken.
Co-authored-by: Adam Reeve <adreeve@gmail.com>
…operties to use references.
… "." instead of "::"
2. Fixed unused header warning in config.rs. 3. Fixed test case in encryption.rs to call the conversion to ConfigFileDecryptionProperties correctly.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Add an example to read and write encrypted parquet files.
@@ -60,11 +60,11 @@ pub async fn main() -> Result<()> {
    Options::Cancellation(opt) => opt.run().await,
    Options::Clickbench(opt) => opt.run().await,
    Options::H2o(opt) => opt.run().await,
-   Options::Imdb(opt) => opt.run().await,
+   Options::Imdb(opt) => Box::pin(opt.run()).await,
requested by clippy
datafusion/common/src/config.rs
Outdated
// Any hex encoded values must be pre-encoded using
// hex::encode() before calling set.
if key.starts_with("column_keys_as_hex.") {
    let k = match key.split(".").collect::<Vec<_>>()[..] {
We could use some feedback on how to handle the column keys. Originally, I used a separator of '::' to match what is done with metadata fields, but TableParquetOptions redirects all '::' delimiters, as seen here:
https://github.com/corwinjoy/datafusion/blob/a81855fcbf3cfb63512c1ba124e1ebbfd5e6b15c/datafusion/common/src/config.rs#L2100
So I'm not quite sure what to do here. For now, we use '.' to separate columns.
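To make the key format concrete, here is a minimal hypothetical sketch of how a per-column key is addressed with the '.' separator. `to_hex` below stands in for `hex::encode()` so the snippet is self-contained; the names are illustrative, not the PR's actual code.

```rust
// Minimal illustration of the '.'-separated config key format discussed above.
// `to_hex` is a stand-in for hex::encode(); names here are hypothetical.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    let column = "id";
    let key_bytes: &[u8] = b"0123456789012345"; // example 16-byte AES key
    // The column name becomes the last '.'-separated segment of the config key,
    // and the key material must be hex encoded before calling set.
    let config_key = format!("column_keys_as_hex.{column}");
    let config_value = to_hex(key_bytes);
    println!("{config_key} = {config_value}");
}
```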
If the encryption related settings were directly set on the TableParquetOptions or in a crypto/encryption namespace rather than in ParquetOptions, then I think we could avoid this issue. But then they'd probably need to be included in ParquetReadOptions too to work with SessionContext::read_parquet (see related comment at https://github.com/apache/datafusion/pull/16351/files#r2136718671).
I've opened a PR against your branch that implements the suggestion to move these configuration options under a new field in TableParquetOptions: corwinjoy#5. I think this worked quite nicely and simplified the ConfigField implementations.
datafusion/common/src/config.rs
Outdated
pub column_metadata_as_hex: HashMap<String, String>,
pub aad_prefix_as_hex: String,
pub store_aad_prefix: bool, // default = false
}
We create a separate Config struct, then use From methods to convert back and forth from the underlying parquet FileEncryptionProperties.
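As a sketch of that conversion pattern, using simplified stand-in types (the real FileEncryptionProperties lives in the parquet crate and the real config struct has more fields, such as the AAD prefix), the From implementations might look like:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the parquet-side and config-side structs.
#[derive(Debug, PartialEq)]
struct FileEncryptionProps {
    footer_key: Vec<u8>,
    column_keys: HashMap<String, Vec<u8>>,
}

#[derive(Debug, PartialEq)]
struct ConfigFileEncryptionProps {
    footer_key_as_hex: String,
    column_keys_as_hex: HashMap<String, String>,
}

// Stand-ins for hex::encode() / hex::decode() so this sketch is self-contained.
fn encode(b: &[u8]) -> String {
    b.iter().map(|x| format!("{x:02x}")).collect()
}

fn decode(s: &str) -> Vec<u8> {
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).unwrap())
        .collect()
}

// Config -> parquet-side properties: decode the hex strings back to bytes.
impl From<&ConfigFileEncryptionProps> for FileEncryptionProps {
    fn from(c: &ConfigFileEncryptionProps) -> Self {
        Self {
            footer_key: decode(&c.footer_key_as_hex),
            column_keys: c
                .column_keys_as_hex
                .iter()
                .map(|(k, v)| (k.clone(), decode(v)))
                .collect(),
        }
    }
}

// Parquet-side properties -> config: hex encode so values are plain strings.
impl From<&FileEncryptionProps> for ConfigFileEncryptionProps {
    fn from(p: &FileEncryptionProps) -> Self {
        Self {
            footer_key_as_hex: encode(&p.footer_key),
            column_keys_as_hex: p
                .column_keys
                .iter()
                .map(|(k, v)| (k.clone(), encode(v)))
                .collect(),
        }
    }
}

fn main() {
    let props = FileEncryptionProps {
        footer_key: b"0123456789012345".to_vec(),
        column_keys: HashMap::from([("id".to_string(), b"abcdefghabcdefgh".to_vec())]),
    };
    // Round-trip through the config representation.
    let config = ConfigFileEncryptionProps::from(&props);
    let back = FileEncryptionProps::from(&config);
    assert_eq!(props, back);
}
```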
datafusion/common/src/config.rs
Outdated
pub file_decryption_properties: Option<ConfigFileDecryptionProperties>, default = None

/// Optional file encryption properties
pub file_encryption_properties: Option<ConfigFileEncryptionProperties>, default = None
@adamreeve and I are not completely sure where these settings should go. On the session context there's only a way to set the "global" ParquetOptions, but not TableParquetOptions, which contains extra table-specific settings.
It does feel a bit wrong to put file-specific decryption properties in the execution context (see later examples). E.g. if users were reading two different encrypted Parquet files in one query, they might need to set different decryption properties for each file, so setting them in the execution context wouldn't work. At the moment I think this scenario would require creating separate listing tables and specifying TableParquetOptions. That's an edge case, so maybe I'm overthinking this, but maybe being able to set file decryption properties in ParquetReadOptions would be a good idea?
This doesn't really fit all that well with the reader options that Parquet has, though.
maybe @metesynnada or @berkaysynnada have some ideas of how to do this
@adamreeve has a nice PR to move this all to a crypto namespace, which cleans this up a lot. We are still debating a bit, since we want to understand the impact downstream for tools like delta-rs.
corwinjoy#5
datafusion/common/src/config.rs
Outdated
.unwrap();

for (i, col_name) in column_names.iter().enumerate() {
    let key = format!("file_encryption_properties.column_keys_as_hex.{col_name}");
Note use of '.' as separator for column name, as mentioned above.
@alamb One piece I would like to solicit feedback on is whether there is a way to leverage the existing tests to more thoroughly vet encryption. What I mean is that we uncovered a read bug when using filters in a query, and I worry that there could be other edge cases that might not be covered. What I would like to do is take an encrypted parquet file and then run the datafusion SQL tests over it (and maybe other operation tests). This would help make sure that all the SQL operations are really covered. And maybe, in addition, somehow double-check things like statistics and bloom filters? Anyway, I'm hoping there is a way to leverage the existing test suite to cover these cases. Any suggestions?
Thank you and @adamreeve for driving so much of the modular encryption work! I'll take a look at this branch this week and see how this might get Comet supporting modular encryption within Spark, or if any obvious gaps jump out at me.
I am sorry I haven't had a chance to review this yet. It would be great if @mbutrovich could also take a look. I have this on my list to review, but I haven't been able to find the time yet.
Move encryption and decryption configuration options into a separate crypto namespace
I've been experimenting with how this work could be extended to support more ways of configuring encryption beyond having fixed and known AES keys for all files. For example, data encryption keys are often randomly generated per file in multi-file datasets, and the keys are stored encrypted in the Parquet file's encryption metadata. I've got an example of how this could work that integrates with the parquet-key-management crate in a draft PR here if anyone is interested. I've added a new … This should be a follow-up PR rather than part of this PR, but I think it's worth mentioning here, as this will require adding a separate way to configure encryption rather than using the new configuration options.
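For readers unfamiliar with the pattern being described, here is a toy sketch of the envelope-encryption idea: each file gets its own randomly generated data encryption key (DEK), and only a wrapped (encrypted) form of that DEK is stored in the file's metadata. The XOR "wrapping" below is purely illustrative and NOT cryptography; real implementations delegate wrapping to a key management service, which is what the parquet-key-management crate integrates with.

```rust
// Toy illustration of per-file key wrapping. NOT real cryptography:
// the XOR "KMS" is a stand-in for calls to an external key-management service.
struct ToyKms {
    master: u8,
}

impl ToyKms {
    fn wrap(&self, dek: &[u8]) -> Vec<u8> {
        dek.iter().map(|b| b ^ self.master).collect()
    }
    fn unwrap_key(&self, wrapped: &[u8]) -> Vec<u8> {
        wrapped.iter().map(|b| b ^ self.master).collect()
    }
}

fn main() {
    let kms = ToyKms { master: 0x5a };
    // Each file would get its own randomly generated DEK; fixed here for clarity.
    let dek: Vec<u8> = vec![1, 2, 3, 4];
    // Only the wrapped DEK is stored in the file's encryption metadata...
    let stored = kms.wrap(&dek);
    // ...and a reader recovers the DEK by asking the KMS to unwrap it.
    assert_eq!(kms.unwrap_key(&stored), dek);
}
```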
I still owe this a look. I am traveling until July 7, unfortunately, and likely won't get a chance to put it through its paces with Comet until after then (I need to do some Comet work to get it working with this branch).
Thank you @corwinjoy and @adamreeve -- this PR was a joy to read and review. The code is clear, well commented, and well tested ❤️ 🏆
I think we should follow up with:
- Improve the documentation to include the format required for encryption/decryption properties
- Consider adding an `encryption` or similar feature flag so people who don't want support for parquet encryption can avoid bringing along the dependencies
@@ -246,4 +246,72 @@ mod tests {
    Ok(())
}

#[tokio::test]
async fn roundtrip_parquet_with_encryption() -> Result<()> {
I wonder why this isn't in core/tests as well 🤔 (I see you are just following the existing pattern, I just noticed this while reviewing this PR)
I'm happy to move it if you think we should. As you note, I am following the pattern of the existing tests but it may fit better elsewhere.
Maybe we can move it in a follow on PR -- I would sort of expect the tests to be in https://github.com/apache/datafusion/tree/main/datafusion/core/tests/dataframe, perhaps in a file named parquet.rs.
// specific language governing permissions and limitations
// under the License.

use datafusion::common::DataFusionError;
I ran this example and it works great
===============================================================================
Encrypted Parquet DataFrame:
Schema:
+------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+-----------------+------------+---------------------+
| describe | id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col | string_col | timestamp_col |
+------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+-----------------+------------+---------------------+
| count | 8.0 | 8 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8 | 8 | 8 |
| null_count | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 |
| mean | 3.5 | null | 0.5 | 0.5 | 0.5 | 5.0 | 0.550000011920929 | 5.05 | null | null | null |
| std | 2.4494897427831783 | null | 0.5345224838248488 | 0.5345224838248488 | 0.5345224838248488 | 5.3452248382484875 | 0.5879747449513427 | 5.398677086630973 | null | null | null |
| min | 0.0 | null | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 01/01/09 | 0 | 2009-01-01T00:00:00 |
| max | 7.0 | null | 1.0 | 1.0 | 1.0 | 10.0 | 1.100000023841858 | 10.1 | 04/01/09 | 1 | 2009-04-01T00:01:00 |
| median | 3.0 | null | 0.0 | 0.0 | 0.0 | 5.0 | 0.550000011920929 | 5.05 | null | null | null |
+------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+-----------------+------------+---------------------+
Selected rows and columns:
+----+----------+---------------------+
| id | bool_col | timestamp_col |
+----+----------+---------------------+
| 6 | true | 2009-04-01T00:00:00 |
| 7 | false | 2009-04-01T00:01:00 |
+----+----------+---------------------+
docs/source/user-guide/configs.md
Outdated
@@ -81,6 +81,8 @@ Environment variables are read during `SessionConfig` initialisation so they mus
| datafusion.execution.parquet.allow_single_file_parallelism | true | (writing) Controls whether DataFusion will attempt to speed up writing parquet files by serializing them in parallel. Each column in each row group in each output file are serialized in parallel leveraging a maximum possible core count of n_files*n_row_groups*n_columns. |
| datafusion.execution.parquet.maximum_parallel_row_group_writers | 1 | (writing) By default parallel parquet writer is tuned for minimum memory usage in a streaming execution plan. You may see a performance benefit when writing large parquet files by increasing maximum_parallel_row_group_writers and maximum_buffered_record_batches_per_stream if your system has idle cores and can tolerate additional memory usage. Boosting these values is likely worthwhile when writing out already in-memory data, such as from a cached data frame. |
| datafusion.execution.parquet.maximum_buffered_record_batches_per_stream | 2 | (writing) By default parallel parquet writer is tuned for minimum memory usage in a streaming execution plan. You may see a performance benefit when writing large parquet files by increasing maximum_parallel_row_group_writers and maximum_buffered_record_batches_per_stream if your system has idle cores and can tolerate additional memory usage. Boosting these values is likely worthwhile when writing out already in-memory data, such as from a cached data frame. |
| datafusion.execution.parquet.file_decryption_properties | NULL | Optional file decryption properties |
It would be nice to document the format of these properties -- for example mention they are hex encoded keys of whatever type, or perhaps add a link to the appropriate documentation
For example, how would you configure parquet encryption from the datafusion-cli (set datafusion.execution.parquet.file_decryption_properties = ???)?
To be clear, I don't think we need to have a super easy to configure system at first, but I do think it is important to document and point people in the right direction if they get here.
Yes, this is a good suggestion. We actually need to update the TableParquetOptions docs and remove this entry, since this got moved. @alamb, one question: can you suggest where to put a CLI usage example? I guess I could add something under datafusion-cli/tests/sql. The options will look like what we have for KMS, but I want to set up a running example. E.g. for the KMS we have:
let ddl = format!(
    "CREATE EXTERNAL TABLE encrypted_parquet_table_2 \
     STORED AS PARQUET LOCATION '{file_path}' OPTIONS (\
     'format.crypto.factory_id' '{ENCRYPTION_FACTORY_ID}' \
     )"
);
Which issue does this PR close?
What changes are included in this PR?
This PR adds support for encryption in DataFusion’s Parquet implementation. The changes introduce new configuration options for file encryption and decryption properties, update various components (including proto conversion, file reading/writing, and tests), and add an end-to-end encrypted Parquet example.
Are these changes tested?
Tests and examples have been added to demonstrate and test functionality against Parquet modular encryption. These could use feedback, since there may be additional DataFusion use cases that should be covered.
Are there any user-facing changes?
Additional options have been added to allow encryption/decryption configuration. We are soliciting additional feedback on how to handle key columns in a way that best fits the existing API.