feat: Parquet modular encryption #16351
base: main
Conversation
…uite, column encryption is broken.
Co-authored-by: Adam Reeve <adreeve@gmail.com>
…operties to use references.
… "." instead of "::"
2. Fixed unused header warning in config.rs. 3. Fixed test case in encryption.rs to call the conversion to ConfigFileDecryptionProperties correctly.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Add an example to read and write encrypted parquet files.
@@ -60,11 +60,11 @@ pub async fn main() -> Result<()> {
    Options::Cancellation(opt) => opt.run().await,
    Options::Clickbench(opt) => opt.run().await,
    Options::H2o(opt) => opt.run().await,
-   Options::Imdb(opt) => opt.run().await,
+   Options::Imdb(opt) => Box::pin(opt.run()).await,
requested by clippy
datafusion/common/src/config.rs
Outdated
// Any hex encoded values must be pre-encoded using
// hex::encode() before calling set.
if key.starts_with("column_keys_as_hex.") {
    let k = match key.split(".").collect::<Vec<_>>()[..] {
We could use some feedback on how to handle the column keys. Originally, I used a separator of '::' to match what is done with metadata fields, but TableParquetOptions redirects all '::' delimiters, as seen here:
https://github.com/corwinjoy/datafusion/blob/a81855fcbf3cfb63512c1ba124e1ebbfd5e6b15c/datafusion/common/src/config.rs#L2100
So I'm not quite sure what to do here. For now, we use '.' to separate columns.
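To make the key format concrete, here is a minimal hypothetical sketch of how a per-column key is addressed with the '.' separator. `to_hex` below stands in for `hex::encode()` so the snippet is self-contained; the names are illustrative, not the PR's actual code.

```rust
// Minimal illustration of the '.'-separated config key format discussed above.
// `to_hex` is a stand-in for hex::encode(); names here are hypothetical.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    let column = "id";
    let key_bytes: &[u8] = b"0123456789012345"; // example 16-byte AES key
    // The column name becomes the last '.'-separated segment of the config key,
    // and the key material must be hex encoded before calling set.
    let config_key = format!("column_keys_as_hex.{column}");
    let config_value = to_hex(key_bytes);
    println!("{config_key} = {config_value}");
}
```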
If the encryption related settings were directly set on the TableParquetOptions or in a crypto/encryption namespace rather than in ParquetOptions, then I think we could avoid this issue. But then they'd probably need to be included in ParquetReadOptions too to work with SessionContext::read_parquet (see related comment at https://github.com/apache/datafusion/pull/16351/files#r2136718671).
I've opened a PR against your branch that implements the suggestion to move these configuration options under a new field in TableParquetOptions: corwinjoy#5. I think this worked quite nicely and simplified the ConfigField implementations.
datafusion/common/src/config.rs
Outdated
pub column_metadata_as_hex: HashMap<String, String>,
pub aad_prefix_as_hex: String,
pub store_aad_prefix: bool, // default = false
}
We create a separate Config struct, then use From methods to convert back and forth from the underlying parquet FileEncryptionProperties.
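As a sketch of that conversion pattern, using simplified stand-in types (the real FileEncryptionProperties lives in the parquet crate and the real config struct has more fields, such as the AAD prefix), the From implementations might look like:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the parquet-side and config-side structs.
#[derive(Debug, PartialEq)]
struct FileEncryptionProps {
    footer_key: Vec<u8>,
    column_keys: HashMap<String, Vec<u8>>,
}

#[derive(Debug, PartialEq)]
struct ConfigFileEncryptionProps {
    footer_key_as_hex: String,
    column_keys_as_hex: HashMap<String, String>,
}

// Stand-ins for hex::encode() / hex::decode() so this sketch is self-contained.
fn encode(b: &[u8]) -> String {
    b.iter().map(|x| format!("{x:02x}")).collect()
}

fn decode(s: &str) -> Vec<u8> {
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).unwrap())
        .collect()
}

// Config -> parquet-side properties: decode the hex strings back to bytes.
impl From<&ConfigFileEncryptionProps> for FileEncryptionProps {
    fn from(c: &ConfigFileEncryptionProps) -> Self {
        Self {
            footer_key: decode(&c.footer_key_as_hex),
            column_keys: c
                .column_keys_as_hex
                .iter()
                .map(|(k, v)| (k.clone(), decode(v)))
                .collect(),
        }
    }
}

// Parquet-side properties -> config: hex encode so values are plain strings.
impl From<&FileEncryptionProps> for ConfigFileEncryptionProps {
    fn from(p: &FileEncryptionProps) -> Self {
        Self {
            footer_key_as_hex: encode(&p.footer_key),
            column_keys_as_hex: p
                .column_keys
                .iter()
                .map(|(k, v)| (k.clone(), encode(v)))
                .collect(),
        }
    }
}

fn main() {
    let props = FileEncryptionProps {
        footer_key: b"0123456789012345".to_vec(),
        column_keys: HashMap::from([("id".to_string(), b"abcdefghabcdefgh".to_vec())]),
    };
    // Round-trip through the config representation.
    let config = ConfigFileEncryptionProps::from(&props);
    let back = FileEncryptionProps::from(&config);
    assert_eq!(props, back);
}
```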
datafusion/common/src/config.rs
Outdated
pub file_decryption_properties: Option<ConfigFileDecryptionProperties>, default = None

/// Optional file encryption properties
pub file_encryption_properties: Option<ConfigFileEncryptionProperties>, default = None
@adamreeve and I are not completely sure where these settings should go. On the session context there's only a way to set the "global" ParquetOptions, but not TableParquetOptions, which contains extra table-specific settings.
It does feel a bit wrong to put file-specific decryption properties in the execution context (see later examples). E.g. if users were reading two different encrypted Parquet files in one query, they might need to set different decryption properties for each file, so setting them in the execution context wouldn't work. At the moment I think this scenario would require creating separate listing tables and specifying TableParquetOptions. That's an edge case, so maybe I'm overthinking this, but maybe being able to set file decryption properties in ParquetReadOptions would be a good idea?
This doesn't really fit all that well with the reader options that Parquet has, though.
maybe @metesynnada or @berkaysynnada have some ideas of how to do this
@adamreeve has a nice PR to move this all to a crypto namespace, which cleans this up a lot. We are still debating a bit, since we want to understand the impact downstream for tools like delta-rs.
corwinjoy#5
datafusion/common/src/config.rs
Outdated
.unwrap();

for (i, col_name) in column_names.iter().enumerate() {
    let key = format!("file_encryption_properties.column_keys_as_hex.{col_name}");
Note use of '.' as separator for column name, as mentioned above.
@alamb One piece I would like to solicit feedback on is whether there is a way to leverage the existing tests to more thoroughly vet encryption. What I mean is that we uncovered a read bug when using filters in a query, and I worry that there could be other edge cases that might not be covered. What I would like to do is take an encrypted parquet file and then run the datafusion SQL tests over it (and maybe other operation tests). This would help make sure that all the SQL operations are really covered. And maybe, in addition, somehow double-check things like statistics and bloom filters? Anyway, I'm hoping there is a way to leverage the existing test suite to cover these cases. Any suggestions?
Thank you and @adamreeve for driving so much of the modular encryption work! I'll take a look at this branch this week and see how this might get Comet supporting modular encryption within Spark, or if any obvious gaps jump out at me.
I am sorry I haven't had a chance to review this yet. It would be great if @mbutrovich could also take a look. I have this on my list to review, but I haven't been able to find the time yet.
Move encryption and decryption configuration options into a separate crypto namespace
I've been experimenting with how this work could be extended to support more ways of configuring encryption beyond having fixed and known AES keys for all files. For example, data encryption keys are often randomly generated per file in multi-file datasets, and the keys are stored encrypted in the Parquet file's encryption metadata. I've got an example of how this could work that integrates with the parquet-key-management crate in a draft PR here if anyone is interested. I've added a new … This should be a follow-up PR rather than part of this PR, but I think it's worth mentioning here, as this will require adding a separate way to configure encryption rather than using the new configuration options.
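For readers unfamiliar with the pattern being described, here is a toy sketch of the envelope-encryption idea: each file gets its own randomly generated data encryption key (DEK), and only a wrapped (encrypted) form of that DEK is stored in the file's metadata. The XOR "wrapping" below is purely illustrative and NOT cryptography; real implementations delegate wrapping to a key management service, which is what the parquet-key-management crate integrates with.

```rust
// Toy illustration of per-file key wrapping. NOT real cryptography:
// the XOR "KMS" is a stand-in for calls to an external key-management service.
struct ToyKms {
    master: u8,
}

impl ToyKms {
    fn wrap(&self, dek: &[u8]) -> Vec<u8> {
        dek.iter().map(|b| b ^ self.master).collect()
    }
    fn unwrap_key(&self, wrapped: &[u8]) -> Vec<u8> {
        wrapped.iter().map(|b| b ^ self.master).collect()
    }
}

fn main() {
    let kms = ToyKms { master: 0x5a };
    // Each file would get its own randomly generated DEK; fixed here for clarity.
    let dek: Vec<u8> = vec![1, 2, 3, 4];
    // Only the wrapped DEK is stored in the file's encryption metadata...
    let stored = kms.wrap(&dek);
    // ...and a reader recovers the DEK by asking the KMS to unwrap it.
    assert_eq!(kms.unwrap_key(&stored), dek);
}
```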
I still owe this a look. I am traveling until July 7, unfortunately, and likely won't get a chance to put it through its paces with Comet until after then (I need to do some Comet work to get it working with this branch).
Thank you @corwinjoy and @adamreeve -- this PR was a joy to read and review. The code is clear, well commented, and well tested ❤️ 🏆
I think we should follow up with:
- Improve the documentation to include the format required for encryption/decryption properties
- Consider adding an `encryption` or similar feature flag so people who don't want support for parquet encryption can avoid bringing along the dependencies
@@ -246,4 +246,72 @@ mod tests {
    Ok(())
}

#[tokio::test]
async fn roundtrip_parquet_with_encryption() -> Result<()> {
I wonder why this isn't in core/tests as well 🤔 (I see you are just following the existing pattern, I just noticed this while reviewing this PR)
I'm happy to move it if you think we should. As you note, I am following the pattern of the existing tests but it may fit better elsewhere.
Maybe we can move it in a follow on PR -- I would sort of expect the tests to be in https://github.com/apache/datafusion/tree/main/datafusion/core/tests/dataframe, perhaps in a file named parquet.rs.
// specific language governing permissions and limitations
// under the License.

use datafusion::common::DataFusionError;
I ran this example and it works great
===============================================================================
Encrypted Parquet DataFrame:
Schema:
+------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+-----------------+------------+---------------------+
| describe | id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col | string_col | timestamp_col |
+------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+-----------------+------------+---------------------+
| count | 8.0 | 8 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8 | 8 | 8 |
| null_count | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 |
| mean | 3.5 | null | 0.5 | 0.5 | 0.5 | 5.0 | 0.550000011920929 | 5.05 | null | null | null |
| std | 2.4494897427831783 | null | 0.5345224838248488 | 0.5345224838248488 | 0.5345224838248488 | 5.3452248382484875 | 0.5879747449513427 | 5.398677086630973 | null | null | null |
| min | 0.0 | null | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 01/01/09 | 0 | 2009-01-01T00:00:00 |
| max | 7.0 | null | 1.0 | 1.0 | 1.0 | 10.0 | 1.100000023841858 | 10.1 | 04/01/09 | 1 | 2009-04-01T00:01:00 |
| median | 3.0 | null | 0.0 | 0.0 | 0.0 | 5.0 | 0.550000011920929 | 5.05 | null | null | null |
+------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+-----------------+------------+---------------------+
Selected rows and columns:
+----+----------+---------------------+
| id | bool_col | timestamp_col |
+----+----------+---------------------+
| 6 | true | 2009-04-01T00:00:00 |
| 7 | false | 2009-04-01T00:01:00 |
+----+----------+---------------------+
docs/source/user-guide/configs.md
Outdated
@@ -81,6 +81,8 @@ Environment variables are read during `SessionConfig` initialisation so they mus
| datafusion.execution.parquet.allow_single_file_parallelism | true | (writing) Controls whether DataFusion will attempt to speed up writing parquet files by serializing them in parallel. Each column in each row group in each output file are serialized in parallel leveraging a maximum possible core count of n_files*n_row_groups*n_columns. |
| datafusion.execution.parquet.maximum_parallel_row_group_writers | 1 | (writing) By default parallel parquet writer is tuned for minimum memory usage in a streaming execution plan. You may see a performance benefit when writing large parquet files by increasing maximum_parallel_row_group_writers and maximum_buffered_record_batches_per_stream if your system has idle cores and can tolerate additional memory usage. Boosting these values is likely worthwhile when writing out already in-memory data, such as from a cached data frame. |
| datafusion.execution.parquet.maximum_buffered_record_batches_per_stream | 2 | (writing) By default parallel parquet writer is tuned for minimum memory usage in a streaming execution plan. You may see a performance benefit when writing large parquet files by increasing maximum_parallel_row_group_writers and maximum_buffered_record_batches_per_stream if your system has idle cores and can tolerate additional memory usage. Boosting these values is likely worthwhile when writing out already in-memory data, such as from a cached data frame. |
| datafusion.execution.parquet.file_decryption_properties | NULL | Optional file decryption properties |
It would be nice to document the format of these properties -- for example mention they are hex encoded keys of whatever type, or perhaps add a link to the appropriate documentation
For example, how would you configure parquet encryption from the datafusion-cli (set datafusion.execution.parquet.file_decryption_properties = ???)?
To be clear, I don't think we need to have a super easy to configure system at first, but I do think it is important to document and point people in the right direction if they get here.
Yes, this is a good suggestion. We actually need to update the TableParquetOptions docs and remove this entry, since this got moved. @alamb, one question: can you suggest where to put a CLI usage example? I guess I could add something under datafusion-cli/tests/sql. The options will look like what we have for KMS, but I want to set up a running example. E.g. for the KMS we have:
let ddl = format!(
    "CREATE EXTERNAL TABLE encrypted_parquet_table_2 \
     STORED AS PARQUET LOCATION '{file_path}' OPTIONS (\
     'format.crypto.factory_id' '{ENCRYPTION_FACTORY_ID}' \
     )"
);
Which issue does this PR close?
What changes are included in this PR?
This PR adds support for encryption in DataFusion’s Parquet implementation. The changes introduce new configuration options for file encryption and decryption properties, update various components (including proto conversion, file reading/writing, and tests), and add an end-to-end encrypted Parquet example.
Are these changes tested?
Tests and examples have been added to demonstrate and test functionality against Parquet modular encryption. These could use feedback, since there may be additional DataFusion use cases that should be covered.
Are there any user-facing changes?
Additional options have been added to allow encryption/decryption configuration. We are soliciting additional feedback on how to handle key columns in a way that best fits the existing API.