fix: support nullable columns in pre-sorted data sources #16783

crepererum · 2025-07-15T11:04:00Z

Which issue does this PR close?

-

Rationale for this change

There are valid cases where files are pre-sorted on nullable columns. Arrow and DataFusion have proper semantics for that.

What changes are included in this PR?

Removes the special-case handling of nullable columns. The Arrow row encoder used in this code path fully supports nullable columns (or at least fails cleanly if it does not), including both nulls-first and nulls-last ordering.
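To make the nulls-first / nulls-last semantics concrete, here is a minimal std-only sketch of a null-aware comparison. It is not the Arrow row encoder and not the code touched by this PR; `NullableSortOptions`, `cmp_nullable` and `is_sorted_nullable` are hypothetical names, and the two flags are only modeled on the `descending` / `nulls_first` fields of Arrow's `SortOptions`.

```rust
use std::cmp::Ordering;

/// Hypothetical per-column sort options, modeled on the two flags of
/// Arrow's `SortOptions` (`descending`, `nulls_first`).
#[derive(Clone, Copy)]
pub struct NullableSortOptions {
    pub descending: bool,
    pub nulls_first: bool,
}

/// Compare two nullable values; `None` stands for SQL NULL.
pub fn cmp_nullable<T: Ord>(
    a: &Option<T>,
    b: &Option<T>,
    opts: NullableSortOptions,
) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        // Null placement is decided by `nulls_first` alone,
        // independent of ascending/descending.
        (None, Some(_)) => {
            if opts.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if opts.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            let ord = x.cmp(y);
            if opts.descending { ord.reverse() } else { ord }
        }
    }
}

/// Returns true if `rows` is sorted on a single nullable column.
pub fn is_sorted_nullable<T: Ord>(rows: &[Option<T>], opts: NullableSortOptions) -> bool {
    rows.windows(2)
        .all(|w| cmp_nullable(&w[0], &w[1], opts) != Ordering::Greater)
}
```

Under `ASC NULLS LAST` (both flags `false`), a column such as `1, 2, NULL` counts as sorted, while `NULL, 1` does not.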

Are these changes tested?

regression test (added in 1st commit, fixed in 2nd commit)
extended/changed unit test for split_groups_by_statistics
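For context on what grouping by statistics has to handle once NULL bounds are allowed, here is a simplified std-only sketch. It is not DataFusion's `split_groups_by_statistics`: `FileStats`, `key` and `split_by_statistics` are hypothetical names, and the sketch assumes a single `i64` sort column ordered `ASC NULLS LAST` plus a greedy first-fit strategy.

```rust
/// Hypothetical min/max statistics of one file for a single nullable
/// sort column. `None` stands for NULL.
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub name: &'static str,
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Sort key under the assumed ASC NULLS LAST ordering:
/// non-null values (`false`) sort before NULL (`true`).
fn key(v: Option<i64>) -> (bool, i64) {
    (v.is_none(), v.unwrap_or(0))
}

/// Greedily pack files into groups so that, within each group, every
/// file starts strictly after the previous file ends.
pub fn split_by_statistics(mut files: Vec<FileStats>) -> Vec<Vec<FileStats>> {
    // Visit files in order of their minimum value.
    files.sort_by_key(|f| key(f.min));

    let mut groups: Vec<Vec<FileStats>> = Vec::new();
    for file in files {
        // First group whose last file ends before this file begins.
        let slot = groups.iter_mut().find(|g| {
            let last = g.last().expect("groups are never empty");
            key(last.max) < key(file.min)
        });
        match slot {
            Some(group) => group.push(file),
            None => groups.push(vec![file]),
        }
    }
    groups
}
```

In this sketch a file whose maximum is NULL closes its group for any further file, because no minimum sorts strictly after NULL under NULLS LAST.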

Are there any user-facing changes?

Some queries are now faster.

crepererum · 2025-07-15T13:22:20Z

datafusion/sqllogictest/test_files/parquet.slt

@@ -130,8 +130,7 @@ STORED AS PARQUET;
 ----
 3

-# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec,
-# due to there being more files than partitions:
+# Check output plan again


I think the actual reason why the output_ordering was missing here wasn't the number of files, but that DF had trouble with a column that wasn't marked as NOT NULL (i.e. is nullable).

alamb

Thank you @crepererum -- I reviewed the code and test changes carefully and I think they all make sense

alamb · 2025-07-15T20:27:30Z

datafusion/sqllogictest/test_files/parquet.slt

@@ -130,8 +130,7 @@ STORED AS PARQUET;
 ----
 3

-# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec,
-# due to there being more files than partitions:
+# Check output plan again


I agree -- I reviewed the test and the table definition explicitly says

WITH ORDER (string_col ASC NULLS LAST, int_col ASC NULLS LAST)

So I would expect this plan not to have additional sorting

alamb · 2025-07-15T20:30:59Z

datafusion/datasource/src/statistics.rs

@@ -230,14 +230,7 @@ impl MinMaxStatistics {
                .zip(sort_columns.iter().copied())
                .map(|(sort_expr, column)| {
                    let schema = values.schema();
-


This appears to be the only actual code change: remove these lines

git praise says they came in via #9593 from @suremarc. Do you remember why this condition was added, @suremarc?

https://github.com/apache/datafusion/blame/a614716e7d97ff1d3374aef31b9a66fd10141423/datafusion/datasource/src/statistics.rs#L238

alamb · 2025-07-17T21:31:37Z

Thanks @crepererum

test: add regression test

9bf6bc4

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and datasource (Changes to the datasource crate) labels Jul 15, 2025

fix: support nullable columns in pre-sorted data sources

e5c1777

crepererum force-pushed the crepererum/handle-null-cols-in-files branch from b5fd354 to e5c1777 Compare July 15, 2025 13:17

crepererum commented Jul 15, 2025

crepererum marked this pull request as ready for review July 15, 2025 13:42

alamb approved these changes Jul 15, 2025

crepererum added a commit to influxdata/arrow-datafusion that referenced this pull request Jul 17, 2025

fix: support nullable columns in pre-sorted data sources (apache#16783)

f33e866

crepererum mentioned this pull request Jul 17, 2025

Patched DF 48.0.1 (take 1) influxdata/arrow-datafusion#69

Draft

crepererum added a commit to influxdata/arrow-datafusion that referenced this pull request Jul 17, 2025

fix: support nullable columns in pre-sorted data sources (apache#16783)

7873b5b

crepererum added a commit to influxdata/arrow-datafusion that referenced this pull request Jul 17, 2025

fix: support nullable columns in pre-sorted data sources (apache#16783)

3670b67

alamb merged commit 2a33c87 into apache:main Jul 17, 2025
28 checks passed
