Skip to content

fix: support nullable columns in pre-sorted data sources #16783

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 17, 2025

Conversation

crepererum
Copy link
Contributor

@crepererum crepererum commented Jul 15, 2025

Which issue does this PR close?

-

Rationale for this change

There are valid cases where files are pre-sorted on nullable columns. Arrow and DataFusion have proper semantic for that.

What changes are included in this PR?

Removes an exception / special-handling for nullable columns. The Arrow row encoder that is used in the respective code fully supports nullable columns (or at least it will cleanly fail if it doesn't), INCLUDING nulls-first&last.

Are these changes tested?

  • regression test (added in 1st commit, fixed in 2nd commit)
  • extended/changed unit test for split_groups_by_statistics

Are there any user-facing changes?

Some queries are now faster.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels Jul 15, 2025
@crepererum crepererum force-pushed the crepererum/handle-null-cols-in-files branch from b5fd354 to e5c1777 Compare July 15, 2025 13:17
@@ -130,8 +130,7 @@ STORED AS PARQUET;
----
3

# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec,
# due to there being more files than partitions:
# Check output plan again
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the actual reason why the output_ordering was missing here wasn't the number of files, but because DF had trouble with a column that wasn't marked as NOT NULL (i.e. is nullable).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- I reviewed the test and the table definition explicitly says

WITH ORDER (string_col ASC NULLS LAST, int_col ASC NULLS LAST)

So I would expect this plan not to have additional sorting

@crepererum crepererum marked this pull request as ready for review July 15, 2025 13:42
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @crepererum -- I reviewed the code and test changes carefully and I think they all make sense

@@ -130,8 +130,7 @@ STORED AS PARQUET;
----
3

# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec,
# due to there being more files than partitions:
# Check output plan again
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- I reviewed the test and the table definition explicitly says

WITH ORDER (string_col ASC NULLS LAST, int_col ASC NULLS LAST)

So I would expect this plan not to have additional sorting

@@ -230,14 +230,7 @@ impl MinMaxStatistics {
.zip(sort_columns.iter().copied())
.map(|(sort_expr, column)| {
let schema = values.schema();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be the only actual code change: remove these lines

git praise says they came in via #9593 from @suremarc . Do you remember why this condition was added @suremarc ?

https://github.com/apache/datafusion/blame/a614716e7d97ff1d3374aef31b9a66fd10141423/datafusion/datasource/src/statistics.rs#L238

crepererum added a commit to influxdata/arrow-datafusion that referenced this pull request Jul 17, 2025
crepererum added a commit to influxdata/arrow-datafusion that referenced this pull request Jul 17, 2025
crepererum added a commit to influxdata/arrow-datafusion that referenced this pull request Jul 17, 2025
@alamb alamb merged commit 2a33c87 into apache:main Jul 17, 2025
28 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 17, 2025

Thanks @crepererum

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants