-
Notifications
You must be signed in to change notification settings - Fork 1.6k
fix: support nullable columns in pre-sorted data sources #16783
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: support nullable columns in pre-sorted data sources #16783
Conversation
b5fd354
to
e5c1777
Compare
@@ -130,8 +130,7 @@ STORED AS PARQUET; | |||
---- | |||
3 | |||
|
|||
# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec, | |||
# due to there being more files than partitions: | |||
# Check output plan again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the actual reason why the output_ordering
was missing here wasn't the number of files, but because DF had trouble with a column that wasn't marked as NOT NULL
(i.e. is nullable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- I reviewed the test and the table definition explicitly says
WITH ORDER (string_col ASC NULLS LAST, int_col ASC NULLS LAST)
So I would expect this plan not to have additional sorting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @crepererum -- I reviewed the code and test changes carefully and I think they all make sense
@@ -130,8 +130,7 @@ STORED AS PARQUET; | |||
---- | |||
3 | |||
|
|||
# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec, | |||
# due to there being more files than partitions: | |||
# Check output plan again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- I reviewed the test and the table definition explicitly says
WITH ORDER (string_col ASC NULLS LAST, int_col ASC NULLS LAST)
So I would expect this plan not to have additional sorting
@@ -230,14 +230,7 @@ impl MinMaxStatistics { | |||
.zip(sort_columns.iter().copied()) | |||
.map(|(sort_expr, column)| { | |||
let schema = values.schema(); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @crepererum |
Which issue does this PR close?
-
Rationale for this change
There are valid cases where files are pre-sorted on nullable columns. Arrow and DataFusion have proper semantic for that.
What changes are included in this PR?
Removes an exception / special-handling for nullable columns. The Arrow row encoder that is used in the respective code fully supports nullable columns (or at least it will cleanly fail if it doesn't), INCLUDING nulls-first&last.
Are these changes tested?
split_groups_by_statistics
Are there any user-facing changes?
Some queries are now faster.