-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Set the default value of datafusion.execution.collect_statistics
to true
#16447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I think we should run some benchmarks here to see if the regression is gone |
Definitely, I'll run our benchmarks once I get all tests passing here. |
I'm getting a lot of sqllogictest failures, is there a reason to think there something weird going on? I was somewhat open to the idea its all fine until I ran into the last test in I'll push all of them soon so others can also take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @AdamGS !
I think the reason that so many of the plans changed is that there is rule that skips repartitioning input files if they are past some threshold (repartition_file_min_size
) and now that the statistics are enabled by default this code is triggered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @AdamGS -- this looks great to me and I think the plan changes are also good / reasonable.
I think we should also add a note to the upgrade guide about this change: https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md
FYI @davisp
|
||
# cleanup | ||
statement ok | ||
DROP TABLE test_table; | ||
|
||
###### | ||
# When the setting is true, the statistics are gathered | ||
# When the setting is true, statistics are gathered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to other reviewers, the negative is still tested below:
When the setting is false, the statistics are NOT gathered
Added a short upgrade note |
Our benchmarks show this change fixes the performance regression we saw - vortex-data/vortex#3567 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR changes the default of datafusion.execution.collect_statistics
from false
to true
, updating documentation, configuration defaults, upgrade notes, and test expectations to match the new behavior.
- Update user‐guide and config docs to reflect that statistics are now gathered by default.
- Amend upgrade guide and example override to the new default.
- Adjust code comments, configuration struct, and tests (unit and SQL logic tests) for default‐true behavior.
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
docs/source/user-guide/sql/ddl.md | Note updated to describe reading files by default |
docs/source/user-guide/configs.md | Default value for collect_statistics set to true |
docs/source/library-user-guide/upgrading.md | Upgrade notes and example override updated |
datafusion/sqllogictest/test_files/repartition.slt | Physical plan ordering updated |
datafusion/sqllogictest/test_files/parquet_statistics.slt | Expected stats presence in plan updated |
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt | Removed redundant nodes, fixed spelling |
datafusion/sqllogictest/test_files/limit.slt | Stripped byte-range suffixes from file paths |
datafusion/sqllogictest/test_files/information_schema.slt | Schema output default toggled to true |
datafusion/sqllogictest/test_files/explain_tree.slt | Regenerated explain output formatting |
datafusion/core/tests/parquet/row_group_pruning.rs | Adjusted expected pruned_by_stats counts |
datafusion/core/src/execution/context/parquet.rs | Doc comments and test assertions updated |
datafusion/common/src/config.rs | Config default and doc comment updated |
Comments suppressed due to low confidence (2)
docs/source/library-user-guide/upgrading.md:30
- [nitpick] The sentence is incomplete. Consider rephrasing to clarify what is 'previous' (e.g., "restores the previous default behavior of
ListingTable
").
This change also restores the default behavior of `ListingTable` to its previous. If you use it directly
docs/source/library-user-guide/upgrading.md:36
- Verify that the builder method name matches the actual API. If it's named
with_collect_statistics
rather thanwith_collect_stat
, update this example accordingly.
.with_collect_stat(false)
This is required for apache/datafusion#16447 to take effect (once its merged). By default datafusion will take care of that if you use the SQL interface (and do all the session state setup). Signed-off-by: Adam Gutglick <adam@spiraldb.com>
This is required for apache/datafusion#16447 to take effect (once its merged). By default datafusion will take care of that if you use the SQL interface (and do all the session state setup). Signed-off-by: Adam Gutglick <adam@spiraldb.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?