Skip to content

Set the default value of datafusion.execution.collect_statistics to true #16447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 19, 2025

Conversation

AdamGS
Copy link
Contributor

@AdamGS AdamGS commented Jun 18, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Jun 18, 2025
@jonathanc-n
Copy link
Contributor

I think we should run some benchmarks here to see if the regression is gone

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 18, 2025

Definitely, I'll run our benchmarks once I get all tests passing here.

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 18, 2025

I'm getting a lot of sqllogictest failures, is there a reason to think there something weird going on? I was somewhat open to the idea its all fine until I ran into the last test in union_by_name.slt suddenly passing.

I'll push all of them soon so others can also take a look.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @AdamGS !

I think the reason that so many of the plans changed is that there is rule that skips repartitioning input files if they are past some threshold (repartition_file_min_size) and now that the statistics are enabled by default this code is triggered

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @AdamGS -- this looks great to me and I think the plan changes are also good / reasonable.

I think we should also add a note to the upgrade guide about this change: https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md

FYI @davisp


# cleanup
statement ok
DROP TABLE test_table;

######
# When the setting is true, the statistics are gathered
# When the setting is true, statistics are gathered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to other reviewers, the negative is still tested below:

When the setting is false, the statistics are NOT gathered

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 18, 2025

Added a short upgrade note

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 18, 2025

Got a similar test failure to #16448 (issue filed in #16452). I have to conclude its personal at this point, I'll try and find some time to dig into it.

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 19, 2025

Our benchmarks show this change fixes the performance regression we saw - vortex-data/vortex#3567

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR changes the default of datafusion.execution.collect_statistics from false to true, updating documentation, configuration defaults, upgrade notes, and test expectations to match the new behavior.

  • Update user‐guide and config docs to reflect that statistics are now gathered by default.
  • Amend upgrade guide and example override to the new default.
  • Adjust code comments, configuration struct, and tests (unit and SQL logic tests) for default‐true behavior.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Show a summary per file
File Description
docs/source/user-guide/sql/ddl.md Note updated to describe reading files by default
docs/source/user-guide/configs.md Default value for collect_statistics set to true
docs/source/library-user-guide/upgrading.md Upgrade notes and example override updated
datafusion/sqllogictest/test_files/repartition.slt Physical plan ordering updated
datafusion/sqllogictest/test_files/parquet_statistics.slt Expected stats presence in plan updated
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt Removed redundant nodes, fixed spelling
datafusion/sqllogictest/test_files/limit.slt Stripped byte-range suffixes from file paths
datafusion/sqllogictest/test_files/information_schema.slt Schema output default toggled to true
datafusion/sqllogictest/test_files/explain_tree.slt Regenerated explain output formatting
datafusion/core/tests/parquet/row_group_pruning.rs Adjusted expected pruned_by_stats counts
datafusion/core/src/execution/context/parquet.rs Doc comments and test assertions updated
datafusion/common/src/config.rs Config default and doc comment updated
Comments suppressed due to low confidence (2)

docs/source/library-user-guide/upgrading.md:30

  • [nitpick] The sentence is incomplete. Consider rephrasing to clarify what is 'previous' (e.g., "restores the previous default behavior of ListingTable").
This change also restores the default behavior of `ListingTable` to its previous. If you use it directly

docs/source/library-user-guide/upgrading.md:36

  • Verify that the builder method name matches the actual API. If it's named with_collect_statistics rather than with_collect_stat, update this example accordingly.
    .with_collect_stat(false)

onursatici pushed a commit to vortex-data/vortex that referenced this pull request Jun 19, 2025
This is required for apache/datafusion#16447 to
take effect (once its merged). By default datafusion will take care of
that if you use the SQL interface (and do all the session state setup).

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
onursatici pushed a commit to vortex-data/vortex that referenced this pull request Jun 19, 2025
This is required for apache/datafusion#16447 to
take effect (once its merged). By default datafusion will take care of
that if you use the SQL interface (and do all the session state setup).

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Copy link
Contributor

@blaginin blaginin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fuzzy failed again, presumably due to #16452 - rerun resolved it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants