Skip to content

TopK dynamic filter pushdown attempt 2 #15770

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Jun 17, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions datafusion/common/src/config.rs
Original file line number Diff line number Diff line change
Expand Up @@ -614,6 +614,13 @@ config_namespace! {
/// during aggregations, if possible
pub enable_topk_aggregation: bool, default = true

/// When set to true attempts to push down dynamic filters generated by operators into the file scan phase.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could also extend non-scan operators with the dynamic filter pushdown?

It would be interesting to extend filters / aggregates / joins / etc. with the support of this as well...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the idea be to prune hash table state, for example, if we knew some of the groups were no longer needed?

I do think implementing more "late materialization" (aka turn on filter_pushdown) will help too

/// For example, for a query such as `SELECT * FROM t ORDER BY timestamp DESC LIMIT 10`, the optimizer
/// will attempt to push down the current top 10 timestamps that the TopK operator references into the file scans.
/// This means that if we already have 10 timestamps in the year 2025
/// any files that only have timestamps in the year 2024 can be skipped / pruned at various stages in the scan.
pub enable_dynamic_filter_pushdown: bool, default = true

/// When set to true, the optimizer will insert filters before a join between
/// a nullable and non-nullable column to filter out nulls on the nullable side. This
/// filter can add additional overhead when the file format does not fully support
Expand Down
1 change: 1 addition & 0 deletions datafusion/core/tests/fuzz_cases/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ mod join_fuzz;
mod merge_fuzz;
mod sort_fuzz;
mod sort_query_fuzz;
mod topk_filter_pushdown;

mod aggregation_fuzzer;
mod equivalence;
Expand Down
Loading