Skip to content

fix: extend recursive protection to prevent stack overflows in additional functions #16506

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ahmed-mez
Copy link
Contributor

Which issue does this PR close?

Fixes stack overflows caused by deeply nested query plans during recursive function calls in various optimizer and expression evaluation paths.

Rationale for this change

DataFusion was experiencing stack overflow panics when processing deeply nested query plans. The issue occurred because many functions throughout the codebase perform recursive operations on query plans and expressions without stack overflow protection. When these operations encounter deeply nested structures, they can exhaust the call stack, causing the application to crash.

This change extends the existing recursive protection mechanism to additional functions across the optimizer, physical expression planner, and other core components that were previously unprotected.

Note that #16404 also helped mitigate the stack overflows I encountered with my reproducer, very likely because cloning Expr is not stack friendly especially when Expr is deeply nested. This is another data point in favour of #9577.

What changes are included in this PR?

  • Added #[cfg_attr(feature = "recursive_protection", recursive::recursive)] annotations to various recursive functions across multiple crates
  • Extended the recursive_protection feature to datafusion-physical-expr and datafusion-substrait crates

Are these changes tested?

The changes are primarily defensive additions that activate existing recursion protection when the recursive_protection feature is enabled. The recursion protection mechanism itself is already used in the codebase. Ideally we should work on automating the detection of lacking recursion protection.

About the reproducer: it's a 5MB+ Substrait JSON plan that's deeply nested (but valid). I don't see value in sharing it publicly as it contains various custom extensions and can't be used without an adequate Datadog-specific setup. However It helped me debug and validate the proposed changes.

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate substrait Changes to the substrait crate labels Jun 23, 2025
@alamb
Copy link
Contributor

alamb commented Jun 23, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing ahmed/recursion-protection (fe61624) to aa1e6da diff
BENCH_NAME=sql_planner
BENCH_COMMAND=cargo bench --bench sql_planner
BENCH_FILTER=
BENCH_BRANCH_NAME=ahmed_recursion-protection
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jun 23, 2025

🤖: Benchmark completed

Details

group                                         ahmed_recursion-protection             main
-----                                         --------------------------             ----
logical_aggregate_with_join                   1.01    632.1±3.78µs        ? ?/sec    1.00    627.7±2.76µs        ? ?/sec
logical_select_all_from_1000                  1.02     11.2±0.03ms        ? ?/sec    1.00     11.1±0.03ms        ? ?/sec
logical_select_one_from_700                   1.00    414.0±1.70µs        ? ?/sec    1.01    417.4±7.10µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00    374.8±1.41µs        ? ?/sec    1.00    373.5±2.43µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.01    362.1±2.71µs        ? ?/sec    1.00    358.7±1.67µs        ? ?/sec
physical_intersection                         1.00    837.1±3.59µs        ? ?/sec    1.00    835.1±6.04µs        ? ?/sec
physical_join_consider_sort                   1.01   1390.9±8.24µs        ? ?/sec    1.00   1383.5±3.64µs        ? ?/sec
physical_join_distinct                        1.00    352.1±2.40µs        ? ?/sec    1.00    350.7±1.16µs        ? ?/sec
physical_many_self_joins                      1.00     10.3±0.05ms        ? ?/sec    1.00     10.3±0.05ms        ? ?/sec
physical_plan_clickbench_all                  1.01    187.7±0.97ms        ? ?/sec    1.00    186.6±1.20ms        ? ?/sec
physical_plan_clickbench_q1                   1.02      2.5±0.02ms        ? ?/sec    1.00      2.5±0.02ms        ? ?/sec
physical_plan_clickbench_q10                  1.01      3.3±0.02ms        ? ?/sec    1.00      3.3±0.03ms        ? ?/sec
physical_plan_clickbench_q11                  1.01      3.5±0.03ms        ? ?/sec    1.00      3.5±0.04ms        ? ?/sec
physical_plan_clickbench_q12                  1.01      3.7±0.02ms        ? ?/sec    1.00      3.6±0.04ms        ? ?/sec
physical_plan_clickbench_q13                  1.02      3.3±0.04ms        ? ?/sec    1.00      3.3±0.03ms        ? ?/sec
physical_plan_clickbench_q14                  1.01      3.5±0.02ms        ? ?/sec    1.00      3.5±0.02ms        ? ?/sec
physical_plan_clickbench_q15                  1.01      3.4±0.03ms        ? ?/sec    1.00      3.4±0.03ms        ? ?/sec
physical_plan_clickbench_q16                  1.01      3.3±0.02ms        ? ?/sec    1.00      3.2±0.03ms        ? ?/sec
physical_plan_clickbench_q17                  1.01      3.4±0.03ms        ? ?/sec    1.00      3.3±0.03ms        ? ?/sec
physical_plan_clickbench_q18                  1.01      2.9±0.02ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
physical_plan_clickbench_q19                  1.02      3.8±0.05ms        ? ?/sec    1.00      3.8±0.04ms        ? ?/sec
physical_plan_clickbench_q2                   1.02      2.9±0.02ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
physical_plan_clickbench_q20                  1.01      2.6±0.02ms        ? ?/sec    1.00      2.6±0.02ms        ? ?/sec
physical_plan_clickbench_q21                  1.01      2.9±0.02ms        ? ?/sec    1.00      2.9±0.03ms        ? ?/sec
physical_plan_clickbench_q22                  1.00      3.5±0.02ms        ? ?/sec    1.00      3.5±0.04ms        ? ?/sec
physical_plan_clickbench_q23                  1.01      3.8±0.02ms        ? ?/sec    1.00      3.7±0.02ms        ? ?/sec
physical_plan_clickbench_q24                  1.00      4.3±0.03ms        ? ?/sec    1.00      4.3±0.02ms        ? ?/sec
physical_plan_clickbench_q25                  1.01      3.0±0.03ms        ? ?/sec    1.00      3.0±0.02ms        ? ?/sec
physical_plan_clickbench_q26                  1.00      2.9±0.02ms        ? ?/sec    1.00      2.9±0.06ms        ? ?/sec
physical_plan_clickbench_q27                  1.01      3.1±0.04ms        ? ?/sec    1.00      3.0±0.03ms        ? ?/sec
physical_plan_clickbench_q28                  1.00      3.8±0.03ms        ? ?/sec    1.00      3.8±0.03ms        ? ?/sec
physical_plan_clickbench_q29                  1.00      4.4±0.03ms        ? ?/sec    1.00      4.4±0.03ms        ? ?/sec
physical_plan_clickbench_q3                   1.01      2.9±0.02ms        ? ?/sec    1.00      2.8±0.02ms        ? ?/sec
physical_plan_clickbench_q30                  1.00     12.6±0.09ms        ? ?/sec    1.00     12.7±0.22ms        ? ?/sec
physical_plan_clickbench_q31                  1.01      3.8±0.03ms        ? ?/sec    1.00      3.8±0.03ms        ? ?/sec
physical_plan_clickbench_q32                  1.01      3.8±0.04ms        ? ?/sec    1.00      3.8±0.03ms        ? ?/sec
physical_plan_clickbench_q33                  1.01      3.3±0.03ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_clickbench_q34                  1.00      3.0±0.03ms        ? ?/sec    1.00      3.0±0.04ms        ? ?/sec
physical_plan_clickbench_q35                  1.01      3.1±0.02ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
physical_plan_clickbench_q36                  1.00      3.8±0.03ms        ? ?/sec    1.00      3.8±0.04ms        ? ?/sec
physical_plan_clickbench_q37                  1.00      3.8±0.02ms        ? ?/sec    1.00      3.8±0.05ms        ? ?/sec
physical_plan_clickbench_q38                  1.01      3.8±0.02ms        ? ?/sec    1.00      3.8±0.02ms        ? ?/sec
physical_plan_clickbench_q39                  1.01      3.7±0.03ms        ? ?/sec    1.00      3.6±0.04ms        ? ?/sec
physical_plan_clickbench_q4                   1.01      2.6±0.01ms        ? ?/sec    1.00      2.6±0.02ms        ? ?/sec
physical_plan_clickbench_q40                  1.00      4.3±0.04ms        ? ?/sec    1.00      4.3±0.05ms        ? ?/sec
physical_plan_clickbench_q41                  1.00      3.8±0.03ms        ? ?/sec    1.00      3.8±0.05ms        ? ?/sec
physical_plan_clickbench_q42                  1.01      3.8±0.02ms        ? ?/sec    1.00      3.7±0.03ms        ? ?/sec
physical_plan_clickbench_q43                  1.01      4.1±0.04ms        ? ?/sec    1.00      4.1±0.03ms        ? ?/sec
physical_plan_clickbench_q44                  1.00      2.7±0.01ms        ? ?/sec    1.00      2.7±0.02ms        ? ?/sec
physical_plan_clickbench_q45                  1.00      2.7±0.02ms        ? ?/sec    1.00      2.7±0.02ms        ? ?/sec
physical_plan_clickbench_q46                  1.01      3.1±0.03ms        ? ?/sec    1.00      3.1±0.03ms        ? ?/sec
physical_plan_clickbench_q47                  1.00      3.7±0.03ms        ? ?/sec    1.00      3.7±0.05ms        ? ?/sec
physical_plan_clickbench_q48                  1.02      4.4±0.06ms        ? ?/sec    1.00      4.3±0.03ms        ? ?/sec
physical_plan_clickbench_q49                  1.00      4.6±0.03ms        ? ?/sec    1.00      4.6±0.02ms        ? ?/sec
physical_plan_clickbench_q5                   1.01      2.8±0.03ms        ? ?/sec    1.00      2.8±0.02ms        ? ?/sec
physical_plan_clickbench_q50                  1.01      4.1±0.02ms        ? ?/sec    1.00      4.1±0.03ms        ? ?/sec
physical_plan_clickbench_q51                  1.01      3.2±0.03ms        ? ?/sec    1.00      3.2±0.02ms        ? ?/sec
physical_plan_clickbench_q52                  1.01      4.0±0.02ms        ? ?/sec    1.00      4.0±0.04ms        ? ?/sec
physical_plan_clickbench_q6                   1.00      2.8±0.02ms        ? ?/sec    1.00      2.8±0.02ms        ? ?/sec
physical_plan_clickbench_q7                   1.01      2.5±0.01ms        ? ?/sec    1.00      2.5±0.01ms        ? ?/sec
physical_plan_clickbench_q8                   1.00      3.4±0.02ms        ? ?/sec    1.00      3.4±0.04ms        ? ?/sec
physical_plan_clickbench_q9                   1.01      3.2±0.04ms        ? ?/sec    1.00      3.2±0.03ms        ? ?/sec
physical_plan_tpcds_all                       1.01   1032.4±5.43ms        ? ?/sec    1.00   1024.9±6.78ms        ? ?/sec
physical_plan_tpch_all                        1.01     62.6±0.85ms        ? ?/sec    1.00     61.7±0.42ms        ? ?/sec
physical_plan_tpch_q1                         1.00      2.0±0.02ms        ? ?/sec    1.00      2.0±0.01ms        ? ?/sec
physical_plan_tpch_q10                        1.00      3.8±0.01ms        ? ?/sec    1.00      3.8±0.01ms        ? ?/sec
physical_plan_tpch_q11                        1.00      3.3±0.01ms        ? ?/sec    1.00      3.2±0.02ms        ? ?/sec
physical_plan_tpch_q12                        1.01  1797.1±22.09µs        ? ?/sec    1.00  1786.7±15.28µs        ? ?/sec
physical_plan_tpch_q13                        1.00  1458.0±10.00µs        ? ?/sec    1.00  1451.1±31.91µs        ? ?/sec
physical_plan_tpch_q14                        1.01  1945.1±23.22µs        ? ?/sec    1.00   1919.5±6.34µs        ? ?/sec
physical_plan_tpch_q16                        1.00      2.4±0.01ms        ? ?/sec    1.00      2.4±0.03ms        ? ?/sec
physical_plan_tpch_q17                        1.01      2.4±0.03ms        ? ?/sec    1.00      2.4±0.03ms        ? ?/sec
physical_plan_tpch_q18                        1.01      2.7±0.04ms        ? ?/sec    1.00      2.7±0.01ms        ? ?/sec
physical_plan_tpch_q19                        1.01      3.2±0.01ms        ? ?/sec    1.00      3.2±0.02ms        ? ?/sec
physical_plan_tpch_q2                         1.01      5.5±0.03ms        ? ?/sec    1.00      5.4±0.05ms        ? ?/sec
physical_plan_tpch_q20                        1.01      3.1±0.02ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
physical_plan_tpch_q21                        1.01      4.1±0.04ms        ? ?/sec    1.00      4.0±0.01ms        ? ?/sec
physical_plan_tpch_q22                        1.02      2.7±0.05ms        ? ?/sec    1.00      2.7±0.01ms        ? ?/sec
physical_plan_tpch_q3                         1.01      2.5±0.03ms        ? ?/sec    1.00      2.5±0.03ms        ? ?/sec
physical_plan_tpch_q4                         1.01  1527.3±11.40µs        ? ?/sec    1.00  1510.1±17.99µs        ? ?/sec
physical_plan_tpch_q5                         1.00      3.1±0.02ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
physical_plan_tpch_q6                         1.01    863.5±7.80µs        ? ?/sec    1.00   855.2±11.47µs        ? ?/sec
physical_plan_tpch_q7                         1.00      4.2±0.01ms        ? ?/sec    1.00      4.2±0.02ms        ? ?/sec
physical_plan_tpch_q8                         1.01      5.2±0.06ms        ? ?/sec    1.00      5.1±0.04ms        ? ?/sec
physical_plan_tpch_q9                         1.01      4.1±0.05ms        ? ?/sec    1.00      4.1±0.01ms        ? ?/sec
physical_select_aggregates_from_200           1.00     17.6±0.06ms        ? ?/sec    1.00     17.7±0.06ms        ? ?/sec
physical_select_all_from_1000                 1.01     24.9±0.08ms        ? ?/sec    1.00     24.5±0.06ms        ? ?/sec
physical_select_one_from_700                  1.00   1075.0±4.64µs        ? ?/sec    1.00   1070.9±4.45µs        ? ?/sec
physical_sorted_union_orderby                 1.00     42.1±0.29ms        ? ?/sec    1.00     42.2±0.54ms        ? ?/sec
physical_theta_join_consider_sort             1.00   1758.7±9.81µs        ? ?/sec    1.00  1751.7±11.22µs        ? ?/sec
physical_unnest_to_join                       1.01  1318.6±15.89µs        ? ?/sec    1.00   1306.9±4.49µs        ? ?/sec
with_param_values_many_columns                1.00    127.5±0.67µs        ? ?/sec    1.00    127.6±0.63µs        ? ?/sec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants