-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-52705][SQL] Refactor deterministic check for grouping expressions #51391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-52705][SQL] Refactor deterministic check for grouping expressions #51391
Conversation
cc1ec86
to
1f808b5
Compare
1f808b5
to
ab449cc
Compare
ab449cc
to
af50c4c
Compare
the spark connect test failure is unrelated, thanks, merging to master! |
"grouping expression.", | ||
context = expr.origin.getQueryContext, | ||
summary = expr.origin.context.summary) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihailotim-db and @cloud-fan , is this safe to remove?
This PR only added this logic at CheckAnalysis.scala
and PullOutNondeterministic.scala
. However, assertValidAggregation
seems to be used by more places?
$ git grep assertValidAggregation | grep -v CheckAnalysis.scala | grep -v PullOutNondeterministic.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateResolver.scala: * [[Aggregate]] using the [[ExprUtils.assertValidAggregation]], update the `scopes` with the
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateResolver.scala: ExprUtils.assertValidAggregation(resolvedAggregate)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateResolver.scala: * unintentional exceptions from being thrown by [[ExprUtils.assertValidAggregation]], so the
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/LateralColumnAliasResolver.scala: ExprUtils.assertValidAggregation(aggregate)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/ProjectResolver.scala: * validate it using the [[ExprUtils.assertValidAggregation]].
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/ProjectResolver.scala: ExprUtils.assertValidAggregation(aggregate)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala: def assertValidAggregation(a: Aggregate): Unit = {
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala: ExprUtils.assertValidAggregation(a)
Can we add back this to be safe? I don't see any negative effect to have this here. Do have a chance for this to cause any failure at single pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are moving this to CheckAnalysis
. It going to be ran just before assertValidAggregation
. For single-pass, we are going to run PullOutNondeterministic
after we finish resolving the plan. Because we are calling ExprUtils.assertValidAggregation
during the bottom-up pass, we can't have this check there because it would cause false positive failures (non-deterministic expressions would get pushed down later)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, can I understand like you are saying that this PR is technically reducing test coverage of this code path (by reducing the invocation paths) due to the single-pass requirement, @mihailotim-db ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the sense that we are not invoking this between rules but only in CheckAnalysis
, yes. But from what I can see, we only call assertValidAggregation
from single-pass analyzer, CheckAnalysis
and Optimizer, so we don't have an issue there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mihailotim-db for the explanation. Late LGTM.
cc @peter-toth |
### What changes were proposed in this pull request? Move check for non-deterministic expressions in grouping expressions from `ExprUtils` to `CheckAnalysis`. ### Why are the changes needed? This is necessary in order to be able to utilize `PullOutNonDeterminstic` rule as a post-processing rewrite rule in single-pass analyzer. Because `ExprUtils.assertValidAggregate` is called during the bottom-up traversal, we can't check for non-determinstic expressions there ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#51391 from mihailotim-db/mihailotim-db/pull_out_nondeterministic. Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? Move check for non-deterministic expressions in grouping expressions from `ExprUtils` to `CheckAnalysis`. ### Why are the changes needed? This is necessary in order to be able to utilize `PullOutNonDeterminstic` rule as a post-processing rewrite rule in single-pass analyzer. Because `ExprUtils.assertValidAggregate` is called during the bottom-up traversal, we can't check for non-determinstic expressions there ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#51391 from mihailotim-db/mihailotim-db/pull_out_nondeterministic. Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? Move check for non-deterministic expressions in grouping expressions from `ExprUtils` to `CheckAnalysis`. ### Why are the changes needed? This is necessary in order to be able to utilize `PullOutNonDeterminstic` rule as a post-processing rewrite rule in single-pass analyzer. Because `ExprUtils.assertValidAggregate` is called during the bottom-up traversal, we can't check for non-determinstic expressions there ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#51391 from mihailotim-db/mihailotim-db/pull_out_nondeterministic. Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Move check for non-deterministic expressions in grouping expressions from
ExprUtils
toCheckAnalysis
.Why are the changes needed?
This is necessary in order to be able to utilize
PullOutNonDeterminstic
rule as a post-processing rewrite rule in single-pass analyzer. BecauseExprUtils.assertValidAggregate
is called during the bottom-up traversal, we can't check for non-determinstic expressions thereDoes this PR introduce any user-facing change?
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No