Skip to content

[SPARK-52705][SQL] Refactor deterministic check for grouping expressions #51391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

mihailotim-db
Copy link
Contributor

@mihailotim-db mihailotim-db commented Jul 7, 2025

What changes were proposed in this pull request?

Move check for non-deterministic expressions in grouping expressions from ExprUtils to CheckAnalysis.

Why are the changes needed?

This is necessary in order to be able to utilize PullOutNonDeterminstic rule as a post-processing rewrite rule in single-pass analyzer. Because ExprUtils.assertValidAggregate is called during the bottom-up traversal, we can't check for non-determinstic expressions there

Does this PR introduce any user-facing change?

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jul 7, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/pull_out_nondeterministic branch from cc1ec86 to 1f808b5 Compare July 7, 2025 20:15
@mihailotim-db mihailotim-db changed the title Pull out nondeterminstic [SPARK-52705][SQL] Refactor deterministic check for grouping expressions Jul 8, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/pull_out_nondeterministic branch from 1f808b5 to ab449cc Compare July 8, 2025 06:25
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/pull_out_nondeterministic branch from ab449cc to af50c4c Compare July 8, 2025 06:53
@cloud-fan
Copy link
Contributor

cloud-fan commented Jul 8, 2025

the spark connect test failure is unrelated, thanks, merging to master!

@cloud-fan cloud-fan closed this in 8915c60 Jul 8, 2025
"grouping expression.",
context = expr.origin.getQueryContext,
summary = expr.origin.context.summary)
}
Copy link
Member

@dongjoon-hyun dongjoon-hyun Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mihailotim-db and @cloud-fan , is this safe to remove?

This PR only added this logic at CheckAnalysis.scala and PullOutNondeterministic.scala. However, assertValidAggregation seems to be used by more places?

$ git grep assertValidAggregation | grep -v CheckAnalysis.scala | grep -v PullOutNondeterministic.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateResolver.scala:   * [[Aggregate]] using the [[ExprUtils.assertValidAggregation]], update the `scopes` with the
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateResolver.scala:        ExprUtils.assertValidAggregation(resolvedAggregate)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateResolver.scala:   * unintentional exceptions from being thrown by [[ExprUtils.assertValidAggregation]], so the
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/LateralColumnAliasResolver.scala:        ExprUtils.assertValidAggregation(aggregate)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/ProjectResolver.scala:   * validate it using the [[ExprUtils.assertValidAggregation]].
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/ProjectResolver.scala:          ExprUtils.assertValidAggregation(aggregate)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:  def assertValidAggregation(a: Aggregate): Unit = {
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:          ExprUtils.assertValidAggregation(a)

Can we add back this to be safe? I don't see any negative effect to have this here. Do have a chance for this to cause any failure at single pass?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are moving this to CheckAnalysis. It going to be ran just before assertValidAggregation. For single-pass, we are going to run PullOutNondeterministic after we finish resolving the plan. Because we are calling ExprUtils.assertValidAggregation during the bottom-up pass, we can't have this check there because it would cause false positive failures (non-deterministic expressions would get pushed down later)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, can I understand like you are saying that this PR is technically reducing test coverage of this code path (by reducing the invocation paths) due to the single-pass requirement, @mihailotim-db ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the sense that we are not invoking this between rules but only in CheckAnalysis, yes. But from what I can see, we only call assertValidAggregation from single-pass analyzer, CheckAnalysis and Optimizer, so we don't have an issue there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mihailotim-db for the explanation. Late LGTM.

@dongjoon-hyun
Copy link
Member

cc @peter-toth

asl3 pushed a commit to asl3/spark that referenced this pull request Jul 14, 2025
### What changes were proposed in this pull request?

Move check for non-deterministic expressions in grouping expressions from `ExprUtils` to `CheckAnalysis`.
### Why are the changes needed?
This is necessary in order to be able to utilize `PullOutNonDeterminstic` rule as a post-processing rewrite rule in single-pass analyzer. Because `ExprUtils.assertValidAggregate` is called during the bottom-up traversal, we can't check for non-determinstic expressions there

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51391 from mihailotim-db/mihailotim-db/pull_out_nondeterministic.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ksbeyer pushed a commit to ksbeyer/spark that referenced this pull request Jul 14, 2025
### What changes were proposed in this pull request?

Move check for non-deterministic expressions in grouping expressions from `ExprUtils` to `CheckAnalysis`.
### Why are the changes needed?
This is necessary in order to be able to utilize `PullOutNonDeterminstic` rule as a post-processing rewrite rule in single-pass analyzer. Because `ExprUtils.assertValidAggregate` is called during the bottom-up traversal, we can't check for non-determinstic expressions there

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51391 from mihailotim-db/mihailotim-db/pull_out_nondeterministic.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025
### What changes were proposed in this pull request?

Move check for non-deterministic expressions in grouping expressions from `ExprUtils` to `CheckAnalysis`.
### Why are the changes needed?
This is necessary in order to be able to utilize `PullOutNonDeterminstic` rule as a post-processing rewrite rule in single-pass analyzer. Because `ExprUtils.assertValidAggregate` is called during the bottom-up traversal, we can't check for non-determinstic expressions there

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51391 from mihailotim-db/mihailotim-db/pull_out_nondeterministic.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants