Skip to content

Fix wildcard expansion for HAVING clause #12046

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Aug 22, 2024

Conversation

goldmedal
Copy link
Contributor

Which issue does this PR close?

Closes #12013 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Aug 18, 2024
Comment on lines 5661 to 5663
# because v2 is not in the group by clause, the sql is invalid
query error
select * from having_test group by v1 having max(v1) = 3
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error won't be detected when planning the aggregation because we won't expand the wildcard there. This error will be thrown when ExpandWildcardRule.

Comment on lines 756 to 757
let select_exprs_post_aggr = if having_expr_opt.is_some()
&& group_by_exprs.is_empty()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expand the wildcard only when we ensure the query isn't valid.

This comment was marked as outdated.

This comment was marked as outdated.

Copy link
Contributor

@jayzhan211 jayzhan211 Aug 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of expanding wildcard, does it make sense to return the error if neither column is in group by nor is it part of the aggregate function?

D select * from t having max(a) = 1;
Binder Error: column "a" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(a)" if the exact value of "a" is not important.

In this case, if we see wildcard, we can just return the error without actually expanding the actual columns. Maybe modify check_columns_satisfy_exprs

Copy link
Contributor Author

@goldmedal goldmedal Aug 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of expanding wildcard, does it make sense to return the error if neither column is in group by nor is it part of the aggregate function?

D select * from t having max(a) = 1;
Binder Error: column "a" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(a)" if the exact value of "a" is not important.

I think it makes sense for the having function. I can't find any case that violates this rule 🤔.

In this case, if we see wildcard, we can just return the error without actually expanding the actual columns. Maybe modify check_columns_satisfy_exprs

Instead of modifying check_columns_satisfy_exprs, I guess it would be checked at the part of the building having expression.

let having_expr_post_aggr = if let Some(having_expr) = having_expr_opt {
let having_expr_post_aggr =
rebase_expr(having_expr, &aggr_projection_exprs, input)?;
check_columns_satisfy_exprs(
&column_exprs_post_aggr,
&[having_expr_post_aggr.clone()],
"HAVING clause references non-aggregate values",
)?;

check_columns_satisfy_exprs is also used to check the normal aggregation. I think the rule can't be shared.

@@ -214,7 +215,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {

let plan = if let Some(having_expr_post_aggr) = having_expr_post_aggr {
LogicalPlanBuilder::from(plan)
.filter(having_expr_post_aggr)?
.having(having_expr_post_aggr)?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's important to mention the filter is used for the HAVING clause because the wildcard expansion behavior is different.

/// Apply a filter which is used for a having clause
pub fn having(self, expr: impl Into<Expr>) -> Result<Self> {
let expr = normalize_col(expr.into(), &self.plan)?;
Filter::try_new_having(expr, Arc::new(self.plan))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: try_new_with_having (?

@@ -301,4 +302,32 @@ mod tests {

Ok(())
}

#[test]
fn plan_having_wildcard_projection() -> Result<()> {
Copy link
Contributor

@jayzhan211 jayzhan211 Aug 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could test this kind of test in slt with explain. I prefer to add rust test only for non-trivial cases

@alamb alamb marked this pull request as draft August 19, 2024 18:25
@alamb
Copy link
Contributor

alamb commented Aug 19, 2024

CI appears to be failing so marking this as draft while we work through those details

@@ -749,11 +748,15 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
.map(|expr| rebase_expr(expr, &aggr_projection_exprs, input))
.collect::<Result<Vec<Expr>>>()?;

let wildcard_exprs = select_exprs_post_aggr.iter().filter(|expr| matches!(expr, Expr::Wildcard { .. })).collect::<Vec<_>>();
let wildcard_fields = exprlist_to_fields(wildcard_exprs, input)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that there are too many cases to handle, making it difficult to manage them cleanly. This could complicate the codebase. Currently, since we've expanded the wildcard for the schema, we can retrieve what we need based on the expanded schema.
This approach also allows us to provide correct and appropriate error messages for the user

@goldmedal goldmedal marked this pull request as ready for review August 20, 2024 16:08
@goldmedal goldmedal marked this pull request as draft August 20, 2024 16:08
@goldmedal goldmedal marked this pull request as ready for review August 20, 2024 16:34
@@ -766,9 +769,9 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
check_columns_satisfy_exprs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have check_columns_satisfy_exprs_for_having_clause if the logic is different for projection and having

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the function can be shared. The difference is that the check of having requires a return value to confirm that the validation has been executed and is valid.


wildcard_fields.into_iter().try_for_each(|(table, field)| {
let column_name = qualified_name(table, field.name());
if !column_names.iter().any(|c| c == &column_name) {

This comment was marked as outdated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two valid cases

In group by clause

  1. group by a having a = 1;

Part of an aggregate function
2. select max(a) from t having max(a) = 1;

I think we just need to check group by expressions and both select + having expressions.
If we got wildcard, I think we could ignore it, it doesn't matter 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ok. I added back the checking for the rules. I tried to add more comments to explain why we need those checks.
About the error message, we could enhance it in the future 🤔 .

@@ -119,9 +121,34 @@ pub(crate) fn check_columns_satisfy_exprs(
_ => check_column_satisfies_expr(columns, e, message_prefix)?,
}
}
let column_names = columns
.iter()
.map(|c| format!("{}", c.schema_name()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eq to c.schem_name().to_string()

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 22, 2024

@goldmedal I didn't find any query that go through the additional check, given the current test cases you added they don't require the additional check too. I suggest we start with simple test case and gradually fix more query and error message.

I don't know why can't I send the pull request to your repo, so I create another one in apache/datafusion #12106. This is just a demo for you, not for merging

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @goldmedal

@jayzhan211 jayzhan211 merged commit a50aeef into apache:main Aug 22, 2024
24 checks passed
@goldmedal goldmedal deleted the bugfix/12013-fix-having branch August 22, 2024 05:07
@goldmedal
Copy link
Contributor Author

Thanks @jayzhan211

@alamb
Copy link
Contributor

alamb commented Aug 22, 2024

🎉

@alamb alamb mentioned this pull request Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid aggregate SQL query with HAVING can be executed without error (SQLancer-TLP)
3 participants