-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix wildcard expansion for HAVING
clause
#12046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# because v2 is not in the group by clause, the sql is invalid | ||
query error | ||
select * from having_test group by v1 having max(v1) = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error won't be detected when planning the aggregation because we won't expand the wildcard there. This error will be thrown when ExpandWildcardRule
.
datafusion/sql/src/select.rs
Outdated
let select_exprs_post_aggr = if having_expr_opt.is_some() | ||
&& group_by_exprs.is_empty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expand the wildcard only when we ensure the query isn't valid.
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of expanding wildcard, does it make sense to return the error if neither column is in group by nor is it part of the aggregate function?
D select * from t having max(a) = 1;
Binder Error: column "a" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(a)" if the exact value of "a" is not important.
In this case, if we see wildcard, we can just return the error without actually expanding the actual columns. Maybe modify check_columns_satisfy_exprs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of expanding wildcard, does it make sense to return the error if neither column is in group by nor is it part of the aggregate function?
D select * from t having max(a) = 1; Binder Error: column "a" must appear in the GROUP BY clause or must be part of an aggregate function. Either add it to the GROUP BY list, or use "ANY_VALUE(a)" if the exact value of "a" is not important.
I think it makes sense for the having
function. I can't find any case that violates this rule 🤔.
In this case, if we see wildcard, we can just return the error without actually expanding the actual columns. Maybe modify
check_columns_satisfy_exprs
Instead of modifying check_columns_satisfy_exprs
, I guess it would be checked at the part of the building having expression.
datafusion/datafusion/sql/src/select.rs
Lines 762 to 770 in a91be04
let having_expr_post_aggr = if let Some(having_expr) = having_expr_opt { | |
let having_expr_post_aggr = | |
rebase_expr(having_expr, &aggr_projection_exprs, input)?; | |
check_columns_satisfy_exprs( | |
&column_exprs_post_aggr, | |
&[having_expr_post_aggr.clone()], | |
"HAVING clause references non-aggregate values", | |
)?; |
check_columns_satisfy_exprs
is also used to check the normal aggregation. I think the rule can't be shared.
@@ -214,7 +215,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
|
|||
let plan = if let Some(having_expr_post_aggr) = having_expr_post_aggr { | |||
LogicalPlanBuilder::from(plan) | |||
.filter(having_expr_post_aggr)? | |||
.having(having_expr_post_aggr)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's important to mention the filter is used for the HAVING
clause because the wildcard expansion behavior is different.
/// Apply a filter which is used for a having clause | ||
pub fn having(self, expr: impl Into<Expr>) -> Result<Self> { | ||
let expr = normalize_col(expr.into(), &self.plan)?; | ||
Filter::try_new_having(expr, Arc::new(self.plan)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: try_new_with_having (?
@@ -301,4 +302,32 @@ mod tests { | |||
|
|||
Ok(()) | |||
} | |||
|
|||
#[test] | |||
fn plan_having_wildcard_projection() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could test this kind of test in slt with explain
. I prefer to add rust test only for non-trivial cases
CI appears to be failing so marking this as draft while we work through those details |
datafusion/sql/src/select.rs
Outdated
@@ -749,11 +748,15 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
.map(|expr| rebase_expr(expr, &aggr_projection_exprs, input)) | |||
.collect::<Result<Vec<Expr>>>()?; | |||
|
|||
let wildcard_exprs = select_exprs_post_aggr.iter().filter(|expr| matches!(expr, Expr::Wildcard { .. })).collect::<Vec<_>>(); | |||
let wildcard_fields = exprlist_to_fields(wildcard_exprs, input)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that there are too many cases to handle, making it difficult to manage them cleanly. This could complicate the codebase. Currently, since we've expanded the wildcard for the schema, we can retrieve what we need based on the expanded schema.
This approach also allows us to provide correct and appropriate error messages for the user
datafusion/sql/src/select.rs
Outdated
@@ -766,9 +769,9 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
check_columns_satisfy_exprs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have check_columns_satisfy_exprs_for_having_clause
if the logic is different for projection and having
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the function can be shared. The difference is that the check of having
requires a return value to confirm that the validation has been executed and is valid.
datafusion/sql/src/utils.rs
Outdated
|
||
wildcard_fields.into_iter().try_for_each(|(table, field)| { | ||
let column_name = qualified_name(table, field.name()); | ||
if !column_names.iter().any(|c| c == &column_name) { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are two valid cases
In group by clause
group by a having a = 1
;
Part of an aggregate function
2. select max(a) from t having max(a) = 1
;
I think we just need to check group by expressions and both select + having expressions.
If we got wildcard, I think we could ignore it, it doesn't matter 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm ok. I added back the checking for the rules. I tried to add more comments to explain why we need those checks.
About the error message, we could enhance it in the future 🤔 .
datafusion/sql/src/utils.rs
Outdated
@@ -119,9 +121,34 @@ pub(crate) fn check_columns_satisfy_exprs( | |||
_ => check_column_satisfies_expr(columns, e, message_prefix)?, | |||
} | |||
} | |||
let column_names = columns | |||
.iter() | |||
.map(|c| format!("{}", c.schema_name())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eq to c.schem_name().to_string()
@goldmedal I didn't find any query that go through the additional check, given the current test cases you added they don't require the additional check too. I suggest we start with simple test case and gradually fix more query and error message. I don't know why can't I send the pull request to your repo, so I create another one in apache/datafusion #12106. This is just a demo for you, not for merging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @goldmedal
Thanks @jayzhan211 |
🎉 |
Which issue does this PR close?
Closes #12013 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?