Skip to content

feat: optimize and unparse grouping #16161

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

chenkovsky
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

first, it seems that grouping udaf document is not correct.

and for aggregation with grouping,

e.g.

CREATE TABLE test (c1 VARCHAR,c2 VARCHAR,c3 INT) as values
('a','A',1), ('b','B',2);

EXPLAIN FORMAT INDENT select
  c1,
  c2,
  CASE WHEN grouping(c1) = 1 THEN sum(c3) ELSE NULL END as gx,
  grouping(c1) as g0,
  grouping(c2) as g1,
  grouping(c1, c2) as g2,
  grouping(c2, c1) as g3
from
  test
group by
  grouping sets (
    (c1, c2),
    (c1),
    (c2),
    ()
  );

current logical plan is:

| logical_plan  | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3                                                                                                              |
|               |   Projection: test.c1, test.c2, CAST(__common_expr_1 AS Int32) AS grouping(test.c1), sum(test.c3), CAST(__common_expr_2 AS Int32) AS grouping(test.c2), CAST(__grouping_id AS Int32) AS grouping(test.c1,test.c2), CAST(__common_expr_1 | __common_expr_2 << UInt8(1) AS Int32) AS grouping(test.c2,test.c1)                                       |
|               |     Projection: __grouping_id & UInt8(2) >> UInt8(1) AS __common_expr_1, __grouping_id & UInt8(1) AS __common_expr_2, test.c1, test.c2, __grouping_id, sum(test.c3)                                                                                                                                                                                |
|               |       Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]]                                                                                                                                                                                                                    |
|               |         TableScan: test projection=[c1, c2, c3]  

the problems are:

  1. there are three projections. for bitwise operation, there's no benifit for extra projection.

  2. it makes grouping level optimization very hard. for example,

     select  case when grouping(c1) = 1 then sum(c2) else null end from test group by grouping sets (...)

    we only need to calculate sum(c2) in when grouping(c1) = 1, this is a very useful optimization trick. I'm also using this trick to optimizing sql with multiple count distinct.

  3. unparsing is not supported. Internal("Tried to unproject column referring to internal grouping id") will be thrown.

What changes are included in this PR?

  1. create a grouping udf.
  2. modify analyzer, replace grouping_udaf with grouping udf
  3. change it back, when unparsing.

now, the logical plan is:

| logical_plan  | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3                                                                             |
|               |   Projection: test.c1, test.c2, grouping(__grouping_id, List([1])) AS grouping(test.c1), sum(test.c3), grouping(__grouping_id, List([0])) AS grouping(test.c2), grouping(__grouping_id) AS grouping(test.c1,test.c2), grouping(__grouping_id, List([0, 1])) AS grouping(test.c2,test.c1)                          |
|               |     Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]]                                                                                                                                                                                     |
|               |       TableScan: test projection=[c1, c2, c3]                                                                                                                                                                             

there are only two projections. and unparsing is supported.

Are these changes tested?

UT

Are there any user-facing changes?

No

@github-actions github-actions bot added sql SQL Planner optimizer Optimizer rules functions Changes to functions implementation labels May 23, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 23, 2025
@alamb
Copy link
Contributor

alamb commented May 24, 2025

Thanks @chenkovsky -- can. you find the original PR that added this GROUPING function and perhaps @ mention the author to see if they have any feedback / could help with review?

@chenkovsky
Copy link
Contributor Author

chenkovsky commented May 24, 2025

@eejbyfeldt @comphead could you please help me review this PR?

@chenkovsky
Copy link
Contributor Author

it's related to #12704

@eejbyfeldt
Copy link
Contributor

I did not look closely at yet since I have not really contributed here in months.

there are three projections. for bitwise operation, there's no benifit for extra projection.

This extra projection is introduced by CommonSubexprEliminate. I understand that the changes make it so that they do not interact in the presented query, but is this really a general fix of that behavior?

But the goal of leaving more structure in logical plan after resolving the grouping expr so that optimization and unparsing sounds reasonable to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation functions Changes to functions implementation optimizer Optimizer rules sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants