Multiple 'group by's, one scan #15982

pepijnve · 2025-05-07T16:05:02Z

pepijnve
May 7, 2025

In the system I'm working on I want to perform multiple aggregates using different group by criteria over large data sets.
I don't think grouping sets are an option since those support computing a single set of aggregates over multiple groupings. What I'm trying to achieve instead is one multiple sets of aggregates that each have their own group by strategy.

A simple way to do this is to just run multiple queries of course. That works but requires scanning through the data multiple times. That becomes prohibitive pretty quickly as the number of sets of aggregates increases.
While I was experimenting with the multiple query approach and combining those into a single query using 'union all' I started wondering if I couldn't write an operator to have my cake and eat it. So rather than this:

select 1 as setid, k1 as groupkey1, count(1) as agg1, null as groupkey2, null as agg2 from table group by groupkey1
union all
select 2 as setid, null as groupkey1, null as agg1, k2 as groupkey2, sum(col1) as agg2 from table group by groupkey2

which results in a logical plan that sort of looks like this (edited for brevity/clarity)

Union
   Projection: 1 AS setid, k1 AS groupkey1, count(1) AS agg1, NULL AS groupkey2, NULL AS agg2
     Aggregate: groupBy=[k1], aggr=[[count(1)]] |
       TableScan: table
   Projection: 2 AS setid, NULL AS groupkey1, NULL AS agg1, k2 AS groupkey2, count(1) AS agg2
     Aggregate: groupBy=[k2], aggr=[[sum(col1)]]
       TableScan: table

what I would want to do instead is something like this

Unify
   Projection: 1 AS setid, k1 AS groupkey1, count(1) AS agg1, NULL AS groupkey2, NULL AS agg2
     Aggregate: groupBy=[k1], aggr=[[count(1)]] |
       CommonInputPlaceholder
   Projection: 2 AS setid, NULL AS groupkey1, NULL AS agg1, k2 AS groupkey2, count(1) AS agg2
     Aggregate: groupBy=[k2], aggr=[[sum(col1)]]
       CommonInputPlaceholder
   CommonInput
      TableScan: table

CommonInputPlaceholder is a stub node that has the same schema as the CommonInput child.
The Unify operator works by setting up queues for each CommonInputPlaceholder. It polls the CommonInput child, and places a duplicate of each record batch it receives onto each queue. This is kind of similar to how RepartitionExec does its thing but instead of assigning each record batch once, we duplicate and assign it multiple times.

With quite some trial and error I've been able to get something up and running, but I have a feeling I'm going against the grain of the framework. Getting the optimizer to do the right thing for instance proved to be a challenge since it expects plans to be trees rather than DAGs.

My question for the group is if someone else has tried to implement something like this before? Or if what I'm trying to accomplish can be done in some other way? Perhaps someone has advice on how to best go about implementing this?
I realize this colors outside the lines of what you can express in SQL (as far as I know at least). I'm creating my queries by directly instantiating logical plans so for now that's not an issue for the system I'm working on.

Edit: I accidentally ended up writing an example that can be done with grouping sets since the sets of aggregates were identical. Update example to use different aggregates.

pepijnve · 2025-05-13T16:30:46Z

pepijnve
May 13, 2025
Author

In the meantime I've found duckdb/duckdb#8445 which is similar to what we're trying to accomplish. In this DuckDB proposal the scan sharing is not explicit in the query plan. Instead the planner/optimizer would detect the duplicate execution plan portions and rearranges things accordingly.

0 replies

xudong963 · 2025-05-14T14:59:40Z

xudong963
May 14, 2025
Collaborator

This is a great feature, and I also believe it'll speed up tpch: https://github.com/dragansah/tpch-dbgen/blob/master/tpch-queries/21.sql, https://github.com/dragansah/tpch-dbgen/blob/master/tpch-queries/15.sql

0 replies

alamb · 2025-05-15T13:00:02Z

alamb
May 15, 2025
Collaborator

There is some additional discussion on a similar sounding feature here:

Avoid recompute CTEs (common table expressions) / share input plans #8777

Another potential approach is to fully materialize the input (INSERT INTO temp_file.parquet SELECT ...) or CREATE TABLE temp as SELECT ...

Basically the tricky bit with an approach to share the scan is that in the general case the CommonInput would have to buffer the entire input anyways (if the consumers consumed at different rates)

4 replies

pepijnve May 15, 2025
Author

Perhaps satisfying the non-general case would already be of value? For the diamond self-join example in the linked issue it doesn't make much sense indeed. Can you think of other examples besides joins where the same this would occur?

pepijnve May 15, 2025
Author

Just FYI, in the particular case I'm working on the problem I'm dealing with is that I want to compute a whole bunch of aggregates over a table with a cardinality in the billions order or magnitude. For n aggregations I'm trying to reduce the time spent on table scan from n * s to s because s is pretty significant.

alamb May 15, 2025
Collaborator

The other types of plans I have seen this cause problems is when the operators rely on sort order -- so like SortPreservingMerge or a group by where the data is partially sorted on some of the group keys

pepijnve May 15, 2025
Author

I read through the linked issues in the meantime. I think what we're trying to do is closest to the Splitter idea described in the linked document at #8558 (comment)

The part we've been struggling with a bit is how to fit this in the framework of DataFusion without breaking assumptions. While it's definitely already feasible to create diamond plans in code, my colleague and I got the impression that most code has been written under the assumption that nodes only have a single parent. One example that comes to mind was projection pushdown where each branch would push down only the columns it needed.

If you have any guidance/pointers on how to go about tackling this problem without going against the grain of the library, that would be much appreciated. Happy to contribute the work upstream if we can get beyond the point of a hack to something sufficiently general and usable for others.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Multiple 'group by's, one scan #15982

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 3 comments 4 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Multiple 'group by's, one scan #15982

Uh oh!

Uh oh!

pepijnve May 7, 2025

Replies: 3 comments · 4 replies

Uh oh!

pepijnve May 13, 2025 Author

Uh oh!

xudong963 May 14, 2025 Collaborator

Uh oh!

alamb May 15, 2025 Collaborator

Uh oh!

pepijnve May 15, 2025 Author

Uh oh!

pepijnve May 15, 2025 Author

Uh oh!

alamb May 15, 2025 Collaborator

Uh oh!

pepijnve May 15, 2025 Author

pepijnve
May 7, 2025

Replies: 3 comments 4 replies

pepijnve
May 13, 2025
Author

xudong963
May 14, 2025
Collaborator

alamb
May 15, 2025
Collaborator

pepijnve May 15, 2025
Author

pepijnve May 15, 2025
Author

alamb May 15, 2025
Collaborator

pepijnve May 15, 2025
Author