[SPARK-53848] Add ability to support Alpha family in Theta Aggregates #52551

karuppayya · 2025-10-08T21:50:19Z

What changes were proposed in this pull request?

Adds the ability to use the ALPHA family for Theta sketches.

Why are the changes needed?

The Theta sketch aggregate currently supports only the QUICKSELECT family.
Consumers like Iceberg would benefit from the sketch aggregate if it had the ability to use the ALPHA family:
The Iceberg specification calls for ALPHA sketches.
Iceberg has a custom implementation of Theta sketch aggregates that can be replaced with Spark's Theta aggregates.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

karuppayya · 2025-10-10T14:23:14Z

@cboumalh @mkaravel @dtenedor @cloud-fan Can you help review?
cc: @aokolnychyi @huaxingao

cboumalh · 2025-10-10T16:39:01Z

Hi @karuppayya. Thanks for the effort here. Is the goal to deprecate the aggregation implementation you linked above and call the one from here directly in the Iceberg code?

karuppayya · 2025-10-10T16:43:33Z

Yes, that's right. The goal is to deprecate the custom aggregation implementation (which was written because Spark didn't have this ability at the time) and have Iceberg call Spark's Theta aggregate/estimate functions directly.
I used Iceberg as a concrete example which could use this functionality when it becomes available, but this change will benefit consumers/users in general.

cboumalh · 2025-10-10T16:56:23Z

I see, that makes sense. Just want to point out that if Iceberg's only need is an approximate count, Spark's HLL can achieve it at similar speeds (though not as fast) with a much smaller memory footprint. How would this work for your case?

karuppayya · 2025-10-10T17:10:53Z

The selection of Alpha family sketch comes from the Iceberg Specification for NDV stats.
Changing this would break the interoperability guarantee that Iceberg provides across engines like Spark, Trino, and Flink.
I took Iceberg as an example, but having the flexibility to choose the sketch family will benefit users in general.

cboumalh · 2025-10-10T17:19:19Z

Sounds good, thank you. And yes, for superior update speeds, the Alpha family is beneficial to users. I will review on my end, but I don't have any write permissions. Thanks again for the info!

common/utils/src/main/resources/error/error-conditions.json

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ThetaSketchUtils.scala

...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala

cboumalh · 2025-10-10T20:50:14Z

sql/api/src/main/scala/org/apache/spark/sql/functions.scala

+   * @since 4.1.0
+   */
+  def theta_sketch_agg(e: Column, family: String): Column =
+    theta_sketch_agg(e, 12, family)


The hardcoded 12 matches the Catalyst default, so behaviorally it's fine. Still, duplicating an internal default in the public layer is a bit awkward. We could consider making this explicit (functions above preferred) or defining a local constant for clarity.

Yeah, I gave this some thought before adding it here. Ideally, the decision to default lgNomEntries to 12 should live in ThetaSketchAgg.
That is, if I pass only the column and family, lgNomEntries would be defaulted inside ThetaSketchAgg.

However, FunctionRegistry selects the constructor based on the number of expressions, so I cannot add a second constructor with the same signature as the existing one.
So I decided to default it in functions.scala. I also didn't see a similar pattern in any other function, so I am not very sure about this either.
Referencing a local constant in this class also seemed a bit odd, since it would be very specific.
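To make the constraint above concrete, here is a minimal sketch of a registry that picks a builder purely by argument count. `ArityRegistry` and `try_register_family_only` are hypothetical names used only for illustration; this is not Spark's FunctionRegistry, just a simplified model of arity-keyed lookup using the defaults mentioned in this thread (12 and QUICKSELECT).

```python
from typing import Callable, Dict, Tuple


class ArityRegistry:
    """Hypothetical, simplified registry keyed by (function name, argument count)."""

    def __init__(self) -> None:
        self._builders: Dict[Tuple[str, int], Callable[..., tuple]] = {}

    def register(self, name: str, arity: int, builder: Callable[..., tuple]) -> None:
        key = (name, arity)
        if key in self._builders:
            # Two builders with the same arity cannot be told apart by count alone.
            raise ValueError(f"{name} already has a {arity}-argument builder")
        self._builders[key] = builder

    def build(self, name: str, *args):
        # Lookup uses only the number of arguments, never their types.
        return self._builders[(name, len(args))](*args)


registry = ArityRegistry()
registry.register("theta_sketch_agg", 1, lambda col: (col, 12, "QUICKSELECT"))
registry.register("theta_sketch_agg", 2, lambda col, lg: (col, lg, "QUICKSELECT"))
registry.register("theta_sketch_agg", 3, lambda col, lg, family: (col, lg, family))


def try_register_family_only() -> bool:
    """Attempt to add a (col, family) builder; it clashes with (col, lgNomEntries)."""
    try:
        registry.register("theta_sketch_agg", 2, lambda col, family: (col, 12, family))
        return True
    except ValueError:
        return False
```

In this model a two-argument (column, family) form has nowhere to go, which is why the default ends up being applied in a different layer.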

I'd recommend we remove these two overloads, theta_sketch_agg(columnName: String, family: String) and theta_sketch_agg(e: Column, family: String), completely. If we later add another argument of String type, we'll have function overloading ambiguity. Therefore, if a user wants to use a specific family, they must also pass lgNomEntries explicitly. This keeps the file cleaner and avoids magic numbers. It also means we need to fix both builtin.py files to avoid the same problem. I'm open to hearing what others have to say about this too.

cboumalh · 2025-10-15T00:32:59Z

python/pyspark/sql/connect/functions/builtin.py

 def theta_sketch_agg(
    col: "ColumnOrName",
    lgNomEntries: Optional[Union[int, Column]] = None,
+    family: Optional[str] = None,


For the sake of consistency, we can consider changing this function to look like this:

def theta_sketch_agg(
    col: "ColumnOrName",
    lgNomEntries: Optional[Union[int, Column]] = None,
    family: Optional["ColumnOrName"] = None,
) -> Column:
    fn = "theta_sketch_agg"
    _lgNomEntries = lit(12) if lgNomEntries is None else lit(lgNomEntries)
    _family = lit("QUICKSELECT") if family is None else _to_col(family)
    return _invoke_function_over_columns(fn, col, _lgNomEntries, _family)

Similarly in the other builtin.py

cboumalh · 2025-10-15T00:59:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ThetaSketchUtils.scala

+  /*
+   * QUICKSELECT is optimized for speed and is the default choice for most use cases,
+   * providing faster updates and queries with slightly higher error rates. ALPHA offers
+   * better accuracy with slightly higher resource consumption, making it suitable when
+   * precision is more important than performance. The choice primarily affects the speed
+   * vs accuracy trade-off.


nit, but this is not entirely true. consider this:

  /*
   * ALPHA is optimized for speed and offers slightly better initial accuracy
   * (lower error) for simple updates. Its estimation
   * precision reverts to the standard level if merged with other sketches.
   * QUICKSELECT is the default and more flexible choice, providing the standard
   * level of accuracy and full support for all set operations (Union, Intersection, etc.).
   */
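For readers unfamiliar with the estimator both families build on, the following is a rough, self-contained sketch of the general theta-sketch idea: keep only hashes below a threshold theta and scale the retained count by 1/theta. `TinyThetaSketch` is a hypothetical toy class, not the DataSketches implementation, and it does not implement the ALPHA update rule; it only shows how `lgNomEntries` bounds the number of retained hashes and therefore the accuracy.

```python
import hashlib


class TinyThetaSketch:
    """Toy theta-style distinct counter (illustrative only, not DataSketches)."""

    def __init__(self, lg_nom_entries: int = 12) -> None:
        self.k = 1 << lg_nom_entries  # nominal number of retained hashes
        self.theta = 1.0              # sampling threshold in (0, 1]
        self.hashes = set()           # retained hash values, all below theta

    @staticmethod
    def _hash01(item) -> float:
        # Map the item to a pseudo-uniform value in [0, 1) using 64 bits of SHA-256.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def update(self, item) -> None:
        h = self._hash01(item)
        if h < self.theta:
            self.hashes.add(h)  # duplicates hash identically, so the set dedups them
            if len(self.hashes) > self.k:
                self._shrink()

    def _shrink(self) -> None:
        # Lower theta to the largest retained hash and keep the k hashes below it.
        self.theta = sorted(self.hashes)[self.k]
        self.hashes = {h for h in self.hashes if h < self.theta}

    def estimate(self) -> float:
        # Each retained hash represents roughly 1/theta distinct items.
        return len(self.hashes) / self.theta
```

While fewer than `k` distinct items have been seen, theta stays at 1.0 and the estimate is exact; beyond that, the error shrinks as `k` grows.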

cboumalh · 2025-10-15T01:07:56Z

...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala

+
+  override protected def withNewChildrenInternal(newFirst: Expression, newSecond: Expression,
+     newThird: Expression): Expression = copy(newFirst, newSecond,
+    newThird, mutableAggBufferOffset, inputAggBufferOffset)


Super nit: the withNewChildrenInternal override doesn't need to pass mutableAggBufferOffset or inputAggBufferOffset explicitly; doing so is redundant. The return type should also be ThetaSketchAgg. Lastly, consider moving it up to line 146 to group it with the rest.

can have something like this:

  override protected def withNewChildrenInternal(
      newFirst: Expression,
      newSecond: Expression,
      newThird: Expression): ThetaSketchAgg =
    copy(first = newFirst, second = newSecond, third = newThird)
}
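The redundancy point rests on how a Scala case class `copy` with named arguments leaves every unnamed field unchanged. As a loose Python analogy (not Spark code), `dataclasses.replace` behaves the same way; `ThetaSketchAggModel` and its field names below are hypothetical stand-ins for the real expression class.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ThetaSketchAggModel:
    """Hypothetical stand-in for an aggregate with three children and two offsets."""

    first: str
    second: str
    third: str
    mutable_agg_buffer_offset: int = 0
    input_agg_buffer_offset: int = 0

    def with_new_children(self, new_first: str, new_second: str, new_third: str):
        # Only the children are named; both offsets carry over untouched.
        return replace(self, first=new_first, second=new_second, third=new_third)


agg = ThetaSketchAggModel("col", "12", "ALPHA", 3, 5)
updated = agg.with_new_children("col2", "14", "QUICKSELECT")
```

Here `updated` keeps the offsets 3 and 5 from `agg` even though they were never passed, which mirrors why the explicit offset arguments can be dropped.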

cboumalh · 2025-10-15T01:10:01Z

Just a few comments left! Thanks @karuppayya!

github-actions bot added SQL PYTHON labels Oct 8, 2025

karuppayya force-pushed the SPARK-53848 branch from d59f6df to c1d4fe8 Compare October 8, 2025 23:29

Add ability to support Aplpha sketches

13aa686

karuppayya force-pushed the SPARK-53848 branch from c1d4fe8 to 13aa686 Compare October 9, 2025 00:17

github-actions bot added the CONNECT label Oct 9, 2025

Fix test failures

47ef17e

karuppayya force-pushed the SPARK-53848 branch from f2f3cb6 to 47ef17e Compare October 10, 2025 01:32

cboumalh reviewed Oct 10, 2025


Address rview comments

9b7654a

karuppayya force-pushed the SPARK-53848 branch from 6894ab8 to 9b7654a Compare October 13, 2025 06:49

Address review comments

68feaf4

karuppayya force-pushed the SPARK-53848 branch from 37029cf to 68feaf4 Compare October 14, 2025 19:48

karuppayya requested a review from cboumalh October 14, 2025 19:48

cboumalh reviewed Oct 15, 2025

