Add hash-based SUM_ANSI aggregation for INT64 values #19403
base: branch-25.10
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test aa77170
The API looks great, but I am a bit surprised that this is targeted for 25.08. I thought it was not going to happen until 25.10. I'll need some time to run tests and benchmarks to see if it does what I want/need.
I don't see a corresponding reduction operations, but that is probably okay for now. I have implemented another work around for the time being which has okay performance. Because I have another workaround, I was not expecting this so quickly, it may take me a few days to get all of that done.
The branch-25.10 does not exist yet. Once it does, this will likely move there since burndown starts today. |
/**
 * @brief Derived class for specifying a sum_ansi aggregation
 */
class sum_ansi_aggregation final : public groupby_aggregation, public groupby_scan_aggregation {
Could we just add an `overflow_check` member to the existing `sum_aggregation` and default it to `false`?
I'm not sure ANSI communicates the behavior here very well.
We have a few examples of aggregations with member data, like quantile:
interpolation _interpolation; ///< Desired interpolation
Thanks for the suggestion. I'm not too familiar with the setup of the quantile aggregation, so I'll need to take a closer look. My main concern for now is that the `overflow_check` parameter is determined at runtime, which influences the output type and, consequently, how the aggregation is performed. Since we typically treat the aggregation kind as a template parameter and specialize for each case, introducing runtime branching could affect performance. Hash-based aggregation is quite register-sensitive, so we'll need to carefully evaluate the overhead of any runtime dispatching.
Thank you @PointKernel for looking into the parameter suggestion. If there are concrete benefits from introducing a new aggregation kind, then that would seem fine to me. We might want to call it "SUM_WITH_OVERFLOW" or something.
+1 for naming the aggregation kind something like "SUM_WITH_OVERFLOW" (or "SUM_WITH_OVERFLOW_CHECK" if we are feeling wordy 😄)
IMO, having this operation as a new aggregation kind makes sense if the overflow check is less likely to be turned on than off. Sum is a popular aggregation type, and if there's a performance impact from having the `overflow_check` parameter determined at runtime, then I'd support having a template specialization for this aggregation kind.
I was picturing that the overflow flag could be turned into a template variable internally, but that may be more complicated to fit into the current design. Also, the different output type is a good argument for a separate aggregation type.
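The idea of lifting a runtime flag into a template parameter can be sketched as follows. This is a minimal host-side illustration with hypothetical names, not libcudf's actual dispatch machinery: the runtime bool is branched on exactly once at the dispatch boundary, so the hot loop is compiled separately for each case with no per-element branch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: the overflow check is a compile-time bool, so the
// non-checking instantiation carries no overflow logic at all.
template <bool CheckOverflow>
int64_t sum_impl(std::vector<int64_t> const& values, bool& overflowed)
{
  int64_t sum = 0;
  for (auto v : values) {
    if constexpr (CheckOverflow) {
      // GCC/Clang builtin: wraps the result and reports overflow, without UB.
      if (__builtin_add_overflow(sum, v, &sum)) { overflowed = true; }
    } else {
      sum += v;  // no check compiled into this instantiation
    }
  }
  return sum;
}

// Single runtime branch at the dispatch boundary, not in the hot loop.
int64_t sum_dispatch(std::vector<int64_t> const& values, bool check_overflow, bool& overflowed)
{
  overflowed = false;
  return check_overflow ? sum_impl<true>(values, overflowed)
                        : sum_impl<false>(values, overflowed);
}
```

The cost is one extra instantiation per flag value, which is the binary-size/compile-time trade-off the thread is weighing against a separate aggregation kind.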
A few questions -
@@ -1348,15 +1371,15 @@ struct target_type_impl<Source,

constexpr bool is_sum_product_agg(aggregation::Kind k)
{
-  return (k == aggregation::SUM) || (k == aggregation::PRODUCT) ||
+  return (k == aggregation::SUM) || (k == aggregation::SUM_ANSI) || (k == aggregation::PRODUCT) ||
I'm wondering if we need to add the new aggregation kind to this function, since we are explicitly disregarding it in the `requires` clause of `target_type_impl`.
if (agg == cudf::aggregation::SUM_ANSI) {
  if (col.size() == 0) {
    // For empty columns, create empty struct column manually
    std::vector<std::unique_ptr<cudf::column>> children;
Nit: Can we move the children vector construction into a lambda?
// Create struct column with the children
// For SUM_ANSI, make struct nullable if input has nulls (same as other aggregations)
if (nullable) {
  // Start with ALL_NULL, results will be marked valid during aggregation
I might be missing something basic here, but why do we start with all null for the struct column and its children if the input column is nullable?
// Start with ALL_NULL, results will be marked valid during aggregation
auto null_mask  = cudf::create_null_mask(col.size(), cudf::mask_state::ALL_NULL, stream);
auto null_count = col.size();  // All null initially
return make_structs_column(
Another possibly naive question - can we use `create_structs_hierarchy` here instead of `make_structs_column`? We can save on null sanitization costs if possible.
{
  // For SUM_ANSI, target is a struct with sum value at child(0) and overflow flag at child(1)
  auto sum_column      = target.child(0);
  auto overflow_column = target.child(1);
Question: can we exit early if overflow value at target index is already true? If Spark is either erroring out or returning null on overflow, I'm wondering if we need to compute the entire aggregation result?
Sorry, I have to say that I don't like the idea of having a separate aggregation kind for this. Although I don't have a perfect solution for it, I feel that adding an optional parameter would be a better fit.
The checking for overflow uses UB. For an approach that doesn't do so, and gives you the same information with a branch-free kernel, please see (and possibly promote to a shared header somewhere) the saturating implementation in `src/rolling/detail/range_utils.cuh`.
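A UB-free check in that spirit can be sketched like this. This is an illustrative stand-in, not the actual `range_utils.cuh` code: the addition is performed in unsigned arithmetic, where wraparound is well defined, and the overflow bit is recovered from the sign bits.

```cpp
#include <cstdint>

struct add_result {
  int64_t value;
  bool overflow;
};

// Branch-free signed addition with overflow detection, with no signed-overflow
// UB. The unsigned-to-signed cast of an out-of-range value is well defined in
// C++20 (two's complement) and behaves that way on GCC/Clang in practice.
constexpr add_result checked_add(int64_t a, int64_t b)
{
  auto const ua  = static_cast<uint64_t>(a);
  auto const ub  = static_cast<uint64_t>(b);
  auto const us  = ua + ub;  // well-defined wraparound
  auto const sum = static_cast<int64_t>(us);
  // Overflow iff a and b have the same sign and the sum's sign differs;
  // adding values of opposite signs can never overflow.
  bool const overflow = ((a >= 0) == (b >= 0)) && ((sum >= 0) != (a >= 0));
  return {sum, overflow};
}
```

The comparisons compile to flag tests rather than divergent branches, which is what makes this shape attractive inside a kernel.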
@@ -94,6 +94,7 @@ class aggregation {
   */
  enum Kind {
    SUM,      ///< sum reduction
+   SUM_ANSI, ///< sum reduction with ANSI overflow semantics
nit: link to "ANSI overflow semantics". An external user of this API may well have no idea what this means.
// Check for overflow: if old_sum and source_value have same sign but result has different sign
auto const new_sum  = old_sum + source_value;
auto const overflow = ((old_sum > 0 && source_value > 0 && new_sum < 0) ||
                       (old_sum < 0 && source_value < 0 && new_sum > 0));
issue: signed integer overflow is UB, so the compiler has no obligation to generate code for these exceptional cases, and may (will, in some future version) optimise this to `auto const overflow = false;`.
9223372036854775800L,  // Close to INT64_MAX
9223372036854775800L,  // Close to INT64_MAX
1L,                    // Small value
400L};                 // Small value
Please add some test conditions that produce overflow in the "other" direction.
// SUM_ANSI returns a struct with sum (int64_t) and overflow (bool) children
std::vector<std::unique_ptr<cudf::column>> children;
children.push_back(make_empty_column(cudf::data_type{cudf::type_id::INT64}));
children.push_back(make_empty_column(cudf::data_type{cudf::type_id::BOOL8}));
return make_structs_column(0, std::move(children), 0, {}, stream, mr);
What input types are supported for `sum_ansi`? It seems only signed integers. However, I don't see any exception thrown if an unsupported input type is provided.
thrust::fill(rmm::exec_policy_nosync(stream),
             sum_col.begin<int64_t>(),
             sum_col.end<int64_t>(),
             int64_t{0});

thrust::fill(rmm::exec_policy_nosync(stream),
             overflow_col.begin<bool>(),
             overflow_col.end<bool>(),
             false);
question: do we need two kernels, or can we fill both columns using a zip iterator?
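The zip-iterator version the question suggests could look roughly like this (a hypothetical sketch against the variables in the snippet above, not tested code): a zip iterator's reference is a tuple of references, so a single `thrust::fill` with a tuple value writes both children in one kernel launch.

```cuda
#include <thrust/fill.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Assumes sum_col and overflow_col are the same mutable views as above,
// with sum_col.size() == overflow_col.size().
auto zipped = thrust::make_zip_iterator(
  thrust::make_tuple(sum_col.begin<int64_t>(), overflow_col.begin<bool>()));
thrust::fill(rmm::exec_policy_nosync(stream),
             zipped,
             zipped + sum_col.size(),
             thrust::make_tuple(int64_t{0}, false));
```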
This is another concern: adding overflow detection to one aggregation opens the door to doubling libcudf's aggregation types in order to support overflow (or other checks) for all of them.
Added a comment to the original issue about Spark possibly using the
Description
Contributes to #19243
This PR introduces a new aggregation kind, `SUM_ANSI`, which returns a `STRUCT` containing the sum and a boolean indicating overflow. If an overflow occurs, the corresponding row in the overflow boolean column will be set to `true`.
Checklist