Skip to content

Enhance ScalarUDFImpl Equality Handling with Pointer-Based Default and Customizable Logic #16681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 38 commits into from

Conversation

kosiew
Copy link
Contributor

@kosiew kosiew commented Jul 4, 2025

Which issue does this PR close?

Rationale for this change

The previous behavior of ScalarUDFImpl::equals relied on comparing only type IDs, names, and signatures, which could lead to incorrect assumptions about equality—especially for UDFs with internal state.

This change introduces a pointer-based default equality implementation, ensuring distinct instances are not erroneously considered equal. It also encourages users to override equals for more precise behavior when necessary.

What changes are included in this PR?

  • Updated the default implementation of ScalarUDFImpl::equals to check for pointer equality only.
  • Added logic in ScalarUDF's PartialEq to check for Arc::ptr_eq before delegating to equals.
  • Introduced a new test suite in expr/src/test/udf_equals.rs covering various equality scenarios.
  • Added a unit test test_scalar_udf_pointer_equality in projection_pushdown.rs.
  • Documented the change and its implications in upgrading.md, including migration guidance and example implementation.

Are these changes tested?

Yes ✅
Extensive unit tests are provided to validate:

  • Different and same-instance comparisons.
  • Type-based mismatches.
  • Custom UDFs implementing their own equality logic.

Are there any user-facing changes?

Yes ✅

  • Custom UDF authors must now override equals and hash_value if they expect equivalence behavior beyond pointer equality.
  • The change is backward-compatible for users who do not rely on ScalarUDF equality.
  • Detailed upgrade notes are provided in the documentation.

@github-actions github-actions bot added documentation Improvements or additions to documentation logical-expr Logical plan and expressions labels Jul 4, 2025
}
fn equals(&self, other: &dyn ScalarUDFImpl) -> bool {
if let Some(other) = other.as_any().downcast_ref::<ParamUdf>() {
self.param == other.param && self.type_id() == other.type_id()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional not to compare signature? if ues, let's add a code comment
if no, the ParamUdf could use derive(PartialEq) and then self == other here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.
Amended to implement
derive(PartialEq) and then self == other

hash::{Hash, Hasher},
};
#[derive(Debug)]
#[allow(dead_code)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

// if the types, names and signatures are the same, we can't know if they are the same so we
// assume they are not.
// If a UDF has internal state that should be compared, it should implement this method
false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do bit-by-bit comparison of the self & other, if they are the same type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amended comment to clarify why we are not doing
bit-by-bit comparison

        // Alternative approach: we could potentially do bit-by-bit comparison if both objects
        // are the same concrete type, but this requires:
        // 1. Both objects to have identical TypeId
        // 2. Careful handling of potential padding bytes in structs
        // 3. The concrete type to be safely comparable via memcmp
        // For now, we use the conservative approach of returning false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// 1. Both objects to have identical TypeId

we know typeid via as_any(), right?

// 2. Careful handling of potential padding bytes in structs

i don't know if rust does something (zeros) with padding bytes

// 3. The concrete type to be safely comparable via memcmp

rust struct is generally moveable around. i think the move semantics are generally about memcpy-ing bits to a new location, so memcmp-ing bits should be fine.

@@ -62,6 +62,36 @@ DataFusionError::SchemaError(

[#16652]: https://github.com/apache/datafusion/issues/16652

#### `ScalarUDFImpl::equals` Default Implementation

The default implementation of the `equals` method in the `ScalarUDFImpl` trait has been updated. Previously, it compared only the type IDs, names, and signatures of UDFs. Now, it assumes UDFs are not equal unless their pointers are the same.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is safe default. Executing this upgrade suggestion is necessary for common expression elimination to work.
Is it worth considering doing the change more as a breaking change and require explicit upgrade step?

It's more work for sure, but also -- the result is much more predictible.

For example, we could require ScalarUDFImpl to implement PartialEq and delegate to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a thoughtful consideration. You raise valid points about predictability vs. convenience. Let me outline the tradeoffs:

Current approach (safe default with pointer comparison):

✅ Prevents silent bugs where ScalarUDFs with different internal state are incorrectly considered equal
✅ Forces developers to think about equality semantics for their specific UDFs
✅ Common expression elimination works correctly out of the box
❌ Requires manual action from users who want structural equality

Alternative approach (require explicit PartialEq implementation):

✅ More explicit and predictable - no hidden behavior changes
✅ Leverages Rust's type system for compile-time guarantees
✅ Clear contract: if you want equality comparison, implement PartialEq
❌ Breaking change requiring immediate action from all ScalarUDF implementors
❌ More work for simple ScalarUDFs that don't need custom equality logic

My recommendation: Stick with the current approach for DataFusion 49.0.0, but consider the PartialEq approach for a future major version. Here's why:

The current change already enables critical functionality (common expression elimination) without breaking compilation
The upgrade path is clear and documented with examples
For DataFusion 50.0.0 or 51.0.0, we could introduce a more explicit API that requires PartialEq implementation
This gives users time to adapt while ensuring the optimizer works correctly immediately.

What do you think about this phased approach?

/// }
/// fn equals(&self, other: &dyn ScalarUDFImpl) -> bool {
/// if let Some(other) = other.as_any().downcast_ref::<Self>() {
/// self.param == other.param && self.name() == other.name()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not include signature in the equality check?
Worth code comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amended to use
derive PartialEq

@@ -16,3 +16,5 @@
// under the License.

pub mod function_stub;
#[cfg(test)]
pub mod udf_equals;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't udf_equals.rs simply go into tests/ folder (datafusion/expr/tests)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.
Moved.

@github-actions github-actions bot added the core Core DataFusion crate label Jul 4, 2025
@kosiew kosiew changed the title Improve ScalarUDFImpl::equals Default Behavior and Add Unit Tests for Custom Equality Enhance ScalarUDFImpl Equality Handling with Pointer-Based Default and Customizable Logic Jul 4, 2025
@kosiew kosiew requested a review from findepi July 12, 2025 12:12
// if the types, names and signatures are the same, we can't know if they are the same so we
// assume they are not.
// If a UDF has internal state that should be compared, it should implement this method
false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// 1. Both objects to have identical TypeId

we know typeid via as_any(), right?

// 2. Careful handling of potential padding bytes in structs

i don't know if rust does something (zeros) with padding bytes

// 3. The concrete type to be safely comparable via memcmp

rust struct is generally moveable around. i think the move semantics are generally about memcpy-ing bits to a new location, so memcmp-ing bits should be fine.

fn equals(&self, other: &dyn ScalarUDFImpl) -> bool {
self.name() == other.name() && self.signature() == other.signature()
// if the pointers are the same, the UDFs are the same
if std::ptr::eq(self.as_any(), other.as_any()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be std::ptr::eq(self, other) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we cannot safely use std::ptr::eq(self, other) here because self and other are trait objects (&dyn ScalarUDFImpl).
std::ptr::eq(self, other) would compare the vtable pointers, not the underlying data pointers, and may return false even if the trait objects point to the same data.

@findepi
Copy link
Member

findepi commented Jul 15, 2025

"99%" of the cases are stateless1. Would it be possible to provide default implementation that works great in the "99%" common case, but still works correctly in other cases?

if that's not possible, and so the default behavior is inferior for the common case, it needs to be reimplemented in every function. Which means it's not really an optional method and so we should remove its default implementation.

cc @alamb

Footnotes

  1. the only field in most UDFs is Signature. It's there for convenience. Could be as well implemented as a static

@alamb
Copy link
Contributor

alamb commented Jul 17, 2025

I think this PR is an alternative to the proposed fix in

Is that right? Are you happy with the code in #16781 @kosiew ?

@findepi
Copy link
Member

findepi commented Jul 18, 2025

@alamb #16781 aims to solve immediate obvious problem - incorrect implementation of equality for stateful functions managed in this repo. That PR does not solve #16677.

This PR aims to solve the problem of equality being error-prone, wrong by default. I.e. addresses #16677.
The #16781 PR shows how wide-spread the problem is, giving more motivation to fix the default behavior.

My proposal is

  1. land Implement equals for stateful functions #16781 (or equivalent), i.e. fix the immediate problem
  2. remove default implementation of equality -- this is a breaking change

@alamb
Copy link
Contributor

alamb commented Jul 18, 2025

remove default implementation of equality -- this is a breaking change

As long as we sufficiently document how to fix it (with an example in the upgrade guide) I I think that makes sense to me

Probably in DataFusion 50

@findepi
Copy link
Member

findepi commented Jul 19, 2025

Posted a solution proposal -- #16677 (comment)

@kosiew
Copy link
Contributor Author

kosiew commented Jul 25, 2025

Closing this in favour of #16677 (comment)

@kosiew kosiew closed this Jul 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate documentation Improvements or additions to documentation logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ScalarUDFImpl::equals default implementation is error-prone
3 participants