-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Enhance ScalarUDFImpl
Equality Handling with Pointer-Based Default and Customizable Logic
#16681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… comparison logic
…quality rules and provide implementation examples
} | ||
fn equals(&self, other: &dyn ScalarUDFImpl) -> bool { | ||
if let Some(other) = other.as_any().downcast_ref::<ParamUdf>() { | ||
self.param == other.param && self.type_id() == other.type_id() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intentional not to compare signature? if ues, let's add a code comment
if no, the ParamUdf could use derive(PartialEq) and then self == other
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
Amended to implement
derive(PartialEq) and then self == other
hash::{Hash, Hasher}, | ||
}; | ||
#[derive(Debug)] | ||
#[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
// if the types, names and signatures are the same, we can't know if they are the same so we | ||
// assume they are not. | ||
// If a UDF has internal state that should be compared, it should implement this method | ||
false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do bit-by-bit comparison of the self & other, if they are the same type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amended comment to clarify why we are not doing
bit-by-bit comparison
// Alternative approach: we could potentially do bit-by-bit comparison if both objects
// are the same concrete type, but this requires:
// 1. Both objects to have identical TypeId
// 2. Careful handling of potential padding bytes in structs
// 3. The concrete type to be safely comparable via memcmp
// For now, we use the conservative approach of returning false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// 1. Both objects to have identical TypeId
we know typeid via as_any(), right?
// 2. Careful handling of potential padding bytes in structs
i don't know if rust does something (zeros) with padding bytes
// 3. The concrete type to be safely comparable via memcmp
rust struct is generally moveable around. i think the move semantics are generally about memcpy-ing bits to a new location, so memcmp-ing bits should be fine.
@@ -62,6 +62,36 @@ DataFusionError::SchemaError( | |||
|
|||
[#16652]: https://github.com/apache/datafusion/issues/16652 | |||
|
|||
#### `ScalarUDFImpl::equals` Default Implementation | |||
|
|||
The default implementation of the `equals` method in the `ScalarUDFImpl` trait has been updated. Previously, it compared only the type IDs, names, and signatures of UDFs. Now, it assumes UDFs are not equal unless their pointers are the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is safe default. Executing this upgrade suggestion is necessary for common expression elimination to work.
Is it worth considering doing the change more as a breaking change and require explicit upgrade step?
It's more work for sure, but also -- the result is much more predictible.
For example, we could require ScalarUDFImpl to implement PartialEq and delegate to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a thoughtful consideration. You raise valid points about predictability vs. convenience. Let me outline the tradeoffs:
Current approach (safe default with pointer comparison):
✅ Prevents silent bugs where ScalarUDFs with different internal state are incorrectly considered equal
✅ Forces developers to think about equality semantics for their specific UDFs
✅ Common expression elimination works correctly out of the box
❌ Requires manual action from users who want structural equality
Alternative approach (require explicit PartialEq implementation):
✅ More explicit and predictable - no hidden behavior changes
✅ Leverages Rust's type system for compile-time guarantees
✅ Clear contract: if you want equality comparison, implement PartialEq
❌ Breaking change requiring immediate action from all ScalarUDF implementors
❌ More work for simple ScalarUDFs that don't need custom equality logic
My recommendation: Stick with the current approach for DataFusion 49.0.0, but consider the PartialEq approach for a future major version. Here's why:
The current change already enables critical functionality (common expression elimination) without breaking compilation
The upgrade path is clear and documented with examples
For DataFusion 50.0.0 or 51.0.0, we could introduce a more explicit API that requires PartialEq implementation
This gives users time to adapt while ensuring the optimizer works correctly immediately.
What do you think about this phased approach?
datafusion/expr/src/udf.rs
Outdated
/// } | ||
/// fn equals(&self, other: &dyn ScalarUDFImpl) -> bool { | ||
/// if let Some(other) = other.as_any().downcast_ref::<Self>() { | ||
/// self.param == other.param && self.name() == other.name() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not include signature in the equality check?
Worth code comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended to use
derive PartialEq
datafusion/expr/src/test/mod.rs
Outdated
@@ -16,3 +16,5 @@ | |||
// under the License. | |||
|
|||
pub mod function_stub; | |||
#[cfg(test)] | |||
pub mod udf_equals; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't udf_equals.rs simply go into tests/ folder (datafusion/expr/tests
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
Moved.
ScalarUDFImpl::equals
Default Behavior and Add Unit Tests for Custom EqualityScalarUDFImpl
Equality Handling with Pointer-Based Default and Customizable Logic
…ons for improved clarity and correctness
…comparison approach details
Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
…ck in equals method
// if the types, names and signatures are the same, we can't know if they are the same so we | ||
// assume they are not. | ||
// If a UDF has internal state that should be compared, it should implement this method | ||
false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// 1. Both objects to have identical TypeId
we know typeid via as_any(), right?
// 2. Careful handling of potential padding bytes in structs
i don't know if rust does something (zeros) with padding bytes
// 3. The concrete type to be safely comparable via memcmp
rust struct is generally moveable around. i think the move semantics are generally about memcpy-ing bits to a new location, so memcmp-ing bits should be fine.
datafusion/expr/src/udf.rs
Outdated
fn equals(&self, other: &dyn ScalarUDFImpl) -> bool { | ||
self.name() == other.name() && self.signature() == other.signature() | ||
// if the pointers are the same, the UDFs are the same | ||
if std::ptr::eq(self.as_any(), other.as_any()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be std::ptr::eq(self, other)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we cannot safely use std::ptr::eq(self, other) here because self and other are trait objects (&dyn ScalarUDFImpl).
std::ptr::eq(self, other) would compare the vtable pointers, not the underlying data pointers, and may return false even if the trait objects point to the same data.
"99%" of the cases are stateless1. Would it be possible to provide default implementation that works great in the "99%" common case, but still works correctly in other cases? if that's not possible, and so the default behavior is inferior for the common case, it needs to be reimplemented in every function. Which means it's not really an optional method and so we should remove its default implementation. cc @alamb Footnotes
|
…ety and handle padding considerations
@alamb #16781 aims to solve immediate obvious problem - incorrect implementation of equality for stateful functions managed in this repo. That PR does not solve #16677. This PR aims to solve the problem of equality being error-prone, wrong by default. I.e. addresses #16677. My proposal is
|
As long as we sufficiently document how to fix it (with an example in the upgrade guide) I I think that makes sense to me Probably in DataFusion 50 |
Posted a solution proposal -- #16677 (comment) |
Closing this in favour of #16677 (comment) |
Which issue does this PR close?
Rationale for this change
The previous behavior of
ScalarUDFImpl::equals
relied on comparing only type IDs, names, and signatures, which could lead to incorrect assumptions about equality—especially for UDFs with internal state.This change introduces a pointer-based default equality implementation, ensuring distinct instances are not erroneously considered equal. It also encourages users to override
equals
for more precise behavior when necessary.What changes are included in this PR?
ScalarUDFImpl::equals
to check for pointer equality only.ScalarUDF
'sPartialEq
to check forArc::ptr_eq
before delegating toequals
.expr/src/test/udf_equals.rs
covering various equality scenarios.test_scalar_udf_pointer_equality
inprojection_pushdown.rs
.upgrading.md
, including migration guidance and example implementation.Are these changes tested?
Yes ✅
Extensive unit tests are provided to validate:
Are there any user-facing changes?
Yes ✅
equals
andhash_value
if they expect equivalence behavior beyond pointer equality.ScalarUDF
equality.