[datafusion-spark] Implement factorial function #16125
Conversation
Signed-off-by: Tai Le Manh <tailm2@vingroup.net>
@tlm365 - I wonder if you would be willing to help pick this PR back up now that we have merged a PR with a bunch of tests from @shehabgamin here: We are hoping to generate some example PRs to use when starting to fill out the rest of the functions, and I was thinking this one might be a good one.
@alamb Sure! I'd be happy to help pick this PR back up. Thanks for pointing me to the recent test additions; that will definitely help.
Thank you @tlm365 🙏
Can you please update the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/spark/math/factorial.slt to run? (They should work great after this PR.)
Thank you!
    &self.aliases
}

fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
Do we need to implement coerce_types manually? I think the signature should handle this.
Ah, we actually don't need it after all. I've updated it and added the unit tests as well. TYSM!
FYI @shehabgamin
Thank you @tlm365 -- this looks great to me
What do you think @shehabgamin -- anything else that this PR should have? (I am hoping to use it as an example when porting other functions.)
Int64 => {
    let array = as_int64_array(array)?;

    let result: Int64Array = array.iter().map(compute_factorial).collect();
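For context, the compute_factorial helper mapped over the array above is not shown in this snippet. A minimal sketch of what it presumably looks like, assuming Spark's semantics of returning NULL for negative inputs and for values greater than 20 (21! no longer fits in i64):

// Hypothetical sketch only -- not necessarily the PR's actual implementation.
// Spark's factorial returns NULL outside the range 0..=20 because 21! does
// not fit in a signed 64-bit integer.
fn compute_factorial(n: Option<i64>) -> Option<i64> {
    match n {
        Some(n) if (0..=20).contains(&n) => Some((1..=n).product()),
        _ => None,
    }
}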
Minor nit is that you can potentially use https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary to generate faster code but I don't think that is required
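As a rough illustration of that suggestion (a sketch, not the PR's code): PrimitiveArray::unary applies a closure directly over the values buffer and reuses the input's null bitmap, which usually optimizes better than iter().map().collect(). The caveat is that unary cannot introduce new nulls, so Spark's NULL-on-out-of-range behavior would still need separate handling.

use arrow::array::Int64Array;

// Sketch under the assumption that out-of-range inputs are handled elsewhere:
// `unary` keeps the input's null bitmap and maps each value in place, but it
// cannot turn an out-of-range value (n < 0 or n > 20) into NULL.
fn factorial_via_unary(array: &Int64Array) -> Int64Array {
    array.unary(|n: i64| (1..=n).product::<i64>())
}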
@alamb @tlm365 I will review this in the next couple of hours!
#query
#SELECT factorial(5::int);
query I
SELECT factorial(5::INT);
Perfect!
For anyone wondering, this test comes from here:
https://spark.apache.org/docs/latest/api/sql/#factorial
And the url above comes from the testing guide:
https://github.com/apache/datafusion/blob/85eebcd25dfbe8e2d2d75d85b8683de8be4851e8/datafusion/sqllogictest/test_files/spark/README.md#finding-test-cases
query I
SELECT factorial(a) from VALUES (0::INT), (20::INT), (21::INT), (NULL) AS t(a);
Suggested change:
- SELECT factorial(a) from VALUES (0::INT), (20::INT), (21::INT), (NULL) AS t(a);
+ SELECT factorial(a) from VALUES (-1::INT), (0::INT), (1::INT), (2::INT), (3::INT), (4::INT), (5::INT), (6::INT), (7::INT), (8::INT), (9::INT), (10::INT), (11::INT), (12::INT), (13::INT), (14::INT), (15::INT), (16::INT), (17::INT), (18::INT), (19::INT), (20::INT), (21::INT), (NULL) AS t(a);
For anyone wondering, this test comes from here:
https://docs.databricks.com/aws/en/sql/language-manual/functions/factorial
And the url above comes from the testing guide:
https://github.com/apache/datafusion/blob/85eebcd25dfbe8e2d2d75d85b8683de8be4851e8/datafusion/sqllogictest/test_files/spark/README.md#finding-test-cases
The last place to check for tests would be:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.factorial.html
But for factorial, the test in the link is already covered by the test here.
And the url above comes from the testing guide:
https://github.com/apache/datafusion/blob/85eebcd25dfbe8e2d2d75d85b8683de8be4851e8/datafusion/sqllogictest/test_files/spark/README.md#finding-test-cases
impl SparkFactorial {
    pub fn new() -> Self {
        Self {
            signature: Signature::uniform(1, vec![Int64], Volatility::Immutable),
Input should be Int32, according to these docs:
https://docs.databricks.com/aws/en/sql/language-manual/functions/factorial
I'm not exactly sure how the .slt tests passed, since the input is specifically Int32 in those tests. There may be some coercion behind the scenes.
To opt out of any potential implicit coercion, you can use Signature::user_defined and do something like this:
fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
    if arg_types.len() != 1 {
        return exec_err!(
            "`factorial` function requires 1 argument, got {}",
            arg_types.len()
        );
    }
    match arg_types[0] {
        DataType::Int32 => Ok(vec![DataType::Int32]),
        _ => exec_err!(
            "`factorial` function does not support type {}",
            arg_types[0]
        ),
    }
}
> I'm not exactly sure how the .slt tests passed, since the input is specifically Int32 in those tests. There may be some coercion behind the scenes.

Yes, I think the DataFusion coercion rules come into effect. DataFusion will attempt to automatically coerce arguments.
I would personally suggest avoiding adding additional error checking as that will result in potentially confusing error messages. Maybe we can use https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.TypeSignature.html#variant.Exact to specify the types.
I don't have any great ideas of how we could have automatically figured out this type discrepancy with tests however 🤔
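A minimal sketch of that Exact-signature suggestion, assuming the usual datafusion_expr imports (illustrative only, not necessarily the form this PR ultimately adopted):

use arrow::datatypes::DataType;
use datafusion_expr::{Signature, Volatility};

// Sketch: an exact signature that only accepts Int32, leaving argument
// coercion to DataFusion's standard rules instead of a manual coerce_types.
fn factorial_signature() -> Signature {
    Signature::exact(vec![DataType::Int32], Volatility::Immutable)
}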
@shehabgamin @alamb Thanks so much for reviewing. I've updated the signature's type and the tests as well.
@alamb I think the DataFusion signatures just need to be more explicit with what they are doing.
gogogogogogo Thanks again @shehabgamin and @tlm365
Which issue does this PR close?
Closes factorial function #16124.

Rationale for this change

What changes are included in this PR?
Implement Spark-compatible factorial function.

Are these changes tested?
Yes.

Are there any user-facing changes?
No.