Skip to content

[datafusion-spark] Implement factorical function #16125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 23, 2025
Merged

Conversation

tlm365
Copy link
Contributor

@tlm365 tlm365 commented May 21, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Implement spark-compatible factorical function

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

Tai Le Manh added 2 commits May 21, 2025 08:46
Signed-off-by: Tai Le Manh <tailm2@vingroup.net>
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels May 22, 2025
@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

@tlm365 - I wonder if you would be willing to help pick this PR back up now that we have merged a PR with a bunch of tests from @shehabgamin here:

We are hoping to generate some example PRs to use when starting to fill out the rest of the functions and I was thinking this one might be a good one

@tlm365
Copy link
Contributor Author

tlm365 commented Jun 18, 2025

@tlm365 - I wonder if you would be willing to help pick this PR back up now that we have merged a PR with a bunch of tests from @shehabgamin here:

* [chore: generate basic spark function tests #16409](https://github.com/apache/datafusion/pull/16409)

@alamb Sure! I'd be happy to help pick this PR back. Thanks for pointing me to the recent test additions — that will definitely help.

@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Jun 18, 2025
@tlm365 tlm365 marked this pull request as draft June 18, 2025 08:25
@tlm365 tlm365 marked this pull request as ready for review June 19, 2025 16:47
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tlm365 🙏

Can you please update the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/spark/math/factorial.slt to run (it should work great after this PR)

Thank you!

&self.aliases
}

fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to implement coerce_types manually? I think the signature should handle this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we actually don't need it after all. I've updated it and added the unit tests as well. TYSM!

@alamb
Copy link
Contributor

alamb commented Jun 20, 2025

FYI @shehabgamin

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 21, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tlm365 -- this looks great to me

What do you think @shehabgamin -- anything else that this PR should have (I am hoping to use it as an example when porting other functions)

Int64 => {
let array = as_int64_array(array)?;

let result: Int64Array = array.iter().map(compute_factorial).collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit is that you can potentially use https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary to generate faster code but I don't think that is required

@shehabgamin
Copy link
Contributor

Thank you @tlm365 -- this looks great to me

What do you think @shehabgamin -- anything else that this PR should have (I am hoping to use it as an example when porting other functions)

@alamb @tlm365 I will review this in the next couple of hours!

Copy link
Contributor

@shehabgamin shehabgamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tlm365 Well done! Just a couple minor comments, and after those are addressed, this PR will be in great shape.

Btw, I added thorough comments in the .slt file to provide context about where these tests come from, since this PR will serve as a reference point for future PRs.

cc @alamb

#query
#SELECT factorial(5::int);
query I
SELECT factorial(5::INT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


query I
SELECT factorial(a) from VALUES (0::INT), (20::INT), (21::INT), (NULL) AS t(a);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SELECT factorial(a) from VALUES (0::INT), (20::INT), (21::INT), (NULL) AS t(a);
SELECT factorial(a) from VALUES (-1::INT), (0::INT), (1::INT), (2::INT), (3::INT), (4::INT), (5::INT), (6::INT), (7::INT), (8::INT), (9::INT), (10::INT), (11::INT), (12::INT), (13::INT), (14::INT), (15::INT), (16::INT), (17::INT), (18::INT), (19::INT), (20::INT), (21::INT), (NULL) AS t(a);

For anyone wondering, this test comes from here:
https://docs.databricks.com/aws/en/sql/language-manual/functions/factorial

And the url above comes from the testing guide:
https://github.com/apache/datafusion/blob/85eebcd25dfbe8e2d2d75d85b8683de8be4851e8/datafusion/sqllogictest/test_files/spark/README.md#finding-test-cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

impl SparkFactorial {
pub fn new() -> Self {
Self {
signature: Signature::uniform(1, vec![Int64], Volatility::Immutable),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Input should be Int32, according to these docs:
https://docs.databricks.com/aws/en/sql/language-manual/functions/factorial

I'm not exactly sure how the .slt tests passed, since the input is specifically Int32 in those tests. There may be some coercion behind the scenes perhaps.

To opt out of any potential implicit coercion, you can use Signature::user_defined and do something like this:

fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
    if arg_types.len() != 1 {
        return exec_err!(
            "`factorial` function requires 1 argument, got {}",
            arg_types.len()
        );
    }
    match arg_types[0] {
        DataType::Int32 => Ok(vec![DataType::Int32]),
        _ => exec_err!(
            "`factorial` function does not support type {}",
            arg_types[0]
        ),
    }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly sure how the .slt tests passed, since the input is specifically Int32 in those tests. There may be some coercion behind the scenes perhaps.

Yes I think the DataFusion coercion rules come into effect. DataFusion will attempt to automatically coerce arguments

I would personally suggest avoiding adding additional error checking as that will result in potentially confusing error messages. Maybe we can use https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.TypeSignature.html#variant.Exact to specify the types.

I don't have any great ideas of how we could have automatically figured out this type discrepancy with tests however 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shehabgamin @alamb Thanks so much for reviewing. I've update the signature's type and tests as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb I think the DataFusion signatures just need to be more explicit with what they are doing.

@alamb
Copy link
Contributor

alamb commented Jun 23, 2025

gogogogogogo

THanks again @shehabgamin and @tlm365

@alamb alamb merged commit 9556bcd into apache:main Jun 23, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spark sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[datafusion-spark] Implement factorial function
3 participants