
[SPARK-52904][PYTHON] Enable convertToArrowArraySafely by default #51596


Status: Open. Wants to merge 1 commit into base: master.

Conversation

@benrobby benrobby commented Jul 21, 2025

What changes were proposed in this pull request?

  • this enables spark.sql.execution.pandas.convertToArrowArraySafely by default
  • This also adjusts unit tests that previously relied on implicit conversions (nanosecond timestamps silently truncated to microseconds with loss of precision, integer overflows) and that start to fail with the new default.
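For users who depend on the previous lenient behavior, the flag can be turned back off per session. A minimal sketch, assuming an active `SparkSession` named `spark` (the config key is the one this PR flips; the snippet itself is illustrative, not part of the PR):

```python
# Hedged sketch: restoring the previous lenient conversion behavior in a
# PySpark session. Assumes `spark` is an existing SparkSession.
spark.conf.set("spark.sql.execution.pandas.convertToArrowArraySafely", "false")
```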

Why are the changes needed?

  • This change aligns PySpark UDF behavior with ANSI SQL behavior in the rest of Spark: on integer overflow, the standard behavior is to throw an error. Users can and should handle such overflow or truncation cases explicitly.
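To illustrate the semantic difference, here is a minimal pure-Python sketch (hypothetical helper functions, not Spark's actual implementation) of silent wraparound versus the ANSI-style raise-on-overflow conversion:

```python
# Hypothetical helpers illustrating unsafe vs. safe int64 -> int32 conversion.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def to_int32_unsafe(value: int) -> int:
    # Wraps around like a C-style narrowing cast (the old lenient behavior).
    return ((value - INT32_MIN) % 2**32) + INT32_MIN

def to_int32_safe(value: int) -> int:
    # Raises instead of silently overflowing (ANSI-style behavior).
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"value {value} out of int32 range")
    return value

print(to_int32_unsafe(2**31))  # -2147483648 (silent wraparound)
```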

Does this PR introduce any user-facing change?

  • Yes. Errors are now raised on integer overflow, floating point truncation, and loss of precision when truncating timestamps. Citing PySpark's upgrade docs:
    =======================================  ===========================  =========================
    PyArrow version                          Integer overflow             Floating point truncation
    =======================================  ===========================  =========================
    0.11.0 and below                         Raise error                  Silently allows
    > 0.11.0, arrowSafeTypeConversion=false  Silent overflow (returns 0)  Silently allows
    > 0.11.0, arrowSafeTypeConversion=true   Raise error                  Raise error
    =======================================  ===========================  =========================

How was this patch tested?

  • adjusted unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jul 21, 2025
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-52904][PYTHON] enable convertToArrowArraySafely by default [WIP][SPARK-52904][PYTHON] Enable convertToArrowArraySafely by default Jul 21, 2025
@benrobby benrobby changed the title [WIP][SPARK-52904][PYTHON] Enable convertToArrowArraySafely by default [SPARK-52904][PYTHON] Enable convertToArrowArraySafely by default Jul 23, 2025
        "Test in '%s' function was failed." % np_name
    ) from e
    finally:
        reset_option("compute.ops_on_diff_frames")
benrobby (Author) commented:
The diff above looks unfortunate; this block is merely indented by one more level.

@benrobby

@HyukjinKwon @zhengruifeng @asl3 could you take a look?

@asl3 (Contributor) left a comment:


Should we add a note to our migration guide docs for the user-facing change?

3 participants