
Conversation

@xianzhe-databricks xianzhe-databricks commented Sep 26, 2025

What changes were proposed in this pull request?

Currently, BinaryType is mapped inconsistently in PySpark:

Cases when it is mapped to bytearray:

  1. Regular UDF without Arrow optimization.
  2. Regular UDF with Arrow optimization and without legacy pandas conversion.
  3. DataFrame APIs, e.g. df.collect(), df.toLocalIterator(), df.foreachPartition(), in both Spark Classic and Spark Connect.
  4. Data source read and write.

Case when it is mapped to bytes:
regular UDF with Arrow optimization and legacy pandas conversion.

This complicates the data mapping model. With this PR, BinaryType will be consistently mapped to bytes in all aforementioned cases.
We gate the change behind a SQL conf, spark.sql.execution.pyspark.binaryAsBytes, and enable the conversion to bytes by default.

This PR is based on #52370
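
To make the inconsistency concrete, here is a minimal sketch against a vanilla PySpark session (behavior before this PR; the exact result depends on the version and conf):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(b"\x00\x01",)], "b binary")

# DataFrame API path: BinaryType came back as bytearray before this change.
[row] = df.collect()
print(type(row["b"]))  # bytearray previously; bytes once the new conf is enabled

# Non-Arrow UDF path: the Python function also received bytearray.
type_name = udf(lambda b: type(b).__name__, "string")
df.select(type_name("b")).show()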

Why are the changes needed?

bytes is more efficient: it is immutable, so it can be shared and reused without defensive copies, unlike the mutable bytearray.
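
A quick illustration of the difference (plain CPython, no Spark required):

b = bytes(b"\x00\x01")
ba = bytearray(b"\x00\x01")

hash(b)        # bytes is immutable and hashable, so it can key dicts and sets
# hash(ba)     # TypeError: bytearray is mutable and therefore unhashable

assert bytes(b) is b        # CPython returns the same object: no copy needed
assert bytes(ba) is not ba  # converting a bytearray must copy the buffer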

Does this PR introduce any user-facing change?

Yes. For the aforementioned cases where BinaryType was mapped to bytearray, the mapping is changed to bytes.

How was this patch tested?

New and updated unit tests covering the affected cases.

Was this patch authored or co-authored using generative AI tooling?

Yes, with the help of Claude Code.

  safecheck,
  input_types,
- int_to_decimal_coercion_enabled=False,
+ int_to_decimal_coercion_enabled,
Author

the default value for int_to_decimal_coercion_enabled is not used at all
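
A hypothetical reduction of the point (names illustrative, not the actual Spark signature): when every call site passes the flag explicitly, a default on the parameter is dead code, and making the parameter required surfaces any missed call site.

def create_converter(input_type: str, int_to_decimal_coercion_enabled: bool) -> str:
    # With no default, forgetting to thread the conf through fails loudly.
    return f"converter({input_type}, coerce={int_to_decimal_coercion_enabled})"

conf_value = True  # stands in for the SQL conf lookup at the call site
print(create_converter("decimal(10,2)", int_to_decimal_coercion_enabled=conf_value))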

@github-actions github-actions bot added the DOCS label Sep 29, 2025
@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark UDF [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark arrow UDF Sep 29, 2025
@github-actions github-actions bot added the BUILD label Sep 29, 2025
Contributor

@allisonwang-db allisonwang-db left a comment

Can we also add this change to the migration guide?

@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark arrow UDF [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark Sep 30, 2025
@github-actions github-actions bot removed the BUILD label Sep 30, 2025
@xianzhe-databricks xianzhe-databricks marked this pull request as ready for review September 30, 2025 15:06
@xianzhe-databricks
Author

Can we also add this change to the migration guide?

Where shall I add it? I found that the PySpark migration guide is archived: https://github.com/apache/spark/blob/master/docs/pyspark-migration-guide.md

@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark [SPARK-53696][PYTHON][CONNECT]Default to bytes for BinaryType in PySpark Oct 2, 2025
@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PYTHON][CONNECT]Default to bytes for BinaryType in PySpark [SPARK-53696][PYTHON][CONNECT][SQL]Default to bytes for BinaryType in PySpark Oct 2, 2025
Comment on lines +51 to +58
@unittest.skip("Duplicate test as it is already tested in ArrowPythonUDFLegacyTests.")
def test_udf_binary_type(self):
    super().test_udf_binary_type()

@unittest.skip("Duplicate test as it is already tested in ArrowPythonUDFLegacyTests.")
def test_udf_binary_type_in_nested_structures(self):
    super().test_udf_binary_type_in_nested_structures()

Contributor

why do we add tests then skip them?

Author

ArrowPythonUDFParityLegacyTestsMixin inherits from ArrowPythonUDFTestsMixin, and ArrowPythonUDFTestsMixin already contains this test.
We already run it in test_arrow_python_udf.py, so re-running it here adds nothing and skipping it saves some test resources.
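
A hypothetical reduction of that class layout: because one mixin inherits the other, the inherited test would otherwise run twice.

import unittest

class ArrowPythonUDFTestsMixin:
    def test_udf_binary_type(self):
        ...  # exercised once via the suite that mixes this in

class ArrowPythonUDFParityLegacyTestsMixin(ArrowPythonUDFTestsMixin):
    @unittest.skip("Duplicate test as it is already run via the base mixin.")
    def test_udf_binary_type(self):
        super().test_udf_binary_type()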

schema: StructType,
*,
return_as_tuples: bool = False,
binary_as_bytes: bool = True,
Contributor

maybe this should be controlled by a flag?

Author

It is already controlled by a flag: binary_as_bytes is passed at the call site with the value of the SQL conf spark.sql.execution.pyspark.binaryAsBytes.

It is not possible, or at least against the existing style, to access the SQL conf directly in conversion.py.
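
A minimal sketch of that pattern (function names hypothetical): the caller reads the conf once and threads a plain boolean into conversion.py.

def convert_binary(value: bytearray, binary_as_bytes: bool):
    # conversion.py stays conf-free; it only ever sees the boolean.
    return bytes(value) if binary_as_bytes else value

def caller(spark, value: bytearray):
    conf = spark.conf.get("spark.sql.execution.pyspark.binaryAsBytes", "true")
    return convert_binary(value, binary_as_bytes=(conf.lower() == "true"))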

@xianzhe-databricks
Author

@ueshin @Yicong-Huang do you have other concerns? Thanks a lot!
I'll fix the linter once we agree on the main code changes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-53696][PYTHON][CONNECT][SQL]Default to bytes for BinaryType in PySpark [SPARK-53696][PYTHON][CONNECT][SQL] Default to bytes for BinaryType in PySpark Oct 15, 2025
@HyukjinKwon
Member

Let's fix up the linter failure and merge

xianzhe-databricks and others added 2 commits October 15, 2025 09:25
Add type: ignore[misc] comments to suppress mypy errors about
"None not callable" in converter functions. These comments are
needed because mypy cannot infer that converters are callable
after None checks when the binary_as_bytes parameter is added
to the type signature.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add type ignore comment to suppress mypy error about "None" not callable
in ArrowTableToRowsConversion._create_converter. The field_convs[i] can be
None, but the conditional expression structure causes mypy to analyze the
call before the None check.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
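
A hypothetical reduction of the mypy issue both commits describe: narrowing from an `is not None` check does not carry over to a repeated subscript expression, so the guarded call is still flagged.

from typing import Callable, List, Optional

field_convs: List[Optional[Callable[[int], int]]] = [None, lambda x: x + 1]

def convert(i: int, v: int) -> int:
    # mypy reports '"None" not callable' here despite the guard, because
    # field_convs[i] is looked up twice and subscripts are not narrowed.
    return field_convs[i](v) if field_convs[i] is not None else v  # type: ignore[misc]
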
@xianzhe-databricks
Author

I did another pass and fixed the linter error. From my point of view the PR is ready to merge. @HyukjinKwon @ueshin @allisonwang-db thank you!

Member

@ueshin ueshin left a comment

Shall we add a migration guide?

Otherwise, LGTM.

Contributor

@allisonwang-db allisonwang-db left a comment

Looks good!


val PYSPARK_BINARY_AS_BYTES =
  buildConf("spark.sql.execution.pyspark.binaryAsBytes")
    .doc("When true, BinaryType is mapped consistently to bytes in PySpark." +
Contributor

"Mapped" here means the function input will be mapped as bytes right?

Author

exactly
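
Concretely (a sketch, assuming an active spark session): with the conf enabled, a BinaryType column arrives in the Python function as bytes.

from pyspark.sql.functions import udf

df = spark.createDataFrame([(b"\x01",)], "b binary")
input_type = udf(lambda b: type(b).__name__, "string")
df.select(input_type("b")).show()  # expected: bytes when binaryAsBytes is true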

for idx, field in enumerate(result_df.schema.fields):
    self.assertEqual(field.dataType, expected_output_types[idx])

def test_udtf_binary_type(self):
Contributor

Can we also add one more test for test_arrow_udtf?

Author

Which test did you mean? The same test case but applied to Arrow UDTF? See the other thread: since the test class for Arrow UDTF inherits from the class for regular UDTF, it is already covered.

for conf_value in ["true", "false"]:
    with self.sql_conf({"spark.sql.execution.pyspark.binaryAsBytes": conf_value}):
        result = BinaryTypeUDTF(lit(b"test")).collect()
        self.assertEqual(result[0]["type_name"], "bytes")
Author

@allisonwang-db Arrow UDTF with legacy conversion is tested here
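
For context, a hypothetical shape of the BinaryTypeUDTF under test (not the actual test code): it simply reports the Python type name of the binary value it receives.

from pyspark.sql.functions import udtf

@udtf(returnType="type_name: string", useArrow=True)  # useArrow assumed for the Arrow path
class BinaryTypeUDTF:
    def eval(self, b):
        yield (type(b).__name__,)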

def test_udtf_binary_type(self):
    # For Arrow Python UDTF with non-legacy conversion, BinaryType is mapped to
    # bytes or bytearray consistently with non-Arrow Python UDTF behavior.
    BaseUDTFTestsMixin.test_udtf_binary_type(self)
Author

@allisonwang-db Arrow UDTF without legacy conversion is tested here

DataFrame APIs (both Spark Classic and Spark Connect)                           ``bytearray``
Data sources                                                                    ``bytearray``
Arrow-optimized UDF and UDTF with unnecessary conversion to pandas instances    ``bytes``
=============================================================================== ==============================
Author

@ueshin @allisonwang-db the migration guide entry is added!
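
For users who depend on the old behavior (for example, code that mutates the returned buffer in place), the mapping can be reverted with the conf introduced by this PR:

spark.conf.set("spark.sql.execution.pyspark.binaryAsBytes", "false")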
