
Conversation

@xianzhe-databricks xianzhe-databricks commented Sep 26, 2025

What changes were proposed in this pull request?

Currently, BinaryType is mapped inconsistently in PySpark:

Cases when it is mapped to bytearray:

  1. Regular UDF without Arrow optimization.
  2. Regular UDF with Arrow optimization and without legacy pandas conversion.
  3. DataFrame APIs, e.g. df.collect(), df.toLocalIterator(), df.foreachPartition(), in both Spark Classic and Spark Connect.
  4. Data source read and write.

Case when it is mapped to bytes:
regular UDF with Arrow optimization and legacy pandas conversion.

This complicates the data mapping model. With this PR, BinaryType will be consistently mapped to bytes in all aforementioned cases.
We gate the change behind a SQL conf, spark.sql.execution.pyspark.binaryAsBytes, and enable the conversion to bytes by default.

This PR is based on #52370
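
To make the inconsistency concrete, here is a minimal sketch against a vanilla PySpark session (behavior before this PR; the exact result depends on the version and conf):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(b"\x00\x01",)], "b binary")

# DataFrame API path: BinaryType came back as bytearray before this change.
[row] = df.collect()
print(type(row["b"]))  # bytearray previously; bytes once the new conf is enabled

# Non-Arrow UDF path: the Python function also received bytearray.
type_name = udf(lambda b: type(b).__name__, "string")
df.select(type_name("b")).show()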

Why are the changes needed?

bytes is more efficient: it is immutable, so it can be shared and reused without defensive copies, unlike the mutable bytearray.
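
A quick illustration of the difference (plain CPython, no Spark required):

b = bytes(b"\x00\x01")
ba = bytearray(b"\x00\x01")

hash(b)        # bytes is immutable and hashable, so it can key dicts and sets
# hash(ba)     # TypeError: bytearray is mutable and therefore unhashable

assert bytes(b) is b        # CPython returns the same object: no copy needed
assert bytes(ba) is not ba  # converting a bytearray must copy the buffer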

Does this PR introduce any user-facing change?

Yes. For the aforementioned cases where BinaryType was mapped to bytearray, the mapping is changed to bytes.

How was this patch tested?

New and updated unit tests covering the affected cases.

Was this patch authored or co-authored using generative AI tooling?

Yes, with the help of Claude Code.

  safecheck,
  input_types,
- int_to_decimal_coercion_enabled=False,
+ int_to_decimal_coercion_enabled,
Author

the default value for int_to_decimal_coercion_enabled is not used at all
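
A hypothetical reduction of the point (names illustrative, not the actual Spark signature): when every call site passes the flag explicitly, a default on the parameter is dead code, and making the parameter required surfaces any missed call site.

def create_converter(input_type: str, int_to_decimal_coercion_enabled: bool) -> str:
    # With no default, forgetting to thread the conf through fails loudly.
    return f"converter({input_type}, coerce={int_to_decimal_coercion_enabled})"

conf_value = True  # stands in for the SQL conf lookup at the call site
print(create_converter("decimal(10,2)", int_to_decimal_coercion_enabled=conf_value))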

@github-actions github-actions bot added the DOCS label Sep 29, 2025
@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark UDF [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark arrow UDF Sep 29, 2025
@github-actions github-actions bot added the BUILD label Sep 29, 2025
Contributor

@allisonwang-db allisonwang-db left a comment

Can we also add this change to the migration guide?

@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark arrow UDF [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark Sep 30, 2025
@github-actions github-actions bot removed the BUILD label Sep 30, 2025
@xianzhe-databricks xianzhe-databricks marked this pull request as ready for review September 30, 2025 15:06
@xianzhe-databricks
Author

Can we also add this change to the migration guide?

Where shall I add it? I found that the PySpark migration guide is archived: https://github.com/apache/spark/blob/master/docs/pyspark-migration-guide.md

@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark [SPARK-53696][PYTHON][CONNECT]Default to bytes for BinaryType in PySpark Oct 2, 2025
@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PYTHON][CONNECT]Default to bytes for BinaryType in PySpark [SPARK-53696][PYTHON][CONNECT][SQL]Default to bytes for BinaryType in PySpark Oct 2, 2025
Comment on lines +51 to +58
@unittest.skip("Duplicate test as it is already tested in ArrowPythonUDFLegacyTests.")
def test_udf_binary_type(self):
    super().test_udf_binary_type()

@unittest.skip("Duplicate test as it is already tested in ArrowPythonUDFLegacyTests.")
def test_udf_binary_type_in_nested_structures(self):
    super().test_udf_binary_type_in_nested_structures()

Contributor

why do we add tests then skip them?

Author

ArrowPythonUDFParityLegacyTestsMixin inherits from ArrowPythonUDFTestsMixin, and ArrowPythonUDFTestsMixin already contains this test.
We already run it in test_arrow_python_udf.py, so re-running it here adds nothing and skipping it saves some test resources.
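
A hypothetical reduction of that class layout: because one mixin inherits the other, the inherited test would otherwise run twice.

import unittest

class ArrowPythonUDFTestsMixin:
    def test_udf_binary_type(self):
        ...  # exercised once via the suite that mixes this in

class ArrowPythonUDFParityLegacyTestsMixin(ArrowPythonUDFTestsMixin):
    @unittest.skip("Duplicate test as it is already run via the base mixin.")
    def test_udf_binary_type(self):
        super().test_udf_binary_type()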

schema: StructType,
*,
return_as_tuples: bool = False,
binary_as_bytes: bool = True,
Contributor

maybe this should be controlled by a flag?

Author

It is already controlled by a flag: binary_as_bytes is passed at the call site with the value of the SQL conf spark.sql.execution.pyspark.binaryAsBytes.

It is not possible, or at least against the existing style, to access the SQL conf directly in conversion.py.
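
A minimal sketch of that pattern (function names hypothetical): the caller reads the conf once and threads a plain boolean into conversion.py.

def convert_binary(value: bytearray, binary_as_bytes: bool):
    # conversion.py stays conf-free; it only ever sees the boolean.
    return bytes(value) if binary_as_bytes else value

def caller(spark, value: bytearray):
    conf = spark.conf.get("spark.sql.execution.pyspark.binaryAsBytes", "true")
    return convert_binary(value, binary_as_bytes=(conf.lower() == "true"))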

@xianzhe-databricks
Author

@ueshin @Yicong-Huang do you have other concerns? Thanks a lot!
I'll fix the linter once we agree on the main code changes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-53696][PYTHON][CONNECT][SQL]Default to bytes for BinaryType in PySpark [SPARK-53696][PYTHON][CONNECT][SQL] Default to bytes for BinaryType in PySpark Oct 15, 2025
@HyukjinKwon
Member

Let's fix up the linter failure and merge

xianzhe-databricks and others added 2 commits October 15, 2025 09:25
Add type: ignore[misc] comments to suppress mypy errors about
"None not callable" in converter functions. These comments are
needed because mypy cannot infer that converters are callable
after None checks when the binary_as_bytes parameter is added
to the type signature.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add type ignore comment to suppress mypy error about "None" not callable
in ArrowTableToRowsConversion._create_converter. The field_convs[i] can be
None, but the conditional expression structure causes mypy to analyze the
call before the None check.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
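
A hypothetical reduction of the mypy issue both commits describe: narrowing from an `is not None` check does not carry over to a repeated subscript expression, so the guarded call is still flagged.

from typing import Callable, List, Optional

field_convs: List[Optional[Callable[[int], int]]] = [None, lambda x: x + 1]

def convert(i: int, v: int) -> int:
    # mypy reports '"None" not callable' here despite the guard, because
    # field_convs[i] is looked up twice and subscripts are not narrowed.
    return field_convs[i](v) if field_convs[i] is not None else v  # type: ignore[misc]
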
@xianzhe-databricks
Author

I did another pass and fixed the linter error. From my point of view the PR is ready to merge. @HyukjinKwon @ueshin @allisonwang-db thank you!

Member

@ueshin ueshin left a comment

Shall we add a migration guide?

Otherwise, LGTM.

Contributor

@allisonwang-db allisonwang-db left a comment

Looks good!


val PYSPARK_BINARY_AS_BYTES =
  buildConf("spark.sql.execution.pyspark.binaryAsBytes")
    .doc("When true, BinaryType is mapped consistently to bytes in PySpark." +
Contributor

"Mapped" here means the function input will be mapped as bytes right?

Author

exactly
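
Concretely (a sketch, assuming an active spark session): with the conf enabled, a BinaryType column arrives in the Python function as bytes.

from pyspark.sql.functions import udf

df = spark.createDataFrame([(b"\x01",)], "b binary")
input_type = udf(lambda b: type(b).__name__, "string")
df.select(input_type("b")).show()  # expected: bytes when binaryAsBytes is true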

for idx, field in enumerate(result_df.schema.fields):
    self.assertEqual(field.dataType, expected_output_types[idx])

def test_udtf_binary_type(self):
Contributor

Can we also add one more test for test_arrow_udtf?

Author

Which test did you mean? The same test case but applied to Arrow UDTF? See the other thread: since the test class for Arrow UDTF inherits from the class for regular UDTF, it is already covered.

for conf_value in ["true", "false"]:
    with self.sql_conf({"spark.sql.execution.pyspark.binaryAsBytes": conf_value}):
        result = BinaryTypeUDTF(lit(b"test")).collect()
        self.assertEqual(result[0]["type_name"], "bytes")
Author

@allisonwang-db Arrow UDTF with legacy conversion is tested here
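
For context, a hypothetical shape of the BinaryTypeUDTF under test (not the actual test code): it simply reports the Python type name of the binary value it receives.

from pyspark.sql.functions import udtf

@udtf(returnType="type_name: string", useArrow=True)  # useArrow assumed for the Arrow path
class BinaryTypeUDTF:
    def eval(self, b):
        yield (type(b).__name__,)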

def test_udtf_binary_type(self):
    # For Arrow Python UDTF with non-legacy conversion, BinaryType is mapped to
    # bytes or bytearray consistently with non-Arrow Python UDTF behavior.
    BaseUDTFTestsMixin.test_udtf_binary_type(self)
Author

@allisonwang-db Arrow UDTF without legacy conversion is tested here

DataFrame APIs (both Spark Classic and Spark Connect)                           ``bytearray``
Data sources                                                                    ``bytearray``
Arrow-optimized UDF and UDTF with unnecessary conversion to pandas instances    ``bytes``
=============================================================================== ==============================
Author

@ueshin @allisonwang-db the migration guide entry is added!
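
For users who depend on the old behavior (for example, code that mutates the returned buffer in place), the mapping can be reverted with the conf introduced by this PR:

spark.conf.set("spark.sql.execution.pyspark.binaryAsBytes", "false")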
