
Conversation

@wengh (Contributor) commented Feb 13, 2025

What changes were proposed in this pull request?

Refactor `pyspark.sql.connect.conversion` to move `LocalDataToArrowConversion` and `ArrowTableToRowsConversion` into `pyspark.sql.conversion`.

The reason is that `pyspark.sql.connect.conversion` checks for Spark Connect dependencies such as `pyarrow`, `grpcio`, and `pandas`, but `LocalDataToArrowConversion` and `ArrowTableToRowsConversion` only need `pyarrow`.

`pyspark.sql.connect.conversion` still re-exports the two classes for backward compatibility.
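A minimal sketch of what such a backward-compatible re-export could look like (illustrative only; the exact code in the PR may differ):

```python
# pyspark/sql/connect/conversion.py (sketch, not the actual diff):
# re-export the moved classes so existing imports keep working.
from pyspark.sql.conversion import (  # noqa: F401
    ArrowTableToRowsConversion,
    LocalDataToArrowConversion,
)
```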

Why are the changes needed?

Python Data Sources should work without Spark Connect dependencies, but they currently import `LocalDataToArrowConversion` and `ArrowTableToRowsConversion` from `pyspark.sql.connect.conversion`, which pulls in unnecessary dependencies. This change moves the two classes to `pyspark.sql.conversion` so that Python Data Sources run without Spark Connect dependencies.

Does this PR introduce any user-facing change?

Relaxed requirements for using Python Data Sources.

How was this patch tested?

Existing tests should make sure that the changes don't break anything.

Manually tested to ensure that Python Data Sources can run without grpcio and pandas.
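A hedged sketch of that manual check; it assumes an environment where grpcio and pandas are uninstalled, and only verifies that the moved classes import cleanly:

```python
import importlib.util

# Spark Connect extras should be absent in this environment
# (pyarrow is still required by the conversion helpers).
for mod in ("grpc", "pandas"):
    assert importlib.util.find_spec(mod) is None, f"{mod} is installed"

# The new home of the classes; importing must not pull in grpcio/pandas.
from pyspark.sql.conversion import (
    ArrowTableToRowsConversion,
    LocalDataToArrowConversion,
)

print("Arrow conversion helpers imported without Spark Connect dependencies")
```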

Was this patch authored or co-authored using generative AI tooling?

No

@allisonwang-db (Contributor) commented:

cc @HyukjinKwon

@wengh wengh requested a review from allisonwang-db February 14, 2025 23:04
```python
elif isinstance(dataType, ArrayType):
    return ArrowTableToRowsConversion._need_converter(dataType.elementType)
elif isinstance(dataType, MapType):
    # Different from PySpark, here always needs conversion,
```
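For context on the comment above, an illustration (not from the PR) of why Arrow maps always need conversion: pyarrow hands map values back as a list of key/value tuples, while a PySpark Row represents MapType as a dict.

```python
import pyarrow as pa

arr = pa.array(
    [[("a", 1), ("b", 2)]],
    type=pa.map_(pa.string(), pa.int64()),
)
pairs = arr[0].as_py()  # [('a', 1), ('b', 2)] -- list of tuples from Arrow
as_map = dict(pairs)    # {'a': 1, 'b': 2} -- the shape a Row expects
print(pairs, as_map)
```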
Contributor:

I remember there is another version of conversion for PySpark classic; do we plan to unify them in the future?

@wengh (Contributor Author), Feb 18, 2025:

Are you referring to `pyspark.sql.types._create_converter`? Seems like there are some differences; for example, `ArrowTableToRowsConversion` handles UDT and variant types, but the one in types.py doesn't.
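For reference, a minimal sketch of how `ArrowTableToRowsConversion` is typically invoked; the `convert(table, schema)` signature is assumed from the Spark sources and may differ across versions:

```python
import pyarrow as pa
from pyspark.sql.conversion import ArrowTableToRowsConversion
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType(
    [StructField("id", IntegerType()), StructField("name", StringType())]
)
table = pa.table({"id": pa.array([1, 2], pa.int32()), "name": ["a", "b"]})

# Converts each Arrow column back into Python values and zips them into Rows.
rows = ArrowTableToRowsConversion.convert(table, schema)
print(rows)  # e.g. [Row(id=1, name='a'), Row(id=2, name='b')]
```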

Contributor:

Can we check where `_create_converter` is used and see if we can merge these two methods?

Contributor Author (@wengh):

That `_create_converter` turns dicts (StructType) into tuples of values. It has nothing to do with Arrow conversion. Sorry for the confusion 🤦
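A tiny illustration of the dict-to-tuple behavior described above; it assumes the interface of the private helper `pyspark.sql.types._create_converter`, which may change between versions:

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from pyspark.sql.types import _create_converter  # private helper

schema = StructType(
    [StructField("a", IntegerType()), StructField("b", StringType())]
)
conv = _create_converter(schema)
# Drops the field names: a dict matching the struct becomes a plain tuple,
# ordered by the schema's fields.
print(conv({"b": "x", "a": 1}))  # (1, 'x')
```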

Contributor:

Actually, I meant `python/pyspark/sql/pandas/conversion.py`. We can investigate it later.

@HyukjinKwon (Member) commented:

Merged to master and branch-4.0.

HyukjinKwon pushed a commit that referenced this pull request Feb 27, 2025
…park Connect

Closes #49941 from wengh/spark-51206-pyds-fix-dependency.

Authored-by: Haoyu Weng <wenghy02@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 727167a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
…park Connect

Closes apache#49941 from wengh/spark-51206-pyds-fix-dependency.

Authored-by: Haoyu Weng <wenghy02@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>