
Conversation

@wengh (Contributor) commented Feb 13, 2025

What changes were proposed in this pull request?

Refactor `pyspark.sql.connect.conversion` to move `LocalDataToArrowConversion` and `ArrowTableToRowsConversion` into `pyspark.sql.conversion`.

The reason is that `pyspark.sql.connect.conversion` checks for Spark Connect dependencies such as `pyarrow`, `grpcio`, and `pandas`, but `LocalDataToArrowConversion` and `ArrowTableToRowsConversion` only need `pyarrow`.

`pyspark.sql.connect.conversion` still re-exports the two classes for backward compatibility.
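A minimal sketch of what such a backward-compatible re-export could look like (illustrative only; the exact code in the PR may differ):

```python
# pyspark/sql/connect/conversion.py (sketch, not the actual diff):
# re-export the moved classes so existing imports keep working.
from pyspark.sql.conversion import (  # noqa: F401
    ArrowTableToRowsConversion,
    LocalDataToArrowConversion,
)
```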

Why are the changes needed?

Python Data Sources should work without Spark Connect dependencies, but they currently import `LocalDataToArrowConversion` and `ArrowTableToRowsConversion` from `pyspark.sql.connect.conversion`, which pulls in unnecessary dependencies. This change moves the two classes to `pyspark.sql.conversion` so that Python Data Sources run without Spark Connect dependencies.

Does this PR introduce any user-facing change?

Relaxed requirements for using Python Data Sources.

How was this patch tested?

Existing tests should make sure that the changes don't break anything.

Manually tested to ensure that Python Data Sources can run without grpcio and pandas.
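A hedged sketch of that manual check; it assumes an environment where grpcio and pandas are uninstalled, and only verifies that the moved classes import cleanly:

```python
import importlib.util

# Spark Connect extras should be absent in this environment
# (pyarrow is still required by the conversion helpers).
for mod in ("grpc", "pandas"):
    assert importlib.util.find_spec(mod) is None, f"{mod} is installed"

# The new home of the classes; importing must not pull in grpcio/pandas.
from pyspark.sql.conversion import (
    ArrowTableToRowsConversion,
    LocalDataToArrowConversion,
)

print("Arrow conversion helpers imported without Spark Connect dependencies")
```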

Was this patch authored or co-authored using generative AI tooling?

No

@allisonwang-db (Contributor) commented:

cc @HyukjinKwon

@wengh wengh requested a review from allisonwang-db February 14, 2025 23:04
```python
elif isinstance(dataType, ArrayType):
    return ArrowTableToRowsConversion._need_converter(dataType.elementType)
elif isinstance(dataType, MapType):
    # Different from PySpark, here always needs conversion,
```
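For context on the comment above, an illustration (not from the PR) of why Arrow maps always need conversion: pyarrow hands map values back as a list of key/value tuples, while a PySpark Row represents MapType as a dict.

```python
import pyarrow as pa

arr = pa.array(
    [[("a", 1), ("b", 2)]],
    type=pa.map_(pa.string(), pa.int64()),
)
pairs = arr[0].as_py()  # [('a', 1), ('b', 2)] -- list of tuples from Arrow
as_map = dict(pairs)    # {'a': 1, 'b': 2} -- the shape a Row expects
print(pairs, as_map)
```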
Contributor:

I remember there is another version of conversion for PySpark classic; do we plan to unify them in the future?

@wengh (Contributor Author), Feb 18, 2025:

Are you referring to `pyspark.sql.types._create_converter`? Seems like there are some differences; for example, `ArrowTableToRowsConversion` handles UDT and variant types, but the one in types.py doesn't.
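For reference, a minimal sketch of how `ArrowTableToRowsConversion` is typically invoked; the `convert(table, schema)` signature is assumed from the Spark sources and may differ across versions:

```python
import pyarrow as pa
from pyspark.sql.conversion import ArrowTableToRowsConversion
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType(
    [StructField("id", IntegerType()), StructField("name", StringType())]
)
table = pa.table({"id": pa.array([1, 2], pa.int32()), "name": ["a", "b"]})

# Converts each Arrow column back into Python values and zips them into Rows.
rows = ArrowTableToRowsConversion.convert(table, schema)
print(rows)  # e.g. [Row(id=1, name='a'), Row(id=2, name='b')]
```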

Contributor:

Can we check where `_create_converter` is used and see if we can merge these two methods?

Contributor Author (@wengh):

That `_create_converter` turns dicts (StructType) into tuples of values. It has nothing to do with Arrow conversion. Sorry for the confusion 🤦
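A tiny illustration of the dict-to-tuple behavior described above; it assumes the interface of the private helper `pyspark.sql.types._create_converter`, which may change between versions:

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from pyspark.sql.types import _create_converter  # private helper

schema = StructType(
    [StructField("a", IntegerType()), StructField("b", StringType())]
)
conv = _create_converter(schema)
# Drops the field names: a dict matching the struct becomes a plain tuple,
# ordered by the schema's fields.
print(conv({"b": "x", "a": 1}))  # (1, 'x')
```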

Contributor:

Actually, I meant `python/pyspark/sql/pandas/conversion.py`. We can investigate it later.

@HyukjinKwon (Member) commented:

Merged to master and branch-4.0.

HyukjinKwon pushed a commit that referenced this pull request Feb 27, 2025
…park Connect

Closes #49941 from wengh/spark-51206-pyds-fix-dependency.

Authored-by: Haoyu Weng <wenghy02@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 727167a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
…park Connect

Closes apache#49941 from wengh/spark-51206-pyds-fix-dependency.

Authored-by: Haoyu Weng <wenghy02@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>