[SPARK-51206][PYTHON][CONNECT] Move Arrow conversion helpers out of Spark Connect #49941
Conversation
…park connect module
cc @HyukjinKwon
```python
elif isinstance(dataType, ArrayType):
    return ArrowTableToRowsConversion._need_converter(dataType.elementType)
elif isinstance(dataType, MapType):
    # Different from PySpark, here always needs conversion,
```
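(For readers outside the diff: the branch above is part of a recursive, per-type check that decides whether a Python-side converter is needed at all. A hedged, simplified sketch of that pattern, not the real `ArrowTableToRowsConversion` code:)

```python
from pyspark.sql.types import ArrayType, DataType, MapType, StructType


def need_converter(dt: DataType) -> bool:
    # Structs and maps always need conversion; arrays inherit the need
    # from their element type; other types pass through unchanged.
    if isinstance(dt, (StructType, MapType)):
        return True
    if isinstance(dt, ArrayType):
        return need_converter(dt.elementType)
    return False
```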
I remember there is another version of conversion for PySpark classic. Do we plan to unify them in the future?
Are you referring to `pyspark.sql.types._create_converter`? Seems like there are some differences; for example, `ArrowTableToRowsConversion` handles UDT and variant but the one in `types.py` doesn't.
Can we check where `_create_converter` is used and see if we can merge these two methods?
That `_create_converter` turns dicts (`StructType`) into tuples of values. It has nothing to do with Arrow conversion. Sorry for the confusion 🤦
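For context, a minimal sketch of that dict-to-tuple normalization (an illustration of the behavior described above, not the actual `pyspark.sql.types._create_converter` implementation):

```python
# Illustrative only: order dict values by the struct's field names,
# which is the normalization described in the comment above.
from pyspark.sql.types import StructType


def make_dict_to_tuple_converter(schema: StructType):
    names = schema.fieldNames()

    def convert(obj):
        if isinstance(obj, dict):
            return tuple(obj.get(name) for name in names)
        return obj

    return convert
```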
Actually, I meant `python/pyspark/sql/pandas/conversion.py`. We can investigate it later.
Merged to master and branch-4.0.
Closes #49941 from wengh/spark-51206-pyds-fix-dependency.

Authored-by: Haoyu Weng <wenghy02@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 727167a)
### What changes were proposed in this pull request?

Refactor `pyspark.sql.connect.conversion` to move `LocalDataToArrowConversion` and `ArrowTableToRowsConversion` into `pyspark.sql.conversion`. The reason is that `pyspark.sql.connect.conversion` checks for Spark Connect dependencies such as `pyarrow`, `grpcio` and `pandas`, but `LocalDataToArrowConversion` and `ArrowTableToRowsConversion` only need `pyarrow`. `pyspark.sql.connect.conversion` still re-exports the two classes for backward compatibility.
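For illustration, the backward-compatible re-export described above can be a thin alias module. A hedged sketch of the shape, not the exact patch contents:

```python
# python/pyspark/sql/connect/conversion.py (illustrative sketch)
# The classes now live in pyspark.sql.conversion; re-exporting them here
# keeps existing `from pyspark.sql.connect.conversion import ...` working.
from pyspark.sql.conversion import (  # noqa: F401
    ArrowTableToRowsConversion,
    LocalDataToArrowConversion,
)
```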
### Why are the changes needed?

Python Data Sources should work without Spark Connect dependencies, but currently they import `LocalDataToArrowConversion` and `ArrowTableToRowsConversion` from `pyspark.sql.connect.conversion`, which drags in unnecessary dependencies. This change moves these two classes to `pyspark.sql.conversion` so that Python Data Sources run without Spark Connect dependencies.

### Does this PR introduce _any_ user-facing change?

Relaxed requirements for using Python Data Sources.
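Concretely, callers that previously imported the helpers from the Connect module can switch to the new location; a small sketch based on the move described above:

```python
# Old import path (kept working via the re-export, but pulls in the
# Spark Connect module and its dependency checks):
# from pyspark.sql.connect.conversion import ArrowTableToRowsConversion

# New import path after this PR (only needs pyarrow):
from pyspark.sql.conversion import ArrowTableToRowsConversion
```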
### How was this patch tested?
Existing tests should make sure that the changes don't break anything. Manually tested to ensure that Python Data Sources can run without `grpcio` and `pandas`.
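As a hedged illustration of that manual test, a minimal Python Data Source along these lines (class names and the `n` option are hypothetical) exercises the moved conversion helpers without importing `grpcio` or `pandas`:

```python
# Hypothetical data source for a smoke test; it uses only the public
# pyspark.sql.datasource API and plain Python values.
from pyspark.sql.datasource import DataSource, DataSourceReader


class RangeDataSource(DataSource):
    @classmethod
    def name(cls):
        return "simple_range"

    def schema(self):
        return "id INT, label STRING"

    def reader(self, schema):
        return RangeReader(int(self.options.get("n", 3)))


class RangeReader(DataSourceReader):
    def __init__(self, n):
        self.n = n

    def read(self, partition):
        # Yield plain tuples; Spark converts them via the Arrow helpers.
        for i in range(self.n):
            yield (i, f"row-{i}")
```

Registering it with `spark.dataSource.register(RangeDataSource)` and reading via `spark.read.format("simple_range").option("n", "5").load()` runs the read path end to end.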
### Was this patch authored or co-authored using generative AI tooling?
No