[SPARK-52355][PYTHON] Infer VariantVal object type as VariantType when creating a DataFrame
### What changes were proposed in this pull request?
When creating a `DataFrame` from Python using `spark.createDataFrame`, infer the type of any `VariantVal` objects as `VariantType`. This is implemented by adding a case mapping `VariantVal` to `VariantType` in the `pyspark.sql.types._infer_type` function.
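The shape of such a case mapping can be sketched roughly as follows. This is a simplified model, not the actual body of `_infer_type`: the classes `VariantVal` and `VariantType` below are hypothetical stand-ins defined locally so the snippet runs without Spark, and `infer_type_sketch` is an illustrative name.

```python
# Hypothetical stand-ins for illustration; these are NOT the PySpark classes.
class VariantVal:
    def __init__(self, value: bytes, metadata: bytes):
        self.value = value
        self.metadata = metadata


class VariantType:
    def simple_string(self) -> str:
        return "variant"


class BinaryType:
    def simple_string(self) -> str:
        return "binary"


class StringType:
    def simple_string(self) -> str:
        return "string"


def infer_type_sketch(obj):
    """Simplified model of type inference with an explicit VariantVal case."""
    # The case this PR describes: recognize VariantVal directly,
    # before any generic handling of arbitrary Python objects.
    if isinstance(obj, VariantVal):
        return VariantType()
    if isinstance(obj, (bytes, bytearray)):
        return BinaryType()
    if isinstance(obj, str):
        return StringType()
    raise TypeError(f"not supported type: {type(obj)}")


inferred = infer_type_sketch(VariantVal(b"\x0c\x01", b"\x01\x00\x00"))
print(inferred.simple_string())
```

The point of the sketch is ordering: the `VariantVal` check has to run before any fallback that would otherwise inspect the object generically.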
### Why are the changes needed?
Currently, when creating a `DataFrame` that includes locally instantiated `VariantVal` objects in Python, the type is inferred as `struct<metadata:binary,value:binary>` rather than `VariantType`. This leads to unintended behavior when creating a `DataFrame` locally, and in situations such as `df.rdd.map(...).toDF`, which calls `createDataFrame` under the hood. The bug occurs only when the schema of the `DataFrame` is not passed explicitly.
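One plausible reading of where the `struct<metadata:binary,value:binary>` result comes from is a generic fallback that treats an unrecognized object's instance attributes, in sorted order, as struct fields. The sketch below models that assumption only; the PR text does not confirm the mechanism, and `VariantVal` and `infer_without_variant_case` here are hypothetical stand-ins, not PySpark code.

```python
class VariantVal:
    """Hypothetical stand-in with the two attributes named in this PR."""

    def __init__(self, value: bytes, metadata: bytes):
        self.value = value
        self.metadata = metadata


def infer_without_variant_case(obj) -> str:
    """Model a fallback that turns instance attributes into struct fields."""
    # Assumption: fields are taken from the object's attributes, sorted by name.
    parts = []
    for name, attr in sorted(vars(obj).items()):
        type_name = "binary" if isinstance(attr, (bytes, bytearray)) else "unknown"
        parts.append(f"{name}:{type_name}")
    return "struct<" + ",".join(parts) + ">"


print(infer_without_variant_case(VariantVal(b"\x0c\x01", b"\x01\x00\x00")))
# prints: struct<metadata:binary,value:binary>
```

Under this model, nothing distinguishes a `VariantVal` from any other object carrying two binary attributes, which is why an explicit case is needed when no schema is supplied.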
### Does this PR introduce _any_ user-facing change?
Yes. It fixes the bug described above: `VariantVal` objects are now inferred as `VariantType` when no schema is passed explicitly.
### How was this patch tested?
Added a test in `python/pyspark/sql/tests/test_types.py` that checks that the inferred type is `VariantType` and that the `VariantVal` has the correct `value` and `metadata` after inference.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#51065 from austinrwarner/SPARK-52355.
Authored-by: Austin Warner <austin.richard.warner@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>