Commit b923243

austinrwarner authored and yhuang-db committed
[SPARK-52355][PYTHON] Infer VariantVal object type as VariantType when creating a DataFrame
### What changes were proposed in this pull request?

When creating a `DataFrame` from Python using `spark.createDataFrame`, infer the type of any `VariantVal` objects as `VariantType`. This is implemented by adding a case mapping `VariantVal` to `VariantType` in the `pyspark.sql.types._infer_type` function.

### Why are the changes needed?

Currently, when creating a `DataFrame` that includes locally instantiated `VariantVal` objects in Python, the type is inferred as `struct<metadata:binary,value:binary>` rather than `VariantType`. This leads to unintended behavior when creating a `DataFrame` locally, or in situations such as `df.rdd.map(...).toDF`, which calls `createDataFrame` under the hood. The bug only occurs when the schema of the `DataFrame` is not passed explicitly.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the bug described above.

### How was this patch tested?

Added a test in `python/pyspark/sql/tests/test_types.py` that checks that the inferred type is `VariantType`, and that the `VariantVal` has the correct `value` and `metadata` after inference.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51065 from austinrwarner/SPARK-52355.

Authored-by: Austin Warner <austin.richard.warner@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent ecb181e · commit b923243

File tree

2 files changed: +21 −0 lines changed

python/pyspark/sql/tests/test_types.py

Lines changed: 19 additions & 0 deletions

```diff
@@ -477,6 +477,25 @@ def test_infer_map_pair_type_with_nested_maps(self):
             df.first(),
         )
 
+    def test_infer_variant_type(self):
+        # SPARK-52355: Test inferring variant type
+        value = VariantVal.parseJson('{"a": 1}')
+
+        data = [Row(f1=value)]
+        df = self.spark.createDataFrame(data)
+        actual = df.first()["f1"]
+
+        self.assertEqual(type(df.schema["f1"].dataType), VariantType)
+        # As of writing VariantVal can also include bytearray
+        self.assertEqual(
+            bytes(actual.value),
+            bytes(value.value),
+        )
+        self.assertEqual(
+            bytes(actual.metadata),
+            bytes(value.metadata),
+        )
+
     def test_create_dataframe_from_dict_respects_schema(self):
         df = self.spark.createDataFrame([{"a": 1}], ["b"])
         self.assertEqual(df.columns, ["b"])
```

python/pyspark/sql/types.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -2307,6 +2307,8 @@ def _infer_type(
             errorClass="UNSUPPORTED_DATA_TYPE",
             messageParameters={"data_type": f"array({obj.typecode})"},
         )
+    elif isinstance(obj, VariantVal):
+        return VariantType()
     else:
         try:
             return _infer_schema(
```
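The two-line patch works because `_infer_type` is an `isinstance`-dispatch chain: the new `VariantVal` branch must be checked before the generic fallback that would otherwise infer a `struct<metadata:binary,value:binary>` from the object's fields. Below is a minimal, self-contained sketch of that dispatch pattern; the `VariantVal`, `VariantType`, and other classes here are simplified stand-in stubs, not the real PySpark implementations.

```python
# Sketch of the isinstance-dispatch pattern used by a type-inference
# function like pyspark.sql.types._infer_type. All classes are stubs
# for illustration only -- NOT the real PySpark types.

class DataType: ...
class LongType(DataType): ...
class StringType(DataType): ...
class VariantType(DataType): ...

class VariantVal:
    """Stub mirroring VariantVal's binary value/metadata fields."""
    def __init__(self, value: bytes, metadata: bytes):
        self.value = value
        self.metadata = metadata

def infer_type(obj) -> DataType:
    if isinstance(obj, int):
        return LongType()
    elif isinstance(obj, str):
        return StringType()
    elif isinstance(obj, VariantVal):
        # The fix: match VariantVal explicitly so it does not fall
        # through to generic struct inference over its attributes.
        return VariantType()
    else:
        raise TypeError(f"unsupported type: {type(obj).__name__}")

print(type(infer_type(VariantVal(b"\x01", b"\x00"))).__name__)
print(type(infer_type(42)).__name__)
```

The ordering matters only relative to the fallback branch: any case placed after a catch-all would never fire, which is why the real patch inserts the `elif` just before the final `else`.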
