Description
TL;DR:
When running the plugin with Spark 4+, if a Parquet file is being read with a read-schema that contains wider types than the Parquet file's schema, the read should not fail.
Details:
This is with reference to apache/spark#44368. Spark 4 has the ability to read Parquet files where the read-schema uses wider types than the write-schema in the file.
For instance, a Parquet file with an `Integer` column `a` should be readable with a read-schema that defines `a` as having type `Long`.
Prior to Spark 4, this would yield a `SchemaColumnConvertNotSupportedException` on both Apache Spark and the plugin. After apache/spark#44368, if the read-schema uses a wider, compatible type, there is an implicit conversion to the wider data type during the read. An incompatible type continues to fail as before.
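For illustration, here is a minimal PySpark sketch of the widening read described above (the path `/tmp/int_data` and the column name `a` are made up for the example):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.getOrCreate()

# Write a Parquet file whose column 'a' is stored as Integer.
spark.range(10).selectExpr("CAST(id AS INT) AS a") \
    .write.mode("overwrite").parquet("/tmp/int_data")

# Read it back with a schema that widens 'a' to Long. On Spark 4+ the
# values are implicitly converted to the wider type; on Spark 3.x the
# vectorized Parquet reader raises SchemaColumnConvertNotSupportedException.
read_schema = StructType([StructField("a", LongType())])
spark.read.schema(read_schema).parquet("/tmp/int_data").show()
```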
`spark-rapids`'s `parquet_test.py::test_parquet_check_schema_compatibility` integration test currently looks as follows:
```python
def test_parquet_check_schema_compatibility(spark_tmp_path):
    data_path = spark_tmp_path + '/PARQUET_DATA'
    gen_list = [('int', int_gen), ('long', long_gen), ('dec32', decimal_gen_32bit)]
    with_cpu_session(lambda spark: gen_df(spark, gen_list).coalesce(1).write.parquet(data_path))

    read_int_as_long = StructType(
        [StructField('long', LongType()), StructField('int', LongType())])
    assert_gpu_and_cpu_error(
        lambda spark: spark.read.schema(read_int_as_long).parquet(data_path).collect(),
        conf={},
        error_message='Parquet column cannot be converted')
```
Spark 4's change in behaviour causes this test to fail as follows:
"""
> with pytest.raises(Exception) as excinfo:
E Failed: DID NOT RAISE <class 'Exception'>
../../../../integration_tests/src/main/python/asserts.py:650: Failed
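One way the test could be adapted is to keep the error expectation on pre-4.0 Sparks and instead assert matching CPU/GPU results on Spark 4+. This is only a sketch: it assumes the plugin implements the same widening behaviour as Spark 4, and the version guard `is_before_spark_400` is an assumed name that may not exist in the test helpers.

```python
# Hypothetical sketch; is_before_spark_400 is an assumed helper name.
def test_parquet_check_schema_compatibility(spark_tmp_path):
    data_path = spark_tmp_path + '/PARQUET_DATA'
    gen_list = [('int', int_gen), ('long', long_gen), ('dec32', decimal_gen_32bit)]
    with_cpu_session(lambda spark: gen_df(spark, gen_list).coalesce(1).write.parquet(data_path))

    read_int_as_long = StructType(
        [StructField('long', LongType()), StructField('int', LongType())])
    if is_before_spark_400():
        # Spark 3.x: widening the read-schema fails on CPU and GPU alike.
        assert_gpu_and_cpu_error(
            lambda spark: spark.read.schema(read_int_as_long).parquet(data_path).collect(),
            conf={},
            error_message='Parquet column cannot be converted')
    else:
        # Spark 4+: the int column is implicitly widened to long on read,
        # so CPU and GPU should both succeed and agree on the results.
        assert_gpu_and_cpu_are_equal_collect(
            lambda spark: spark.read.schema(read_int_as_long).parquet(data_path))
```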