
[BUG] Support for wider types in read schemas for Parquet Reads #11512

@mythrocks

Description


TL;DR:

When running the plugin on Spark 4+, reading a Parquet file with a read-schema whose types are wider than (but compatible with) the file's schema should not fail.

Details:

This is with reference to apache/spark#44368. Spark 4 has the ability to read Parquet files where the read-schema uses wider types than the write-schema in the file.

For instance, a Parquet file with an `INT` column `a` should be readable with a read-schema that defines `a` as `LONG`.

Prior to Spark 4, this would raise a `SchemaColumnConvertNotSupportedException` on both Apache Spark and the plugin. After apache/spark#44368, if the read-schema uses a wider, compatible type, the values are implicitly converted to the wider data type during the read; an incompatible type continues to fail as before.
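For illustration, here is a minimal PySpark sketch of the widened read described above (the path and column name are illustrative, not taken from the issue):

```python
# Write a Parquet file with an INT column, then read it back with a
# read-schema that declares the column as LONG. On Spark 4+ this read
# succeeds via implicit widening; on earlier Spark versions the
# vectorized reader fails with a "Parquet column cannot be converted"
# error (SchemaColumnConvertNotSupportedException).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.getOrCreate()

path = "/tmp/widening_demo"  # illustrative path
spark.range(10).selectExpr("CAST(id AS INT) AS a") \
    .write.mode("overwrite").parquet(path)

# Read-schema declares column 'a' as LONG, wider than the INT in the file.
read_schema = StructType([StructField("a", LongType())])
spark.read.schema(read_schema).parquet(path).show()
```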

The spark-rapids integration test `parquet_test.py::test_parquet_check_schema_compatibility` currently looks as follows:

```python
# Names such as int_gen, gen_df, with_cpu_session, and
# assert_gpu_and_cpu_error come from the integration-test framework
# (data_gen.py, spark_session.py, asserts.py); StructType, StructField,
# and LongType come from pyspark.sql.types.
def test_parquet_check_schema_compatibility(spark_tmp_path):
    data_path = spark_tmp_path + '/PARQUET_DATA'
    gen_list = [('int', int_gen), ('long', long_gen), ('dec32', decimal_gen_32bit)]
    with_cpu_session(lambda spark: gen_df(spark, gen_list).coalesce(1).write.parquet(data_path))

    read_int_as_long = StructType(
        [StructField('long', LongType()), StructField('int', LongType())])
    assert_gpu_and_cpu_error(
        lambda spark: spark.read.schema(read_int_as_long).parquet(data_path).collect(),
        conf={},
        error_message='Parquet column cannot be converted')
```

Spark 4's change in behaviour causes this test to fail as follows:

        """
>       with pytest.raises(Exception) as excinfo:
E       Failed: DID NOT RAISE <class 'Exception'>

../../../../integration_tests/src/main/python/asserts.py:650: Failed
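One plausible direction for adapting the test, sketched under the assumption that a Spark-version predicate like `is_spark_400_or_later` is available in the test framework (the helper name here is hypothetical), is to keep the error assertion for pre-4.0 Spark and assert matching CPU/GPU results on 4.0+:

```python
# Hypothetical sketch of a version-gated test. is_spark_400_or_later is an
# assumed helper; assert_gpu_and_cpu_are_equal_collect is the existing
# integration-test assertion that CPU and GPU runs produce the same rows.
def test_parquet_check_schema_compatibility(spark_tmp_path):
    data_path = spark_tmp_path + '/PARQUET_DATA'
    gen_list = [('int', int_gen), ('long', long_gen), ('dec32', decimal_gen_32bit)]
    with_cpu_session(lambda spark: gen_df(spark, gen_list).coalesce(1).write.parquet(data_path))

    read_int_as_long = StructType(
        [StructField('long', LongType()), StructField('int', LongType())])

    if is_spark_400_or_later():
        # Spark 4+ widens INT to LONG implicitly, so the read should
        # succeed and the plugin should match the CPU result.
        assert_gpu_and_cpu_are_equal_collect(
            lambda spark: spark.read.schema(read_int_as_long).parquet(data_path))
    else:
        assert_gpu_and_cpu_error(
            lambda spark: spark.read.schema(read_int_as_long).parquet(data_path).collect(),
            conf={},
            error_message='Parquet column cannot be converted')
```

The 4.0+ branch of this sketch also presupposes that the plugin's GPU Parquet read path performs the same widening conversion as Spark, which is the substance of this issue.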
