Reading Parquet files in Spark Iceberg with TIMESTAMP_MILLIS parquet type causes ClassCastException #14430

@ayingsf

Description

Apache Iceberg version

1.6.1

Query engine

Spark

Please describe the bug 🐞

Issue Summary

  • Spark version 3.5.5
  • Iceberg-spark-runtime-3.5_2.12 version 1.6.1

I'm getting an error when reading an Iceberg table whose parquet files contain timestamp fields backed by the Parquet TIMESTAMP_MILLIS type. The error is:

java.lang.ClassCastException: class org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroTZVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector (org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroTZVector and org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector are in unnamed module of loader 'app')

Root cause seems to be that Iceberg expects a BigIntVector in its vectorized Arrow reader, but the column vector actually created is a TimeStampMicroTZVector.

The column vector created (via https://github.com/apache/arrow-java/blob/main/vector/src/main/java/org/apache/arrow/vector/types/pojo/FieldType.java#L107) inherits the Arrow type, which Iceberg itself defines via ArrowSchemaUtil to always have microsecond precision. Hence the column vector will always be a TimeStampMicroTZVector.

When the underlying parquet file has the TIMESTAMP_MICROS data type, it takes this path instead, which properly casts to TimeStampMicroTZVector (or the NTZ variant, depending on Iceberg metadata).
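The failure mode can be modeled in plain Java without Arrow: a supertype reference holds one sibling subtype while the reader unconditionally casts to another. This is only an illustrative sketch; the inner classes below mimic the Arrow vector hierarchy and are not the real Arrow classes.

```java
public class CastDemo {
    // Hypothetical stand-ins for the Arrow vector hierarchy,
    // for illustration only.
    static class FieldVector {}
    static class BigIntVector extends FieldVector {}
    static class TimeStampMicroTZVector extends FieldVector {}

    static String attemptRead() {
        // Arrow allocates the vector from the Iceberg-derived schema,
        // which always carries microsecond timestamp precision:
        FieldVector vec = new TimeStampMicroTZVector();
        try {
            // The reader path for TIMESTAMP_MILLIS data expects a plain
            // long vector and casts unconditionally:
            BigIntVector longs = (BigIntVector) vec;
            return "ok: " + longs;
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        System.out.println(attemptRead());
    }
}
```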

The same read path produces no errors when vectorization is turned off via the config below:

spark.sql.iceberg.vectorization.enabled=false

I don't see this logic changing in the latest version of Iceberg, so the issue likely still exists there. Why does the vectorized reader expect a Long-typed vector?
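For context on what a fix would have to handle: TIMESTAMP_MILLIS values only need a unit up-conversion to match Iceberg's microsecond timestamps. A hedged sketch of that conversion in plain Java (the method name is hypothetical, not part of any Iceberg API):

```java
public class MillisToMicros {
    // Up-convert an epoch-millis value (Parquet TIMESTAMP_MILLIS) to
    // epoch-micros (Iceberg's timestamp precision). multiplyExact
    // throws ArithmeticException rather than silently overflowing.
    static long toMicros(long epochMillis) {
        return Math.multiplyExact(epochMillis, 1_000L);
    }

    public static void main(String[] args) {
        System.out.println(toMicros(1_700_000_000_123L));
    }
}
```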

Repro

There was a similar issue that reproduced the above ClassCastException: #14046

In general, if we generate a Parquet dataset in Spark with a timestamp field of millisecond precision, add the parquet files to an Iceberg table, and then read the table via Spark, the above error surfaces.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Metadata

Labels: bug (Something isn't working)