Incorrect schema used when using time-travel #11162

Closed as not planned
@ghost

Description

Apache Iceberg version

1.5.0

Query engine

Spark

Please describe the bug 🐞

When using time travel to retrieve a previous version of a table via a snapshot ID, the table's current schema is used instead of the snapshot's schema, contrary to the documentation.

Reproduction code:

# spark_session is an existing SparkSession configured with an Iceberg catalog

# Create the table
spark_session.sql("CREATE TABLE iceberg_test (id bigint, data string, col float)")

# Populate the table
spark_session.sql("INSERT INTO iceberg_test VALUES (1, 'a', 1.0), (2, 'b', 2.0), (3, 'c', 3.0)")

# Rename 'col' to 'value'
spark_session.sql("ALTER TABLE iceberg_test RENAME COLUMN col TO value")

# Insert a new row
spark_session.sql("INSERT INTO iceberg_test VALUES (4, 'd', 4.0)")

# Time-travel to the first snapshot_id provided by iceberg_test.snapshots
snapshot_1 = spark_session.sql("SELECT * FROM iceberg_test VERSION AS OF <INSERT SNAPSHOT ID>")

# Operation on the renamed field (named 'col' in the first snapshot)
snapshot_1.filter("col == 2.0").show()

We end up with the following error:

Py4JJavaError: An error occurred while calling o111.showString.
: org.apache.iceberg.exceptions.ValidationException: Cannot find field 'col' in struct: struct<1: id: optional long, 2: data: optional string, 3: value: optional float>

NOTES:

  • snapshot_1.printSchema() would confirm that the field is named col (the name in the first snapshot), not value (the name in the latest snapshot)
  • The error also occurs when using the Spark DataFrame API
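
For anyone reproducing this, here is a minimal sketch of how the first snapshot ID could be looked up and how the same read might be expressed through the DataFrame API. The helper names (first_snapshot_query, time_travel_query, read_first_snapshot) are hypothetical and not part of the original report. The sketch assumes the table's `snapshots` metadata table is reachable as `<table>.snapshots` and that the DataFrame read uses Iceberg's `snapshot-id` read option.

```python
def first_snapshot_query(table):
    """SQL returning the oldest snapshot ID from the table's `snapshots` metadata table."""
    return f"SELECT snapshot_id FROM {table}.snapshots ORDER BY committed_at ASC LIMIT 1"


def time_travel_query(table, snapshot_id):
    """SQL form of the time-travel read used in the reproduction."""
    return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"


def read_first_snapshot(spark_session, table, use_dataframe_api=False):
    """Read `table` as of its oldest snapshot (hypothetical helper).

    `spark_session` is assumed to be a SparkSession with an Iceberg catalog.
    """
    row = spark_session.sql(first_snapshot_query(table)).first()
    snapshot_id = row["snapshot_id"]
    if use_dataframe_api:
        # DataFrame API variant: pass the snapshot via the `snapshot-id` read option
        return (
            spark_session.read
            .option("snapshot-id", snapshot_id)
            .format("iceberg")
            .load(table)
        )
    # SQL variant: VERSION AS OF <snapshot id>
    return spark_session.sql(time_travel_query(table, snapshot_id))
```

With these helpers, read_first_snapshot(spark_session, "iceberg_test").filter("col == 2.0").show() corresponds to the failing SQL case above, and passing use_dataframe_api=True corresponds to the DataFrame API case mentioned in the notes.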

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Metadata

Assignees

No one assigned

Labels

bug, stale
