[SPARK-53732][SQL] Remember TimeTravelSpec in DataSourceV2Relation #52599
Conversation
  try f finally { set(originContext) }
}

private[sql] def withAnalysisContext[A](context: AnalysisContext)(f: => A): A = {
Needed for testing below.
case table =>
  if (isStreaming) {
    assert(timeTravelSpec.isEmpty, "time travel is not allowed in streaming")
It should be impossible to reach this line with a valid time travel spec. Just a sanity check.
-case class AsOfTimestamp(timestamp: Long) extends TimeTravelSpec
-case class AsOfVersion(version: String) extends TimeTravelSpec
+case class AsOfTimestamp(timestamp: Long) extends TimeTravelSpec {
+  override def toString: String = s"TIMESTAMP AS OF $timestamp"
Needed for a proper simpleString implementation in DataSourceV2Relation. See tests below.
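For illustration, a rough sketch of what the new overrides print (AsOfVersion's body is truncated in this hunk, so its output string is my assumption):

  // Assuming the toString overrides from this diff.
  val byTimestamp = AsOfTimestamp(1696500000000000L)
  println(byTimestamp) // TIMESTAMP AS OF 1696500000000000
  val byVersion = AsOfVersion("42")
  println(byVersion)   // presumably: VERSION AS OF 42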
 override def simpleString(maxFields: Int): String = {
-  s"RelationV2${truncatedString(output, "[", ", ", "]", maxFields)} $name"
+  val outputString = truncatedString(output, "[", ", ", "]", maxFields)
Covered with tests.
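For context, a hedged sketch of what the full method presumably looks like after this change (the exact suffix formatting is an assumption; the tests show the real output):

  override def simpleString(maxFields: Int): String = {
    val outputString = truncatedString(output, "[", ", ", "]", maxFields)
    // Appends e.g. " VERSION AS OF 42" only when the relation time travels.
    val timeTravelString = timeTravelSpec.map(spec => s" $spec").getOrElse("")
    s"RelationV2$outputString $name$timeTravelString"
  }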
  }
}

def pinTable(ident: Identifier, version: String): Unit = {
Used in time travel tests.
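A hypothetical usage, just to show the intent (catalog, table, and rows are made up):

  // Freeze the table's current state under version "v1" so that
  // VERSION AS OF can be exercised deterministically.
  catalog.pinTable(Identifier.of(Array("ns"), "tbl"), "v1")
  sql("INSERT INTO testcat.ns.tbl VALUES (2, 'b')")
  // The pinned version still sees only the original row.
  checkAnswer(sql("SELECT * FROM testcat.ns.tbl VERSION AS OF 'v1'"), Row(1, "a"))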
     cascade: Boolean,
     blocking: Boolean): Unit = {
-  uncacheByCondition(spark, _.sameResult(plan), cascade, blocking)
+  EliminateSubqueryAliases(plan) match {
Added this branch to avoid changing the behavior. Some connectors (like Iceberg) use these methods for their custom commands, so I wanted to be on the safer side and keep the old behavior for those calls. That is, if any of these methods is called with a DataSourceV2Relation without a time travel spec, we will invalidate all cache entries (including time travel ones), like before.
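Roughly, the branching I mean (the helper name is illustrative, not the exact code):

  EliminateSubqueryAliases(plan) match {
    case r: DataSourceV2Relation if r.timeTravelSpec.isEmpty =>
      // No time travel spec: keep the pre-PR behavior and invalidate every
      // cache entry for this table, time travel entries included.
      uncacheByCondition(spark, c => referencesSameTable(c, r), cascade, blocking)
    case _ =>
      // Otherwise require an exact plan match, so uncaching version 1 does
      // not drop a cached scan of version 2.
      uncacheByCondition(spark, _.sameResult(plan), cascade, blocking)
  }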
So when r.timeTravelSpec.isEmpty, plan1.sameResult(plan2) won't work with DSv2? Could you add a comment in the code?
/**
 * Re-caches all cache entries that reference the given table name.
 */
def recacheByTableName(
I will need this in subsequent PRs.
Thank you for pinging me, @aokolnychyi.
     identifier: Option[Identifier],
-    options: CaseInsensitiveStringMap)
+    options: CaseInsensitiveStringMap,
+    timeTravelSpec: Option[TimeTravelSpec] = None)
I think it's nice to have this field so that Spark is explicitly aware of the version of the table. But I don't quite understand why it's necessary, as the implementation can remember the time travel spec in the v2 Table returned by loadTable with a time travel spec. The v2 Table#currentVersion can then be used to get the table version explicitly.
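For reference, a minimal sketch of that alternative (treat the details as a sketch of the DSv2 time-travel loadTable overload):

  // The connector returns a Table pinned to the requested version...
  val table = catalog.loadTable(ident, version)
  // ...and reports which version it actually resolved to.
  val resolvedVersion = table.currentVersion()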
One of the use cases that both Iceberg and Delta struggle with today is checking that a query uses consistent versions of the table throughout the plan. Having currentVersion is one step, but we also need to distinguish time travel, as it is OK to have different versions in that case. I want Spark to handle these checks and also reload tables to consistent versions whenever that's needed (will be done in subsequent PRs). Today, both Iceberg and Delta try to implement this check/reload on their side, but it is really tricky in connectors. There are still unhandled edge cases.
Another use case that is even bigger is tracking read sets in DELETE, UPDATE, and MERGE. I have a proposal/PR about a transactional catalog that allows one to capture all operations that happen during an operation for snapshot and serializable isolation. It is also important to track and distinguish time travel there.
Does this make sense?
The third use case would be views that capture logical plans. They currently rely on refresh tricks in connectors. I want to simplify/fix that by moving the refresh to Spark so that DSv2 connectors can pin versions correctly.
24133d5 to 28d1e4a
case class AsOfTimestamp(timestamp: Long) extends TimeTravelSpec {
  override def toString: String = s"TIMESTAMP AS OF $timestamp"
}
case class AsOfVersion(version: String) extends TimeTravelSpec {
nit: add a blank line between two classes
  cachedData.isEmpty
}

private[sql] def numCachedEntries: Int = {
shall we add a comment "// Test-only"?
def isEmpty: Boolean = {
  cachedData.isEmpty
}
The changes in this file are quite confusing. Before this PR, DataSourceV2Relation did not contain the time travel spec, but the Table instance should contain it. This means that if a table scan is cached with version 1, and we then uncache the same table scan but with version 2, we won't uncache the version 1 scan. Now we put the time travel spec in DataSourceV2Relation, which makes this behavior more reliable in case the Table instance does not contain the version. I don't quite understand what we are trying to do here. If we want to keep the old behavior, we can simply clear out the time travel spec in DataSourceV2Relation#canonicalized.
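Something like this sketch (assuming DataSourceV2Relation can override QueryPlan#doCanonicalize; not the PR's actual code):

  override def doCanonicalize(): LogicalPlan = {
    // Clearing the spec makes a time-travel scan canonically equal to a
    // current-version scan of the same table, restoring the old cache matching.
    super.doCanonicalize() match {
      case r: DataSourceV2Relation => r.copy(timeTravelSpec = None)
      case other => other
    }
  }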
What changes were proposed in this pull request?
This PR adds TimeTravelSpec to DataSourceV2Relation when the relation is created by time traveling.

Why are the changes needed?
These changes are needed for subsequent PRs, where I will modify Spark to reload certain tables to ensure consistent version scanning and DELETE, UPDATE, and MERGE isolation. Without this change, Spark loses track of whether a relation points to the current version of the table or is the result of time travel. As an engine, Spark must be able to tell the two apart.
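A hypothetical illustration of the consistency point (table name and version are made up):

  // Scan `b` is explicit time travel, scan `a` is not. With timeTravelSpec on
  // DataSourceV2Relation, Spark can tell them apart and, in follow-up PRs,
  // align only the non-time-travel scans to a single, consistent version.
  val df = spark.sql(
    """SELECT *
      |FROM testcat.ns.tbl a
      |JOIN testcat.ns.tbl VERSION AS OF '42' b
      |  ON a.id = b.id""".stripMargin)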
Does this PR introduce any user-facing change?
No.
How was this patch tested?
This PR comes with tests.
Was this patch authored or co-authored using generative AI tooling?
No.