[SPARK-53891][SQL] Model DSV2 Commit Operation Metrics API #52595
Conversation
sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/BatchWrite.java (outdated thread, resolved)
sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/MergeOperationMetrics.java (outdated thread, resolved)
 * @since 4.1.0
 */
@Evolving
public interface OperationMetrics {
This naming seems reasonable to me but if folks have better ideas, it would be great to hear them.
cc @cloud-fan @viirya @gengliangwang @dongjoon-hyun @huaxingao
+1 for the AS-IS name, OperationMetrics.
If it is specific to write, maybe WriteMetrics? OperationMetrics also looks okay.
I think the word operation is reasonable. An alternative would be to call it OperationSummary to distinguish it from regular metrics, since we may pass some String values in the future too, not just counts. That said, I am not sure calling it XXXSummary would make it any better.
What do you think, @szehon-ho @viirya @dongjoon-hyun?
Or OperationMetadata?
I had the same thought when reviewing this PR: why not just use WriteMetrics/WriterMetrics? It seems more straightforward and specific.
ok
Renamed it to WriterMetrics. There is already a writeMetric in the class V2ExistingTableWriteExec.
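For readers following the naming thread, here is a minimal sketch of the resulting shape, written as Scala traits purely for illustration (the PR defines these as Java interfaces, and the member names come from the hunks quoted below):

// Illustrative only: a base metrics type plus an operation-specific sub-type,
// mirroring the WriterMetrics / MergeMetrics names settled on in this review.
trait WriterMetrics

trait MergeMetrics extends WriterMetrics {
  // counters quoted elsewhere in this PR; -1 may be reported when a value is unknown
  def numTargetRowsCopied(): Long
  def numTargetRowsDeleted(): Long
}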
/**
 * Returns the number of target rows copied unmodified because they did not match any action.
 */
long numTargetRowsCopied();
Seems like we will use -1 if unknown. Shall we document this?
I thought these always have at least 0 if we find MergeRowsExec, i.e. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/MergeRowsExec.scala#L49, which initializes them to 0.
It is true that I set -1 in V2TableWriteExec::getOperationMetrics, but I thought that path would not normally be hit.
Makes sense. Do we have tests ensuring 1/1 match?
We have tests that assert all metrics are set to 0 if not used, if that is the question?
collectFirst(query) { case m: MergeRowsExec => m }.map { n =>
  val metrics = n.metrics
  MergeOperationMetricsImpl(
    metrics.get("numTargetRowsCopied").map(_.value).getOrElse(-1L),
Can we add constants for these in a separate PR? It seems fragile.
ok sure
 * @since 4.1.0
 */
@Evolving
public interface OperationMetrics {
I am thinking about how to simplify consumption of these objects in connectors. The question is whether this interface should have some sort of operation() method that would tell the type of metrics. That said, it is probably not the end of the world if connectors do a class check.
Thoughts anyone?
Would it be a bit duplicated if we had both a class and an enum?
I feel like using a proper object is the right call here compared to the map. Left some questions. Would love to hear what others think too.
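To make the trade-off concrete, here is a hedged sketch of the alternative floated above: an explicit discriminator on the base interface instead of a class check. This is not what the PR does; the names and the enum are hypothetical, and it also illustrates the duplication concern raised just above (both a sub-type and an enum value per operation):

// Hypothetical alternative, not in the PR: a discriminator so connectors can
// branch on the operation type without pattern-matching on concrete classes.
object WriteOperation extends Enumeration {
  val Merge, Update, Delete, Append = Value
}

trait DiscriminatedMetrics {
  def operation: WriteOperation.Value  // tells the connector which sub-type to expect
}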
 * @since 4.1.0
 */
@Evolving
public interface MergeOperationMetrics extends OperationMetrics {
Maybe MergeMetrics? Not a strong opinion.
Sounds ok to me too
Yeah, I generally prefer shorter names if they are descriptive. I would probably lean towards MergeMetrics here.
+1 for MergeMetrics
changed
/**
 * Implementation of {@link MergeOperationMetrics} that provides merge operation metrics.
 */
private[sql] case class MergeOperationMetricsImpl(
Wondering why we don't simply have MergeOperationMetrics as an implementation class, rather than a separate interface + impl here? Are we supposed to have different MergeOperationMetrics impls?
I chatted with @aokolnychyi offline yesterday; he suggested this approach to hide the constructor from the API, like LogicalWriteInfoImpl. Otherwise we would have to expose a public constructor in the public MergeOperationMetrics Java API so Spark can construct it, and we might then need a builder to support adding new metrics, etc.
Okay, makes sense.
@cloud-fan @gengliangwang, I would like you folks to take a look as well, if possible.
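For context, a minimal sketch of the interface-plus-hidden-impl pattern described above, in the spirit of LogicalWriteInfoImpl. The names and package here are hypothetical; the real PR pairs a public Java interface with a private[sql] Scala case class:

package org.apache.spark.sql.example  // hypothetical location; private[sql] needs an enclosing sql package

// Illustration only: the public surface is the trait, while the constructor lives on a
// case class private to Spark's sql package, so connectors can read the metrics but
// cannot construct the impl or depend on it.
trait ExampleMergeMetrics {
  def numTargetRowsCopied: Long
}

private[sql] case class ExampleMergeMetricsImpl(
    numTargetRowsCopied: Long) extends ExampleMergeMetrics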
metrics.asScala.map {
  case (key, value) => commitProperties += key -> String.valueOf(value)
override def commit(messages: Array[WriterCommitMessage], metrics: OperationMetrics): Unit = {
  metrics match {
Shall we have a test case to verify the new metrics?
Oh, these are returned and checked via existing test cases: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala#L1816
Were you thinking of something else?
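As a companion to the hunk quoted above, here is a hedged sketch of how an in-memory test connector might fold the typed metrics back into commit properties, much as the old map-based code did. The commitProperties field, the delegation to a single-argument commit, and the fact that the class does not extend the real BatchWrite interface are simplifications for illustration; the interface names follow the hunks quoted in this thread (they are renamed to WriterMetrics/MergeMetrics later in the review):

import scala.collection.mutable

import org.apache.spark.sql.connector.write.{MergeOperationMetrics, OperationMetrics, WriterCommitMessage}

// Sketch only: record operation-level metrics as string commit properties.
class ExampleBatchWrite {
  private val commitProperties = mutable.Map.empty[String, String]

  def commit(messages: Array[WriterCommitMessage]): Unit = {
    // ... actual commit logic would live here ...
  }

  def commit(messages: Array[WriterCommitMessage], metrics: OperationMetrics): Unit = {
    metrics match {
      case m: MergeOperationMetrics =>
        commitProperties += "numTargetRowsCopied" -> String.valueOf(m.numTargetRowsCopied())
        commitProperties += "numTargetRowsDeleted" -> String.valueOf(m.numTargetRowsDeleted())
      case _ => // writes other than MERGE carry no merge-specific counters
    }
    commit(messages)
  }
}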
/**
 * Returns the number of target rows deleted.
 */
long numTargetRowsDeleted();
If all the metrics start with numTargetRows, shall we just call them numRows?
Oh, we plan to add 'numSourceRows' in a follow-up PR. BTW, these match the current Delta metric names: https://github.com/delta-io/delta/blob/c0943e863aacac1365bd6beaa9f23d6bc9a4f316/spark/src/main/scala/org/apache/spark/sql/delta/commands/merge/MergeStats.scala#L208
What changes were proposed in this pull request?
#51377 added a DataSourceV2 API that sends operation metrics along with the commit, via a map of String to Long. This change models it as a proper object instead.
Suggestion from @aokolnychyi
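As a rough illustration of the shape change (signatures approximate; OperationMetrics is the interface added in this PR, renamed WriterMetrics later in review):

import org.apache.spark.sql.connector.write.{OperationMetrics, WriterCommitMessage}

object CommitShapes {
  // Before (#51377): operation metrics arrived as an untyped map of counter name to value.
  def commitWithMap(
      messages: Array[WriterCommitMessage],
      metrics: java.util.Map[String, java.lang.Long]): Unit = ???

  // After (this PR): metrics arrive as a typed object that connectors can inspect.
  def commitWithModel(
      messages: Array[WriterCommitMessage],
      metrics: OperationMetrics): Unit = ???
}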
Why are the changes needed?
It is cleaner to model this as a proper object: it makes it clearer what metrics Spark sends, and it can handle future cases where metrics may not be long values.
Does this PR introduce any user-facing change?
No; this is an unreleased DSV2 API.
How was this patch tested?
Existing tests
Was this patch authored or co-authored using generative AI tooling?
No