[SPARK-52921][SQL] Specify outputPartitioning for UnionExec for partitioner aware case #51623


Open · wants to merge 10 commits into master

Conversation

@viirya (Member) commented Jul 22, 2025

What changes were proposed in this pull request?

This patch specifies outputPartitioning for the UnionExec operator in the partitioner-aware case, so that the output partitioning can be known.

Why are the changes needed?

Currently, the output partitioning of UnionExec is simply unknown. But if the partitioner is known to be the same for all children RDDs, SparkContext.union produces a PartitionerAwareUnionRDD which reuses that partitioner. In such cases, the output partitioning of UnionExec is actually the same as that of its children.
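The decision described above can be sketched in plain Scala. `Rdd` and `unionPartitioning` below are illustrative stand-ins for Spark's classes, not actual Spark APIs:

```scala
// Minimal model of the decision: if every non-empty child has the same defined
// partitioner, the union is partitioner-aware and the children's partitioning
// is preserved; otherwise the output partitioning stays unknown (None).
case class Rdd(numPartitions: Int, partitioner: Option[String])

def unionPartitioning(children: Seq[Rdd]): Option[String] = {
  // Empty RDDs carry no partitions, so they don't constrain the result.
  val nonEmpty = children.filter(_.numPartitions > 0)
  if (nonEmpty.nonEmpty &&
      nonEmpty.forall(_.partitioner.isDefined) &&
      nonEmpty.flatMap(_.partitioner).toSet.size == 1) {
    nonEmpty.head.partitioner // partitioner-aware union: partitioning preserved
  } else {
    None // mixed or missing partitioners: output partitioning unknown
  }
}
```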

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

Was this patch authored or co-authored using generative AI tooling?

No

@viirya viirya marked this pull request as draft July 22, 2025 21:35
@@ -1606,11 +1606,15 @@ class SparkContext(config: SparkConf) extends Logging {
new ReliableCheckpointRDD[T](this, path)
}

protected[spark] def isPartitionerAwareUnion[T: ClassTag](rdds: Seq[RDD[T]]): Boolean = {
@dongjoon-hyun (Member) commented Jul 22, 2025:

Could you add a comment about the assumption, rdds.filter(!_.partitions.isEmpty)? Otherwise, it may cause correctness issues later if we use this blindly.

Alternatively, we had better include the assumption inside this method.

@viirya (Member, Author) replied:

Added comment and a check.

private lazy val childrenRDDs = children.map(_.execute())

override def outputPartitioning: Partitioning = {
val nonEmptyRdds = childrenRDDs.filter(!_.partitions.isEmpty)
Member replied:

ditto. We can remove this too if isPartitionerAwareUnion has the logic.

@viirya (Member, Author) replied:

Because SparkContext.union also uses nonEmptyRdds, I didn't move the nonEmptyRdds logic into isPartitionerAwareUnion. I leave it to the callers to pass in non-empty RDDs.

Member replied:

Got it~ Thank you for the explanation.

@dongjoon-hyun (Member) commented:

cc @peter-toth

@@ -1606,11 +1606,17 @@ class SparkContext(config: SparkConf) extends Logging {
new ReliableCheckpointRDD[T](this, path)
}

// Note that input rdds must be all non-empty, i.e., rdds.filter(_.partitions.isEmpty).isEmpty
protected[spark] def isPartitionerAwareUnion[T: ClassTag](rdds: Seq[RDD[T]]): Boolean = {
assert(!rdds.exists(_.partitions.isEmpty), "Must not have empty RDDs")
Member replied:

Nice!

@viirya viirya marked this pull request as ready for review July 22, 2025 23:00
@viirya viirya changed the title [SPARK-XXXXX][SQL] Specify outputPartitioning for UnionExec for partitioner aware case [SPARK-52921][SQL] Specify outputPartitioning for UnionExec for partitioner aware case Jul 22, 2025
@dongjoon-hyun (Member) left a review comment:

+1, LGTM. Thank you, @viirya .

// child operator will be replaced by Spark in query planning later, in other
// words, `execute` won't be actually called on them during the execution of
// this plan. So we can safely return the default partitioning.
case e if NonFatal(e) => super.outputPartitioning
@viirya (Member, Author) commented:

This handles nodes that don't implement the execute method, for the reason described in the comment above.
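The fallback pattern above can be sketched in plain Scala. The `compute` thunk is a hypothetical stand-in for resolving the children RDDs, not a Spark API:

```scala
import scala.util.control.NonFatal

// If computing the partitioning throws a non-fatal error (e.g. a placeholder
// node whose execute() is unsupported and will be replaced during planning),
// fall back to the default partitioning instead of failing query planning.
def partitioningOrDefault(compute: () => String, default: String): String =
  try compute()
  catch { case e if NonFatal(e) => default }
```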

protected[spark] def isPartitionerAwareUnion[T: ClassTag](rdds: Seq[RDD[T]]): Boolean = {
assert(!rdds.exists(_.partitions.isEmpty), "Must not have empty RDDs")
val partitioners = rdds.flatMap(_.partitioner).toSet
rdds.forall(_.partitioner.isDefined) && partitioners.size == 1
@peter-toth (Contributor) commented:

It seems we don't need to build the partitioners set before the forall(_.partitioner.isDefined) check.
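The suggested ordering can be sketched with Option[String] standing in for a partitioner (the helper name is illustrative, not the PR's actual code):

```scala
// Check that every partitioner is defined first (this short-circuits on the
// first None), and only then materialize the set to verify they all match.
def isPartitionerAware(partitioners: Seq[Option[String]]): Boolean =
  partitioners.nonEmpty &&
    partitioners.forall(_.isDefined) &&
    partitioners.flatten.toSet.size == 1
```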

@peter-toth (Contributor) left a review comment:

LGTM, just a minor nit.

@viirya (Member, Author) commented Jul 23, 2025:

Hmm, there are a few test failures, I will take a look.

"default partitioning.")
.version("4.1.0")
.booleanConf
.createWithDefault(true)
@viirya (Member, Author) commented:

For safety, I added an internal config for it.
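The effect of such a flag can be sketched as follows; the function and parameter names are hypothetical, not Spark's actual config machinery:

```scala
// A boolean flag gating the new behavior: when disabled, UnionExec reports
// the default (unknown) partitioning even if its children share a partitioner.
def unionOutputPartitioning(flagEnabled: Boolean,
                            childPartitioning: Option[String]): String =
  if (flagEnabled) childPartitioning.getOrElse("unknown")
  else "unknown"
```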
