[SPARK-48660][SQL] Fix explain result for CreateTableAsSelect #51013


Open: wants to merge 6 commits into master

Conversation

@yuexing (Contributor) commented May 25, 2025

What changes were proposed in this pull request?

To fix the explain result of 'CreateTableAsSelect', I traced how ExplainCommand processes the plan and found that the following work is needed:

  • CreateDataSourceTableAsSelectCommand should implement stats
  • ExecutedCommandExec should also implement innerChildren for CreateDataSourceTableAsSelectCommand

Here's why:

  • For 'CREATE TABLE ... AS SELECT', CreateDataSourceTableAsSelectCommand ultimately becomes the logical plan
  • ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) ultimately becomes the physical plan

Thus, to produce the expected output (a sketch of the two changes follows this list):

  • CreateDataSourceTableAsSelectCommand adds a stats implementation
  • ExecutedCommandExec adds innerChildren so QueryExecution.stringWithStats can render the command's subtree
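
Below is a minimal, self-contained sketch of that mechanism (plain Scala, no Spark dependency; every name here is a simplified stand-in rather than Spark's real API). The stats-aware explain printer walks both children and innerChildren and asks each node for stats, so the command must (a) report stats and (b) expose the query it wraps:

case class Stats(sizeInBytes: BigInt)

trait Node {
  def label: String
  def children: Seq[Node] = Nil
  def innerChildren: Seq[Node] = Nil
  def stats: Stats

  // Rough analogue of the tree rendering in QueryExecution.stringWithStats.
  def render(depth: Int = 0): String = {
    val line = "   " * depth + s"$label, Statistics(sizeInBytes=${stats.sizeInBytes}.0 B)"
    (line +: (innerChildren ++ children).map(_.render(depth + 1))).mkString("\n")
  }
}

case class Relation(label: String, stats: Stats) extends Node

case class CtasCommand(query: Node) extends Node {
  val label = "CreateDataSourceTableAsSelectCommand"
  override def innerChildren: Seq[Node] = query :: Nil // change 2: expose the subtree
  override def stats: Stats = query.stats              // change 1: delegate stats
}

object ExplainSketch extends App {
  // Prints the command line followed by its inner query, with statistics.
  println(CtasCommand(Relation("Relation source_table", Stats(1))).render())
}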

Why are the changes needed?

As reported in SPARK-48660, the explain result for 'CreateTableAsSelect' is incorrect: it is missing statistics in the logical plan and optimization of the physical plan.

Does this PR introduce any user-facing change?

Yes, it affects the explain result.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label May 25, 2025
@HyukjinKwon HyukjinKwon changed the title SPARK-48660 fix explain result for CreateTableAsSelect [SPARK-48660][SQL] Fix explain result for CreateTableAsSelect May 26, 2025
@LuciferYang (Contributor) commented:

cc @wangyum FYI

@LuciferYang (Contributor) commented:

Could you fix the failing tests first? @yuexing

@@ -414,9 +414,20 @@ case class WriteDelta(
 trait V2CreateTableAsSelectPlan
   extends V2CreateTablePlan
   with AnalysisOnlyCommand
-  with CTEInChildren {
+  with CTEInChildren
+  with ExecutableDuringAnalysis {
@LuciferYang (Contributor) commented May 26, 2025:

@yuexing Does this pull request only address the scenario for DataSource V2? Is there no such issue for V1? Or was the fix for V1 omitted?

@yuexing (Contributor, Author) replied:

@LuciferYang I'm taking another look now. This fix is most likely not right, since it replaces the whole structure of CreateTableAsSelect in both the logical and physical views.

@yuexing (Contributor, Author) replied:

I'm trying something else now: I implemented CreateDataSourceTableAsSelectCommand.stats and ExecutedCommandExec.innerChildren. Here's how I understand the code:

  • For 'CREATE TABLE ... AS SELECT', CreateDataSourceTableAsSelectCommand ultimately becomes the logical plan
  • ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) ultimately becomes the physical plan

Thus, to produce the expected output:

  • CreateDataSourceTableAsSelectCommand adds a stats implementation
  • ExecutedCommandExec adds innerChildren so QueryExecution.stringWithStats can render the command's subtree

@yuexing (Contributor, Author) replied:

and it becomes something like this:

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, [id, name]
   +- Project [id#16, name#17], Statistics(sizeInBytes=1.0 B)
      +- Filter (id#16 > 0), Statistics(sizeInBytes=1.0 B)
         +- SubqueryAlias spark_catalog.default.source_table, Statistics(sizeInBytes=1.0 B)
            +- Relation spark_catalog.default.source_table[id#16,name#17] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, [id, name]
   +- *(1) Filter (isnotnull(id#16) AND (id#16 > 0))
      +- *(1) ColumnarToRow
         +- FileScan parquet spark_catalog.default.source_table[id#16,name#17] Batched: true, DataFilters: [isnotnull(id#16), (id#16 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yuexing/playground/spark/sql/core/spark-warehouse/org.apac..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>
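
For reference, output like this can be produced with EXPLAIN COST over a CTAS. A hypothetical reproduction inferred from the plan above (assumes a spark-shell session where spark is in scope; this is not the PR's test code):

// Hypothetical reproduction, inferred from the plan output above.
spark.sql("CREATE TABLE source_table (id INT, name STRING) USING parquet")
spark.sql(
  """EXPLAIN COST
    |CREATE TABLE target_table USING parquet
    |AS SELECT id, name FROM source_table WHERE id > 0
    |""".stripMargin).show(truncate = false)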

@@ -300,4 +300,38 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSparkSession {
         stop = 57))
     }
   }
+
+  test("SPARK-48660: EXPLAIN COST should show statistics") {
@LuciferYang (Contributor) commented:

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
      +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
         +- Filter (dt#15 >= 2023-11-29)
            +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
               +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand
   +- CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
         +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
            +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
               +- Filter (dt#15 >= 2023-11-29)
                  +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
                     +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet

From the cases provided by @wangyum, I believe there are two more critical issues here:

  1. The Optimized Logical Plan contains redundant SubqueryAlias nodes.
  2. The Physical Plan contains redundant SubqueryAlias and Relation nodes.

Therefore, in the test cases, I think we should primarily focus on adding assertion checks for these issues, for example:
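
A hypothetical shape for those assertions, assuming the EXPLAIN COST text is captured in explainOutput (the variable printed below); a sketch, not the PR's actual test code:

// Split the EXPLAIN COST text into its two sections and check both issues.
val Array(optimized, physical) = explainOutput.split("== Physical Plan ==")
// Issue 1: the optimized logical plan should not retain SubqueryAlias nodes.
assert(!optimized.contains("SubqueryAlias"),
  "optimized logical plan still contains SubqueryAlias nodes")
// Issue 2: the physical plan should not embed the unoptimized logical subtree.
assert(!physical.contains("SubqueryAlias") && !physical.contains("+- Relation"),
  "physical plan still embeds the unoptimized logical subtree")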

In addition, I have printed out the result of explainOutput.

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
   +- Project [id#126, name#127], Statistics(sizeInBytes=1.0 B)
      +- Filter (id#126 > 0), Statistics(sizeInBytes=1.0 B)
         +- SubqueryAlias spark_catalog.default.source_table, Statistics(sizeInBytes=1.0 B)
            +- Relation spark_catalog.default.source_table[id#126,name#127] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
   +- *(1) Filter (isnotnull(id#126) AND (id#126 > 0))
      +- *(1) ColumnarToRow
         +- FileScan parquet spark_catalog.default.source_table[id#126,name#127] Batched: true, DataFilters: [isnotnull(id#126), (id#126 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yangjie01/SourceCode/git/spark-sbt/sql/core/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>

It seems that the second issue I described has been fixed, but the SubqueryAlias nodes still exist in the Optimized Logical Plan. Could you take a further look into this? @yuexing

@wangyum Is my description accurate? If there's anything incorrect, please help me correct it.

@yuexing (Contributor, Author) replied:

ok, I see. Let me also do an innerChildren fix in the Command class, which is the logical plan.

@yuexing (Contributor, Author) replied:

now the output is:

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, Project [id#16, name#17], [id, name]
   +- Project [id#16, name#17]
      +- Filter (id#16 > 0)
         +- Relation spark_catalog.default.source_table[id#16,name#17] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, Project [id#16, name#17], [id, name]
   +- *(1) Filter (isnotnull(id#16) AND (id#16 > 0))
      +- *(1) ColumnarToRow
         +- FileScan parquet spark_catalog.default.source_table[id#16,name#17] Batched: true, DataFilters: [isnotnull(id#16), (id#16 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yuexing/playground/spark/sql/core/spark-warehouse/org.apac..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>

def optimizedPlanWithoutSubqueries: LogicalPlan = {
  optimizedPlan match {
    case s: CreateDataSourceTableAsSelectCommand =>
      s.copy(query = EliminateSubqueryAliases(s.query))
@LuciferYang (Contributor) commented:

Why is only the EliminateSubqueryAliases optimization applied to s.query?

@yuexing (Contributor, Author) replied:

Because EliminateSubqueryAliases doesn't handle Commands. That's why we get this unexpected explain result (see the sketch after this list):

  • With CreateDataSourceTableAsSelectCommand as the logical plan, EliminateSubqueryAliases never touches command.query
  • With ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) as the physical plan, running it creates a new QueryExecution from command.query; only at that point does the query get EliminateSubqueryAliases (along with the other optimizations)
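
A sketch of the mismatch (the rule body in the comment is paraphrased from Catalyst; applying the rule directly to the inner query, as optimizedPlanWithoutSubqueries does above, is what removes the aliases):

import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// EliminateSubqueryAliases is essentially
//   plan.transformUp { case SubqueryAlias(_, child) => child }
// and transformUp only descends into `children`. The command is a leaf
// node whose `query` is exposed only via `innerChildren`, so the
// traversal never reaches the aliases inside it. Invoking the rule on
// the inner query directly sidesteps that:
def stripAliases(innerQuery: LogicalPlan): LogicalPlan =
  EliminateSubqueryAliases(innerQuery)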

@LuciferYang

@LuciferYang (Contributor) commented:

The result seems correct. Do you have time to review this one? @wangyum
