[SPARK-48660][SQL] Fix explain result for CreateTableAsSelect #51013


Open: wants to merge 6 commits into master

Conversation

@yuexing (Contributor) commented May 25, 2025

What changes were proposed in this pull request?

To fix the explain result of 'CreateTableAsSelect', I traced how ExplainCommand processes the plan and found that the following work is needed:

  • CreateDataSourceTableAsSelectCommand should implement stats
  • ExecutedCommandExec should also implement innerChildren for CreateDataSourceTableAsSelectCommand

Here's why:

  • For 'CREATE TABLE ... AS SELECT', CreateDataSourceTableAsSelectCommand ultimately becomes the logical plan
  • ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) ultimately becomes the physical plan

Thus, to produce the expected output (a sketch of the two changes follows this list):

  • CreateDataSourceTableAsSelectCommand adds a stats implementation
  • ExecutedCommandExec adds innerChildren so QueryExecution.stringWithStats can render the command's subtree
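
Below is a minimal, self-contained sketch of that mechanism (plain Scala, no Spark dependency; every name here is a simplified stand-in rather than Spark's real API). The stats-aware explain printer walks both children and innerChildren and asks each node for stats, so the command must (a) report stats and (b) expose the query it wraps:

case class Stats(sizeInBytes: BigInt)

trait Node {
  def label: String
  def children: Seq[Node] = Nil
  def innerChildren: Seq[Node] = Nil
  def stats: Stats

  // Rough analogue of the tree rendering in QueryExecution.stringWithStats.
  def render(depth: Int = 0): String = {
    val line = "   " * depth + s"$label, Statistics(sizeInBytes=${stats.sizeInBytes}.0 B)"
    (line +: (innerChildren ++ children).map(_.render(depth + 1))).mkString("\n")
  }
}

case class Relation(label: String, stats: Stats) extends Node

case class CtasCommand(query: Node) extends Node {
  val label = "CreateDataSourceTableAsSelectCommand"
  override def innerChildren: Seq[Node] = query :: Nil // change 2: expose the subtree
  override def stats: Stats = query.stats              // change 1: delegate stats
}

object ExplainSketch extends App {
  // Prints the command line followed by its inner query, with statistics.
  println(CtasCommand(Relation("Relation source_table", Stats(1))).render())
}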

Why are the changes needed?

As reported in SPARK-48660, the explain result for 'CreateTableAsSelect' is incorrect: it is missing statistics in the logical plan and optimization of the physical plan.

Does this PR introduce any user-facing change?

Yes, it affects the explain result.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label May 25, 2025
@HyukjinKwon HyukjinKwon changed the title SPARK-48660 fix explain result for CreateTableAsSelect [SPARK-48660][SQL] Fix explain result for CreateTableAsSelect May 26, 2025
@LuciferYang (Contributor) commented:

cc @wangyum FYI

@LuciferYang (Contributor) commented:

Could you fix the failing tests first? @yuexing

@@ -414,9 +414,20 @@ case class WriteDelta(
 trait V2CreateTableAsSelectPlan
   extends V2CreateTablePlan
   with AnalysisOnlyCommand
-  with CTEInChildren {
+  with CTEInChildren
+  with ExecutableDuringAnalysis {
@LuciferYang (Contributor) commented May 26, 2025:

@yuexing Does this pull request only address the scenario for DataSource V2? Is there no such issue for V1? Or was the fix for V1 omitted?

@yuexing (Contributor, Author) replied:

@LuciferYang I'm taking another look now. This fix is most likely not right, since it replaces the whole structure of CreateTableAsSelect in both the logical and physical views.

@yuexing (Contributor, Author) replied:

I'm trying something else now: I implemented CreateDataSourceTableAsSelectCommand.stats and ExecutedCommandExec.innerChildren. Here's how I understand the code:

  • For 'CREATE TABLE ... AS SELECT', CreateDataSourceTableAsSelectCommand ultimately becomes the logical plan
  • ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) ultimately becomes the physical plan

Thus, to produce the expected output:

  • CreateDataSourceTableAsSelectCommand adds a stats implementation
  • ExecutedCommandExec adds innerChildren so QueryExecution.stringWithStats can render the command's subtree

@yuexing (Contributor, Author) replied:

and it becomes something like this:

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, [id, name]
   +- Project [id#16, name#17], Statistics(sizeInBytes=1.0 B)
      +- Filter (id#16 > 0), Statistics(sizeInBytes=1.0 B)
         +- SubqueryAlias spark_catalog.default.source_table, Statistics(sizeInBytes=1.0 B)
            +- Relation spark_catalog.default.source_table[id#16,name#17] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, [id, name]
   +- *(1) Filter (isnotnull(id#16) AND (id#16 > 0))
      +- *(1) ColumnarToRow
         +- FileScan parquet spark_catalog.default.source_table[id#16,name#17] Batched: true, DataFilters: [isnotnull(id#16), (id#16 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yuexing/playground/spark/sql/core/spark-warehouse/org.apac..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>
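
For reference, output like this can be produced with EXPLAIN COST over a CTAS. A hypothetical reproduction inferred from the plan above (assumes a spark-shell session where spark is in scope; this is not the PR's test code):

// Hypothetical reproduction, inferred from the plan output above.
spark.sql("CREATE TABLE source_table (id INT, name STRING) USING parquet")
spark.sql(
  """EXPLAIN COST
    |CREATE TABLE target_table USING parquet
    |AS SELECT id, name FROM source_table WHERE id > 0
    |""".stripMargin).show(truncate = false)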

@@ -300,4 +300,38 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSparkSession {
         stop = 57))
     }
   }
+
+  test("SPARK-48660: EXPLAIN COST should show statistics") {
@LuciferYang (Contributor) commented:

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
      +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
         +- Filter (dt#15 >= 2023-11-29)
            +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
               +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand
   +- CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
         +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
            +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
               +- Filter (dt#15 >= 2023-11-29)
                  +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
                     +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet

From the cases provided by @wangyum, I believe there are two more critical issues here:

  1. The Optimized Logical Plan contains redundant SubqueryAlias nodes.
  2. The Physical Plan contains redundant SubqueryAlias and Relation nodes.

Therefore, in the test cases, I think we should primarily focus on adding assertion checks for these issues, for example:
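
A hypothetical shape for those assertions, assuming the EXPLAIN COST text is captured in explainOutput (the variable printed below); a sketch, not the PR's actual test code:

// Split the EXPLAIN COST text into its two sections and check both issues.
val Array(optimized, physical) = explainOutput.split("== Physical Plan ==")
// Issue 1: the optimized logical plan should not retain SubqueryAlias nodes.
assert(!optimized.contains("SubqueryAlias"),
  "optimized logical plan still contains SubqueryAlias nodes")
// Issue 2: the physical plan should not embed the unoptimized logical subtree.
assert(!physical.contains("SubqueryAlias") && !physical.contains("+- Relation"),
  "physical plan still embeds the unoptimized logical subtree")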

In addition, I have printed out the result of explainOutput.

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
   +- Project [id#126, name#127], Statistics(sizeInBytes=1.0 B)
      +- Filter (id#126 > 0), Statistics(sizeInBytes=1.0 B)
         +- SubqueryAlias spark_catalog.default.source_table, Statistics(sizeInBytes=1.0 B)
            +- Relation spark_catalog.default.source_table[id#126,name#127] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
   +- *(1) Filter (isnotnull(id#126) AND (id#126 > 0))
      +- *(1) ColumnarToRow
         +- FileScan parquet spark_catalog.default.source_table[id#126,name#127] Batched: true, DataFilters: [isnotnull(id#126), (id#126 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yangjie01/SourceCode/git/spark-sbt/sql/core/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>

It seems that the second issue I described has been fixed, but the SubqueryAlias nodes still exist in the Optimized Logical Plan. Could you take a further look into this? @yuexing

@wangyum Is my description accurate? If there's anything incorrect, please help me correct it.

@yuexing (Contributor, Author) replied:

ok, I see. Let me also do an innerChildren fix in the Command class, which is the logical plan.

@yuexing (Contributor, Author) replied:

now the output is:

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, Project [id#16, name#17], [id, name]
   +- Project [id#16, name#17]
      +- Filter (id#16 > 0)
         +- Relation spark_catalog.default.source_table[id#16,name#17] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, Project [id#16, name#17], [id, name]
   +- *(1) Filter (isnotnull(id#16) AND (id#16 > 0))
      +- *(1) ColumnarToRow
         +- FileScan parquet spark_catalog.default.source_table[id#16,name#17] Batched: true, DataFilters: [isnotnull(id#16), (id#16 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yuexing/playground/spark/sql/core/spark-warehouse/org.apac..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>

def optimizedPlanWithoutSubqueries: LogicalPlan = {
  optimizedPlan match {
    case s: CreateDataSourceTableAsSelectCommand =>
      s.copy(query = EliminateSubqueryAliases(s.query))
@LuciferYang (Contributor) commented:

Why is only the EliminateSubqueryAliases optimization applied to s.query?

@yuexing (Contributor, Author) replied:

Because EliminateSubqueryAliases doesn't handle Commands. That's why we get this unexpected explain result (see the sketch after this list):

  • With CreateDataSourceTableAsSelectCommand as the logical plan, EliminateSubqueryAliases never touches command.query
  • With ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) as the physical plan, running it creates a new QueryExecution from command.query; only at that point does the query get EliminateSubqueryAliases (along with the other optimizations)
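
A sketch of the mismatch (the rule body in the comment is paraphrased from Catalyst; applying the rule directly to the inner query, as optimizedPlanWithoutSubqueries does above, is what removes the aliases):

import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// EliminateSubqueryAliases is essentially
//   plan.transformUp { case SubqueryAlias(_, child) => child }
// and transformUp only descends into `children`. The command is a leaf
// node whose `query` is exposed only via `innerChildren`, so the
// traversal never reaches the aliases inside it. Invoking the rule on
// the inner query directly sidesteps that:
def stripAliases(innerQuery: LogicalPlan): LogicalPlan =
  EliminateSubqueryAliases(innerQuery)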

@LuciferYang

@LuciferYang (Contributor) commented:

The result seems correct. Do you have time to review this one? @wangyum
