[SPARK-48660][SQL] Fix explain result for CreateTableAsSelect #51013

Open · wants to merge 6 commits into base: master
@@ -414,9 +414,20 @@ case class WriteDelta(
 trait V2CreateTableAsSelectPlan
   extends V2CreateTablePlan
   with AnalysisOnlyCommand
-  with CTEInChildren {
+  with CTEInChildren
+  with ExecutableDuringAnalysis {
LuciferYang (Contributor), May 26, 2025:
@yuexing Does this pull request only address the scenario for DataSource V2? Is there no such issue for V1? Or was the fix for V1 omitted?

yuexing (Contributor, Author):

@LuciferYang I'm taking another look now. This fix is most likely not right, as it replaces the whole structure of CreateTableAsSelect in both the logical and physical views.

yuexing (Contributor, Author):

I'm trying something else now. I implemented CreateDataSourceTableAsSelectCommand.stats and ExecutedCommandExec.innerChildren. Here's how I understand the code:

  • For 'CREATE TABLE ... AS SELECT':
  • CreateDataSourceTableAsSelectCommand ultimately becomes the LogicalPlan
  • ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) ultimately becomes the PhysicalPlan

Thus, to produce the expected output:

  • CreateDataSourceTableAsSelectCommand adds a stats implementation
  • ExecutedCommandExec adds innerChildren for QueryExecution.stringWithStats
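A minimal sketch of what those two overrides might look like (hypothetical method bodies against Spark's internal types, not the exact patch):

```scala
// Sketch only; assumes Spark internals (LogicalPlan, Statistics, QueryPlan).

// In CreateDataSourceTableAsSelectCommand (the logical plan for CTAS):
// delegate statistics to the underlying SELECT so EXPLAIN COST has
// something meaningful to print for the command node.
override def stats: Statistics = query.stats

// In ExecutedCommandExec (the physical wrapper):
// expose the wrapped command as an inner child so
// QueryExecution.stringWithStats prints its subtree.
override def innerChildren: Seq[QueryPlan[_]] = cmd :: Nil
```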

yuexing (Contributor, Author):

and the output becomes something like this:

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, [id, name]
+- Project [id#16, name#17], Statistics(sizeInBytes=1.0 B)
   +- Filter (id#16 > 0), Statistics(sizeInBytes=1.0 B)
      +- SubqueryAlias spark_catalog.default.source_table, Statistics(sizeInBytes=1.0 B)
         +- Relation spark_catalog.default.source_table[id#16,name#17] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, [id, name]
+- *(1) Filter (isnotnull(id#16) AND (id#16 > 0))
   +- *(1) ColumnarToRow
      +- FileScan parquet spark_catalog.default.source_table[id#16,name#17] Batched: true, DataFilters: [isnotnull(id#16), (id#16 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yuexing/playground/spark/sql/core/spark-warehouse/org.apac..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>

  def query: LogicalPlan

  /**
   * SPARK-48660: Override to return the optimized query plan for explain.
   */
  override def stageForExplain(): LogicalPlan = query

  /**
   * SPARK-48660: Override to return the query's statistics for explain.
   */
  override def stats: Statistics = query.stats

  override def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
    withNameAndQuery(newName = name, newQuery = WithCTE(query, cteDefs))
  }
@@ -300,4 +300,36 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSparkSession {
stop = 57))
}
}

test("SPARK-48660: EXPLAIN COST should show statistics") {
Contributor:

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
      +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
         +- Filter (dt#15 >= 2023-11-29)
            +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
               +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand
   +- CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
         +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
            +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
               +- Filter (dt#15 >= 2023-11-29)
                  +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
                     +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet

From the cases provided by @wangyum , I believe there are two more critical issues here:

  1. The Optimized Logical Plan contains redundant SubqueryAlias nodes.
  2. The Physical Plan contains redundant SubqueryAlias and Relation nodes.

Therefore, in the test cases, I think we should primarily focus on adding assertions for these issues.
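A hedged sketch of what such assertions might look like inside the test, assuming the `explainOutput` string built later in the test body (section names match the dumps above):

```scala
// Sketch only: split the EXPLAIN COST output on the physical-plan header
// and check each section for the redundant nodes described above.
val Array(optimizedSection, physicalSection) =
  explainOutput.split("== Physical Plan ==")

// Issue 1: the optimized logical plan should not keep the SubqueryAlias wrapper.
assert(!optimizedSection.contains("SubqueryAlias"),
  s"Optimized plan should not contain SubqueryAlias. Output: $explainOutput")

// Issue 2: the physical plan should not repeat logical Relation nodes.
assert(!physicalSection.contains("+- Relation"),
  s"Physical plan should not repeat Relation nodes. Output: $explainOutput")
```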

In addition, I have printed out the result of explainOutput.

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
   +- Project [id#126, name#127], Statistics(sizeInBytes=1.0 B)
      +- Filter (id#126 > 0), Statistics(sizeInBytes=1.0 B)
         +- SubqueryAlias spark_catalog.default.source_table, Statistics(sizeInBytes=1.0 B)
            +- Relation spark_catalog.default.source_table[id#126,name#127] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
   +- *(1) Filter (isnotnull(id#126) AND (id#126 > 0))
      +- *(1) ColumnarToRow
         +- FileScan parquet spark_catalog.default.source_table[id#126,name#127] Batched: true, DataFilters: [isnotnull(id#126), (id#126 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yangjie01/SourceCode/git/spark-sbt/sql/core/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>

It seems that the second issue I described has been fixed, but the SubqueryAlias nodes still exist in the Optimized Logical Plan. Could you take a further look into this? @yuexing

@wangyum Is my description accurate? If there's anything incorrect, please help me correct it.

yuexing (Contributor, Author):

OK, I see. Let me also do an innerChildren fix in the Command class, which is the logical plan.
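A minimal sketch of that idea, assuming Spark's internal Command/QueryPlan types (hypothetical placement, not the exact patch):

```scala
// Sketch only: expose the optimized SELECT as an inner child of the
// logical CTAS command, so its subtree prints once in the logical plan
// instead of the stale pre-optimization copy.
override def innerChildren: Seq[QueryPlan[_]] = query :: Nil
```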

yuexing (Contributor, Author):

Now the output is:

== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, Project [id#16, name#17], [id, name]
+- Project [id#16, name#17]
   +- Filter (id#16 > 0)
      +- Relation spark_catalog.default.source_table[id#16,name#17] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand spark_catalog.default.target_table, ErrorIfExists, Project [id#16, name#17], [id, name]
+- *(1) Filter (isnotnull(id#16) AND (id#16 > 0))
   +- *(1) ColumnarToRow
      +- FileScan parquet spark_catalog.default.source_table[id#16,name#17] Batched: true, DataFilters: [isnotnull(id#16), (id#16 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yuexing/playground/spark/sql/core/spark-warehouse/org.apac..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>

    withTable("source_table") {
      // Create source table with data
      sql("""
        CREATE TABLE source_table (
          id INT,
          name STRING
        ) USING PARQUET
      """)

      // Get explain output for CTAS
      val explainResult = sql("""
        EXPLAIN COST
        CREATE TABLE target_table
        USING PARQUET
        AS SELECT * FROM source_table WHERE id > 0
      """).collect()

      val explainOutput = explainResult.map(_.getString(0)).mkString("\n")

      // The explain output should contain statistics information
      assert(explainOutput.contains("Statistics"),
        s"EXPLAIN COST output should contain statistics information. Output: $explainOutput")

      // The explain output should contain pushdown information
      assert(explainOutput.contains("PushedFilters"),
        s"EXPLAIN COST output should contain pushdown information. Output: $explainOutput")
    }
  }

}