[SPARK-48660][SQL] Fix explain result for CreateTableAsSelect #51013
Conversation
cc @wangyum FYI
Could you fix the failed tests first? @yuexing
@@ -414,9 +414,20 @@ case class WriteDelta(
 trait V2CreateTableAsSelectPlan
   extends V2CreateTablePlan
   with AnalysisOnlyCommand
-  with CTEInChildren {
+  with CTEInChildren
+  with ExecutableDuringAnalysis {
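For context, ExecutableDuringAnalysis is the hook that lets EXPLAIN display a substitute plan for a command instead of its executable form. A rough sketch of the trait's shape, assumed from Spark's catalyst sources rather than quoted verbatim (the member name may differ across versions):

// Assumed shape, not quoted from Spark's sources: a command that runs
// eagerly during analysis exposes a separate plan for EXPLAIN to display.
trait ExecutableDuringAnalysis extends Command {
  /** The logical plan that EXPLAIN should display for this command. */
  def stageForExplain(): LogicalPlan
}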
@yuexing Does this pull request only address the scenario for DataSource V2? Is there no such issue for V1? Or was the fix for V1 omitted?
@LuciferYang I'm taking another look now. This fix is most likely not right, as it replaces the whole structure of CreateTableAsSelect in both the logical view and the physical view.
I'm trying something else now. I implemented CreateDataSourceTableAsSelectCommand.stats and ExecutedCommandExec.innerChildren. Here's how I understand the code.

For 'CREATE TABLE ... AS SELECT':
- CreateDataSourceTableAsSelectCommand ends up as the logical plan
- ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) ends up as the physical plan

Thus, to get the expected output:
- CreateDataSourceTableAsSelectCommand adds a stats implementation (see the sketch below)
- ExecutedCommandExec adds innerChildren for QueryExecution.stringWithStats
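To make the stats half of that concrete, here is a self-contained toy model in plain Scala; every name below is illustrative rather than Spark's real API. It shows why delegating a command's stats to its SELECT query gives EXPLAIN COST something meaningful to print:

// Toy model, not Spark's API: Command-like nodes default to placeholder
// stats, so a CTAS node prints nothing useful for EXPLAIN COST unless it
// delegates to its query.
final case class Statistics(sizeInBytes: BigInt)

trait Plan {
  def stats: Statistics = Statistics(sizeInBytes = 1) // placeholder default
}

case class Scan(sizeInBytes: BigInt) extends Plan {
  override def stats: Statistics = Statistics(sizeInBytes)
}

// CTAS-like command: the SELECT query is a plain field.
case class CreateTableAsSelectCmd(query: Plan) extends Plan {
  // The fix in miniature: delegate statistics to the query.
  override def stats: Statistics = query.stats
}

object StatsDemo extends App {
  val ctas = CreateTableAsSelectCmd(Scan(sizeInBytes = 1024))
  println(ctas.stats) // Statistics(1024) instead of the placeholder
}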
and it becomes something like this
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
+- Project [id#16, name#17], Statistics(sizeInBytes=1.0 B)
   +- Filter (id#16 > 0), Statistics(sizeInBytes=1.0 B)
      +- SubqueryAlias spark_catalog.default.source_table, Statistics(sizeInBytes=1.0 B)
         +- Relation spark_catalog.default.source_table[id#16,name#17] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
+- *(1) Filter (isnotnull(id#16) AND (id#16 > 0))
   +- *(1) ColumnarToRow
      +- FileScan parquet spark_catalog.default.source_table[id#16,name#17] Batched: true, DataFilters: [isnotnull(id#16), (id#16 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yuexing/playground/spark/sql/core/spark-warehouse/org.apac..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>
@@ -300,4 +300,38 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSparkSession {
         stop = 57))
     }
   }
+
+  test("SPARK-48660: EXPLAIN COST should show statistics") {
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
+- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
      +- Filter (dt#15 >= 2023-11-29)
         +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
            +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand
+- CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
      +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
         +- Filter (dt#15 >= 2023-11-29)
            +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
               +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet
From the cases provided by @wangyum, I believe there are two more critical issues here:
- The Optimized Logical Plan contains redundant SubqueryAlias nodes.
- The Physical Plan contains redundant SubqueryAlias and Relation nodes.

Therefore, in the test cases, I think we should primarily focus on making assertion checks for these issues.
In addition, I have printed out the result of explainOutput.
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
+- Project [id#126, name#127], Statistics(sizeInBytes=1.0 B)
   +- Filter (id#126 > 0), Statistics(sizeInBytes=1.0 B)
      +- SubqueryAlias spark_catalog.default.source_table, Statistics(sizeInBytes=1.0 B)
         +- Relation spark_catalog.default.source_table[id#126,name#127] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]
+- *(1) Filter (isnotnull(id#126) AND (id#126 > 0))
   +- *(1) ColumnarToRow
      +- FileScan parquet spark_catalog.default.source_table[id#126,name#127] Batched: true, DataFilters: [isnotnull(id#126), (id#126 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yangjie01/SourceCode/git/spark-sbt/sql/core/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>
It seems that the second issue I described has been fixed, but the SubqueryAlias nodes still exist in the Optimized Logical Plan. Could you take a further look into this? @yuexing
@wangyum Is my description accurate? If there's anything incorrect, please help me correct it.
OK, I see. Let me also do an innerChildren fix in the Command class, which is the logical plan.
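As a self-contained illustration of why innerChildren is the right hook (plain Scala, illustrative names only, not Spark's real TreeNode): tree printing walks children and innerChildren, so a leaf command that keeps its query in a plain field prints an empty subtree until that field is surfaced:

// Toy model of TreeNode-style printing: the query subtree only shows up
// once innerChildren exposes the command's query field.
trait Node {
  def label: String
  def children: Seq[Node] = Nil
  def innerChildren: Seq[Node] = Nil
  def treeString(indent: Int = 0): String =
    (("  " * indent + label) +:
      (children ++ innerChildren).map(_.treeString(indent + 1))).mkString("\n")
}

case class Leaf(label: String) extends Node

// CTAS-like leaf command: the query is a field, not a child.
case class CtasCmd(query: Node, surfaceQuery: Boolean) extends Node {
  val label: String = "CreateDataSourceTableAsSelectCommand"
  override def innerChildren: Seq[Node] = if (surfaceQuery) Seq(query) else Nil
}

object TreeDemo extends App {
  val q = Leaf("Project +- Filter +- Relation")
  println(CtasCmd(q, surfaceQuery = false).treeString()) // command line only
  println(CtasCmd(q, surfaceQuery = true).treeString())  // query shown beneath
}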
now the output is:
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, Project [id#16, name#17], [id, name]
+- Project [id#16, name#17]
   +- Filter (id#16 > 0)
      +- Relation spark_catalog.default.source_table[id#16,name#17] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, Project [id#16, name#17], [id, name]
+- *(1) Filter (isnotnull(id#16) AND (id#16 > 0))
   +- *(1) ColumnarToRow
      +- FileScan parquet spark_catalog.default.source_table[id#16,name#17] Batched: true, DataFilters: [isnotnull(id#16), (id#16 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yuexing/playground/spark/sql/core/spark-warehouse/org.apac..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>
def optimizedPlanWithoutSubqueries: LogicalPlan = {
  optimizedPlan match {
    case s: CreateDataSourceTableAsSelectCommand =>
      s.copy(query = EliminateSubqueryAliases(s.query))
Why is only the EliminateSubqueryAliases optimization applied to s.query?
Because EliminateSubqueryAliases doesn't handle Command; that's why we get this unexpected explain result:
- With CreateDataSourceTableAsSelectCommand as the logical plan, EliminateSubqueryAliases never touches command.query.
- With ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) as the physical plan, running it creates a new QueryExecution from command.query; only at that point does the query get EliminateSubqueryAliases (as well as the other optimizations). See the sketch below.
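A minimal sketch of that mechanism in plain Scala (illustrative names, not Spark's real rule): a rewrite that recurses through the tree never reaches a query stored as a plain field on a leaf command, so it has to be applied to command.query explicitly, which is what optimizedPlanWithoutSubqueries does:

// Toy model: alias elimination walks the tree but treats the command as a
// leaf, so the alias inside command.query survives until the rule is
// applied to that field directly.
sealed trait Plan
case class Relation(name: String) extends Plan
case class SubqueryAlias(alias: String, child: Plan) extends Plan
case class CtasCommand(query: Plan) extends Plan // leaf: query is a field

object AliasDemo extends App {
  def eliminateAliases(plan: Plan): Plan = plan match {
    case SubqueryAlias(_, child) => eliminateAliases(child)
    case r: Relation             => r
    case c: CtasCommand          => c // never descends into c.query
  }

  val plan = CtasCommand(SubqueryAlias("src", Relation("source_table")))
  assert(eliminateAliases(plan) == plan) // alias survives inside the command
  val fixed = plan.copy(query = eliminateAliases(plan.query)) // explicit fix
  assert(fixed == CtasCommand(Relation("source_table")))
  println(fixed)
}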
The result seems correct. Do you have time to review this one? @wangyum
What changes were proposed in this pull request?
To fix the explain result of 'CreateTableAsSelect', based on how ExplainCommand processes the plan, I found the following work is needed.
Here's why: for 'CREATE TABLE ... AS SELECT', CreateDataSourceTableAsSelectCommand ends up as the logical plan, and ExecutedCommandExec(cmd=CreateDataSourceTableAsSelectCommand) ends up as the physical plan.
Thus, to get the expected output:
- CreateDataSourceTableAsSelectCommand adds a stats implementation.
- ExecutedCommandExec adds innerChildren for QueryExecution.stringWithStats.
- An optimizedPlanWithoutSubqueries helper strips the SubqueryAlias nodes from the command's query via EliminateSubqueryAliases.
Why are the changes needed?
As reported in SPARK-48660, the explain result of 'CreateTableAsSelect' is incorrect: it is missing statistics for the logical plan and optimizations for the physical plan.
Does this PR introduce any user-facing change?
Yes, it affects the explain result.
How was this patch tested?
UT.
Was this patch authored or co-authored using generative AI tooling?
No