[SPARK-48660][SQL] Fix explain result for CreateTableAsSelect #51013
Changes from 2 commits
0925a7f
2f12c3e
6ee4ff4
7af5054
b1a9127
d443587
@@ -300,4 +300,36 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSparkSession {
        stop = 57))
    }
  }

  test("SPARK-48660: EXPLAIN COST should show statistics") {
From the cases provided by @wangyum, I believe there are two more critical issues here:
Therefore, in the test cases, I think we should primarily focus on making assertion checks for these issues. In addition, I have printed out the result of
It seems that the second issue I described has been fixed, but the
@wangyum Is my description accurate? If there's anything incorrect, please help me correct it.

OK, I see. Let me also do an innerChildren fix in the Command class, which is the logical plan.

Now the output is:
== Optimized Logical Plan ==
== Physical Plan ==
    withTable("source_table") {
      // Create source table with data
      sql("""
        CREATE TABLE source_table (
          id INT,
          name STRING
        ) USING PARQUET
      """)

      // Get explain output for CTAS
      val explainResult = sql("""
        EXPLAIN COST
        CREATE TABLE target_table
        USING PARQUET
        AS SELECT * FROM source_table WHERE id > 0
      """).collect()

      val explainOutput = explainResult.map(_.getString(0)).mkString("\n")
      println(explainOutput)

      // The explain output should contain statistics information
      assert(explainOutput.contains("Statistics"),
        s"EXPLAIN COST output should contain statistics information. Output: $explainOutput")

      // The explain output should contain pushdown information
      assert(explainOutput.contains("PushedFilters"),
        s"EXPLAIN COST output should contain pushdown information. Output: $explainOutput")
    }
  }

}
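The two `contains` checks above search the whole EXPLAIN output, so they cannot tell which plan section a match came from. One possible way to target the assertions is to split the output on its `== ... ==` headers first. This is only a sketch in plain Scala with no Spark dependency: `ExplainSections` and `split` are hypothetical names that are not part of Spark or this PR, and the sample text is an abbreviated, hand-written stand-in for real EXPLAIN COST output.

```scala
// Hypothetical helper, not part of Spark or of this PR.
object ExplainSections {
  // Matches section headers such as "== Optimized Logical Plan ==".
  private val Header = """^== (.+) ==$""".r

  /** Splits EXPLAIN output into a map from section title to section body. */
  def split(explainOutput: String): Map[String, String] = {
    val sections =
      scala.collection.mutable.LinkedHashMap.empty[String, StringBuilder]
    var current: Option[String] = None
    explainOutput.split("\n").foreach { line =>
      line.trim match {
        case Header(title) =>
          // Start a new section; lines before the first header are ignored.
          current = Some(title)
          sections.getOrElseUpdate(title, new StringBuilder)
        case _ =>
          current.foreach(t => sections(t).append(line).append('\n'))
      }
    }
    sections.map { case (k, v) => k -> v.toString }.toMap
  }
}

object ExplainSectionsDemo extends App {
  // Abbreviated stand-in for EXPLAIN COST output (assumption, not real output).
  val sample =
    """== Optimized Logical Plan ==
      |Project [id#16, name#17], Statistics(sizeInBytes=1.0 B)
      |== Physical Plan ==
      |FileScan parquet [id#16,name#17] PushedFilters: [IsNotNull(id), GreaterThan(id,0)]
      |""".stripMargin

  val sections = ExplainSections.split(sample)

  // Statistics are expected in the optimized logical plan section.
  assert(sections("Optimized Logical Plan").contains("Statistics"))
  // Pushdown information is expected in the physical plan section.
  assert(sections("Physical Plan").contains("PushedFilters"))
}
```

In the test itself, `explainOutput` could be passed to `split` in place of `sample`, so that a failure message can name the section that is missing the expected text.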
@yuexing Does this pull request only address the scenario for DataSource V2? Is there no such issue for V1? Or was the fix for V1 omitted?
@LuciferYang I'm taking another look now. This fix is most likely not right, as it replaces the whole structure of CreateTableAsSelect in both the logical and physical views.
I'm trying something else now. I implemented CreateDataSourceTableAsSelectCommand.stats and ExecutedCommandExec.innerChildren. Here's how I understand the code
Thus, to have the expected output
And it becomes something like this:
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]+- Project [id#16, name#17], Statistics(sizeInBytes=1.0 B)
+- Filter (id#16 > 0), Statistics(sizeInBytes=1.0 B)
+- SubqueryAlias spark_catalog.default.source_table, Statistics(sizeInBytes=1.0 B)
+- Relation spark_catalog.default.source_table[id#16,name#17] parquet, Statistics(sizeInBytes=0.0 B)
== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`target_table`, ErrorIfExists, [id, name]+- *(1) Filter (isnotnull(id#16) AND (id#16 > 0))
+- *(1) ColumnarToRow
+- FileScan parquet spark_catalog.default.source_table[id#16,name#17] Batched: true, DataFilters: [isnotnull(id#16), (id#16 > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yuexing/playground/spark/sql/core/spark-warehouse/org.apac..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0)], ReadSchema: struct<id:int,name:string>