
Commit b2cea5f

sryza authored and asl3 committed
[SPARK-52576][SDP] Drop/recreate on full refresh and MV update
### What changes were proposed in this pull request?

Some pipeline runs result in wiping out and replacing all the data for a table:

- Every run of a materialized view
- Runs of streaming tables that have the "full refresh" flag

In the current implementation, this "wipe out and replace" is implemented by:

- Truncating the table
- Altering the table to drop/update/add columns that don't match the columns in the DataFrame for the current run

The reason we originally wanted to truncate + alter instead of drop/recreate is that dropping has some undesirable effects, e.g. it interrupts readers of the table and wipes away things like ACLs. However, we discovered that not all catalogs support dropping columns (e.g. Hive does not), and there's no way to tell whether a given catalog supports dropping columns. So this PR changes the implementation to drop/recreate the table instead of truncate/alter; a minimal sketch of the pattern follows below.

### Why are the changes needed?

See the section above.

### Does this PR introduce _any_ user-facing change?

Yes, see the section above. No releases contained the old behavior.

### How was this patch tested?

- Tests in MaterializeTablesSuite
- Ran the tests in MaterializeTablesSuite with Hive instead of the default catalog

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51280 from sryza/drop-on-full-refresh.

Authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
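A minimal sketch of the drop/recreate pattern, using only the stock DataSource V2 `TableCatalog` API. This is not the PR's actual code; `replaceTable` and its parameters are hypothetical names:

```scala
import java.util.Collections

import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType

// Hypothetical helper illustrating "wipe out and replace" via drop/recreate.
// Dropping removes both the data and the old columns in one step, so no
// ALTER TABLE ... DROP COLUMN support (which e.g. Hive lacks) is required.
// The trade-off: readers are interrupted and catalog metadata such as ACLs
// is lost along with the table.
def replaceTable(
    catalog: TableCatalog,
    identifier: Identifier,
    newSchema: StructType): Unit = {
  if (catalog.tableExists(identifier)) {
    catalog.dropTable(identifier)
  }
  // Recreate the table with the schema of the current run's DataFrame.
  catalog.createTable(
    identifier,
    newSchema,
    Array.empty[Transform],
    Collections.emptyMap[String, String]())
}
```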
1 parent 802bcc2 commit b2cea5f

File tree

1 file changed: +5 -4 lines


sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/DatasetManager.scala

Lines changed: 5 additions & 4 deletions

@@ -178,12 +178,13 @@ object DatasetManager extends Logging {
     }

     // Wipe the data if we need to
-    if ((isFullRefresh || !table.isStreamingTable) && existingTableOpt.isDefined) {
-      context.spark.sql(s"TRUNCATE TABLE ${table.identifier.quotedString}")
+    val dropTable = (isFullRefresh || !table.isStreamingTable) && existingTableOpt.isDefined
+    if (dropTable) {
+      catalog.dropTable(identifier)
     }

     // Alter the table if we need to
-    if (existingTableOpt.isDefined) {
+    if (existingTableOpt.isDefined && !dropTable) {
       val existingSchema = existingTableOpt.get.schema()

       val targetSchema = if (table.isStreamingTable && !isFullRefresh) {
@@ -198,7 +199,7 @@ object DatasetManager extends Logging {
     }

     // Create the table if we need to
-    if (existingTableOpt.isEmpty) {
+    if (dropTable || existingTableOpt.isEmpty) {
       catalog.createTable(
         identifier,
         new TableInfo.Builder()
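Read together, the three hunks reduce to a single decision per run. A self-contained sketch of that decision logic (hypothetical names, not part of the patch):

```scala
// Which of the three branches in DatasetManager applies for a given run.
sealed trait TableAction
case object DropAndRecreate extends TableAction // wipe-and-replace runs
case object AlterInPlace extends TableAction    // incremental streaming-table runs
case object CreateFresh extends TableAction     // table does not exist yet

def decide(
    isFullRefresh: Boolean,
    isStreamingTable: Boolean,
    tableExists: Boolean): TableAction = {
  // Mirrors the patched condition: materialized views always wipe, and
  // streaming tables wipe only on full refresh.
  val dropTable = (isFullRefresh || !isStreamingTable) && tableExists
  if (dropTable) DropAndRecreate
  else if (tableExists) AlterInPlace
  else CreateFresh
}
```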
