[SPARK-52576][SDP] Drop/recreate on full refresh and MV update
### What changes were proposed in this pull request?
Some pipeline runs result in wiping out and replacing all the data for a table:
- Every run of a materialized view
- Runs of streaming tables that have the "full refresh" flag
In the current implementation, this "wipe out and replace" is implemented by:
- Truncating the table
- Altering the table to drop/update/add columns that don't match the columns in the DataFrame for the current run
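As an illustration of the alter step described above, the following is a minimal sketch of how the column differences between the existing table and the current run's DataFrame could be computed. It is not the pipeline's actual code: the function name `plan_column_changes` and the representation of a schema as a plain dict of column name to type string are assumptions made for this example.

```python
def plan_column_changes(existing, desired):
    """Compare two schemas (hypothetical dicts of column name -> type string).

    Returns the columns that would have to be dropped, updated, or added
    so that the existing table matches the desired schema.
    """
    to_drop = sorted(name for name in existing if name not in desired)
    to_add = sorted(name for name in desired if name not in existing)
    to_update = sorted(
        name
        for name in existing
        if name in desired and existing[name] != desired[name]
    )
    return {"drop": to_drop, "update": to_update, "add": to_add}


# Example: one column removed, one retyped, one added.
plan = plan_column_changes(
    {"id": "int", "old": "string", "amt": "int"},
    {"id": "int", "amt": "double", "new": "string"},
)
print(plan)
```

Any non-empty `drop` list in such a plan requires the catalog to support dropping columns.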
We originally chose truncate + alter over drop / recreate because dropping a table has some undesirable effects. For example, it interrupts readers of the table and wipes away metadata such as ACLs.
However, we discovered that not all catalogs support dropping columns (e.g. Hive does not), and there’s no way to tell whether a catalog supports dropping columns or not. So this PR changes the implementation to drop/recreate the table instead of truncate/alter.
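The trade-off between the two strategies can be sketched with a toy in-memory catalog. Everything here is hypothetical and for illustration only: `ToyCatalog`, its `supports_drop_column` flag, and both helper functions are invented names, not Spark or Hive APIs. Note that the toy exposes the capability as a flag purely to set up the scenario; in a real catalog, as described above, there is no way to query it up front.

```python
class ToyCatalog:
    """Hypothetical in-memory catalog; not a Spark API."""

    def __init__(self, supports_drop_column):
        self.supports_drop_column = supports_drop_column
        self.tables = {}  # table name -> {"schema", "rows", "acls"}

    def create_table(self, name, schema):
        self.tables[name] = {"schema": dict(schema), "rows": [], "acls": set()}

    def drop_table(self, name):
        self.tables.pop(name, None)

    def truncate(self, name):
        self.tables[name]["rows"] = []

    def drop_column(self, name, column):
        if not self.supports_drop_column:
            raise NotImplementedError("catalog cannot drop columns")
        del self.tables[name]["schema"][column]


def truncate_and_alter(catalog, name, desired):
    """Old strategy: keep the table object, fix up its columns in place."""
    catalog.truncate(name)
    schema = catalog.tables[name]["schema"]
    for column in [c for c in schema if c not in desired]:
        catalog.drop_column(name, column)  # may fail on some catalogs
    for column, dtype in desired.items():
        schema[column] = dtype  # add a new column or update its type


def drop_and_recreate(catalog, name, desired):
    """New strategy: works on any catalog, but loses table-level metadata."""
    catalog.drop_table(name)
    catalog.create_table(name, desired)


hive_like = ToyCatalog(supports_drop_column=False)
hive_like.create_table("t", {"id": "int", "old": "string"})
hive_like.tables["t"]["acls"].add("alice")
desired = {"id": "int", "new": "string"}

try:
    truncate_and_alter(hive_like, "t", desired)
    alter_failed = False
except NotImplementedError:
    alter_failed = True

drop_and_recreate(hive_like, "t", desired)
```

In this sketch the alter path fails on the catalog that cannot drop columns, while drop / recreate succeeds at the cost of discarding the table's ACLs.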
### Why are the changes needed?
See section above.
### Does this PR introduce _any_ user-facing change?
Yes, see section above. No releases contained the old behavior.
### How was this patch tested?
- Tests in MaterializeTablesSuite
- Ran the tests in MaterializeTablesSuite with Hive instead of the default catalog
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#51280 from sryza/drop-on-full-refresh.
Authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>