
Commit c30d744

Fix issue with migrating MANAGED hive_metastore table to UC (#2892)
- HMS MANAGED tables, when deleted, also delete their underlying data.
- If an HMS managed table is migrated to UC as EXTERNAL, dropping the HMS table will delete the underlying data files and render the UC table unusable, leading to non-recoverable data loss.
- Changing the MANAGED table to EXTERNAL may have consequences for regulatory data cleanup, as deleting the EXTERNAL table no longer deletes the underlying data. This would cause data leakage when tables are dropped.
- As with the option of duplicating the data, if new data is added to either the HMS or the UC table, the other goes out of sync and requires re-migration.

Tradeoffs are described in this table:

<img width="639" alt="image" src="https://github.com/user-attachments/assets/04456618-aee4-4859-88b9-c612a9629429">

Resolves #2838
1 parent 752cbec commit c30d744
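To make the failure mode concrete, here is a minimal PySpark sketch of the data-loss scenario described above. The table names are hypothetical, it assumes a Databricks session where `spark` is available, and it assumes the `SYNC ... AS EXTERNAL` syntax from the Databricks SQL docs applies to this managed table:

```python
# Minimal sketch; table names are hypothetical.
# A MANAGED hive_metastore table owns its underlying files.
spark.sql("CREATE TABLE hive_metastore.sales.orders (id INT, amount DOUBLE)")
spark.sql("INSERT INTO hive_metastore.sales.orders VALUES (1, 9.99)")

# Syncing it to UC as EXTERNAL points the UC table at the same files.
spark.sql("SYNC TABLE main.sales.orders AS EXTERNAL FROM hive_metastore.sales.orders")

# Dropping the MANAGED source deletes those files, leaving the UC table
# pointing at nothing -- a non-recoverable loss.
spark.sql("DROP TABLE hive_metastore.sales.orders")
spark.sql("SELECT * FROM main.sales.orders")  # now fails: data files are gone
```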

File tree

11 files changed (+260, -23 lines)

README.md

Lines changed: 9 additions & 9 deletions
@@ -588,15 +588,15 @@ Each of the upgraded objects will be marked with an `upgraded_from` property.
This property will be used to identify the original location of the object in the metastore.
We also add an `upgraded_from_workspace_id` property to the upgraded object, to identify the source workspace.

| Object Type | Description | Upgrade Method |
|---|---|---|
| EXTERNAL_SYNC | Tables not saved to the DBFS file system that are supported by the sync command.<br/>These tables are in one of the following formats: DELTA, PARQUET, CSV, JSON, ORC, TEXT, AVRO | During the upgrade process, the table contents remain intact and the metadata is recreated in UC using the sync SQL command.<br/>More information about the sync command can be found [here](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-sync.html) |
| EXTERNAL_HIVESERDE | Tables with table type "HIVE" that are not supported by the sync command | We provide two workflows for hiveserde table migration:<br/>1. Migrate all hiveserde tables using CTAS, which we officially support.<br/>2. Migrate certain types of hiveserde in place. This is technically working, but the user needs to accept the risk that old files created by hiveserde may not be processed correctly by the Spark datasource in corner cases.<br/>The user needs to decide which workflow to run first; that workflow migrates the hiveserde tables and marks the `upgraded_to` property, so those tables are skipped in later migration workflow runs. |
| EXTERNAL_NO_SYNC | Tables not saved to the DBFS file system that are not supported by the sync command | The upgrade process migrates these tables to UC by creating a new managed table in UC and copying the data from the old table to the new one. The new table's format is Delta. |
| DBFS_ROOT_DELTA | Tables saved to the DBFS file system that are in Delta format | The upgrade process creates a copy of these tables in UC using the "deep clone" command.<br/>More information about the deep clone command can be found [here](https://docs.databricks.com/en/sql/language-manual/delta-clone.html) |
| DBFS_ROOT_NON_DELTA | Tables saved to the DBFS file system that are not in Delta format | The upgrade process creates a managed table using CTAS |
| VIEW | Database views | Views are recreated during the upgrade process. The view's definition is modified to point to the new UC tables. Views should be migrated only after all dependent tables have been migrated; the upgrade process accounts for view-to-view dependencies. |
+| MANAGED | Tables created as managed tables in hive_metastore | Depends on the WorkspaceConfig property `managed_table_external_storage`:<br/>1. If set to the default CLONE (selected during installation), the UC table is created via CTAS, which makes a copy of the data in UC.<br/>2. If set to SYNC_AS_EXTERNAL, the UC table is created as an EXTERNAL table. This carries a risk: dropping the managed HMS table drops the underlying data and affects the UC table as well. |
The upgrade process can be triggered using the `migrate-tables` [UCX command](#migrate-tables-command)

Or by running the table migration workflows deployed to the workspace.
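As a rough illustration of the two MANAGED-table strategies described in the new table row, here is a sketch with hypothetical table names; the real workflow lives in UCX's table-migration code, and the `SYNC ... AS EXTERNAL` form is assumed from the Databricks SQL docs:

```python
def migrate_managed_table(spark, src: str, dst: str, strategy: str = "CLONE") -> None:
    """Sketch of the two strategies; not the actual UCX implementation."""
    if strategy == "CLONE":
        # CTAS copies the data into a new UC managed table, so dropping
        # the HMS original later cannot affect the UC copy.
        spark.sql(f"CREATE TABLE {dst} AS SELECT * FROM {src}")
    elif strategy == "SYNC_AS_EXTERNAL":
        # Reuses the existing files (no copy), but dropping the MANAGED
        # HMS table deletes those files and breaks the UC table.
        spark.sql(f"SYNC TABLE {dst} AS EXTERNAL FROM {src}")
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

# Example (hypothetical names):
# migrate_managed_table(spark, "hive_metastore.sales.orders",
#                       "main.sales.orders", strategy="CLONE")
```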

src/databricks/labs/ucx/config.py

Lines changed: 2 additions & 0 deletions
@@ -74,6 +74,8 @@ class WorkspaceConfig:  # pylint: disable=too-many-instance-attributes
     # [INTERNAL ONLY] Whether the assessment should lint only specific dashboards.
     include_dashboard_ids: list[str] | None = None
 
+    managed_table_external_storage: str = 'CLONE'
+
     def replace_inventory_variable(self, text: str) -> str:
         return text.replace("$inventory", f"hive_metastore.{self.inventory_database}")
 
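A minimal sketch of reading the new field downstream (assuming `inventory_database` is the only required constructor argument; the branch logic is illustrative, not the actual UCX code path):

```python
from databricks.labs.ucx.config import WorkspaceConfig

config = WorkspaceConfig(inventory_database="ucx")

# Defaults to 'CLONE'; the installer may set 'SYNC_AS_EXTERNAL' instead.
if config.managed_table_external_storage == "SYNC_AS_EXTERNAL":
    print("MANAGED HMS tables will be synced to UC as EXTERNAL tables")
else:
    print("MANAGED HMS tables will be copied into UC via CTAS")
```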
