
Commit f1204f3

Add initial ingestion job.
- Add unit test for profiler ingestion job deployment.
- Update profiler ingestion job to be wheel-based.
- Update profiler ingestion job unit tests.
- Update docs (#2110)
- Update docs
- Update docs (#2111)
- Update docs again
- Add table ingestion logic to profiler ingest job.
- Update table ingestion exception handling.
- Add duckdb dependency in Job Task definition.
- Correct library dependencies.
- Update entry point name to .
- Narrow exception handling for single table ingestion.
- Parse args in execute main method.
1 parent 2bf8292 commit f1204f3

11 files changed (+328, -91 lines)

docs/lakebridge/docs/assessment/analyzer/complexity_scoring.mdx

Lines changed: 1 addition & 27 deletions
@@ -36,32 +36,6 @@ If the analyzer encounters a SQL procedure or function body inside a SQL file, i
 
 Teradata MLOAD and FLOAD scripts follow the same rules as above.
 
-## Informatica Code Analysis
-At the beginning of mapping analysis, mark mapping with complexity level of **LOW**
-
-If any of the following conditions are true, then mark the mapping as **MEDIUM** complexity:
-1. Number of expressions with 5+ function calls between 2 and 4
-2. Number of sources > 1
-3. Number of joins >= 1
-4. Number of lookups between 4 and 6
-5. Number of targets > 1
-6. Overall function call count >= 10
-7. Number of components (transformations) >= 10
-
-If any of the following conditions are true, then mark the mapping as **COMPLEX** complexity:
-1. Three MEDIUM breaks from the list above
-2. Number of expressions with 5+ function calls between 5 and 7
-3. Number of mapping components >= 20
-4. Overall function call count >= 20
-5. Complex or Unstructured nodes are being used (e.g. Normalizer)
-6. Number of lookups between 7 and 14
-
-If any of the following conditions are true, then mark the mapping as **VERY COMPLEX** complexity:
-1. Three COMPLEX breaks from the list above
-2. Number of expressions with 5+ function calls > 7
-3. Number of lookups > 15
-4. Number of job components >= 50
-
 ## DataStage Analysis
 At the beginning of job analysis, mark job with complexity level of **LOW**
 
@@ -229,4 +203,4 @@ If any of the following conditions are true, then mark the mapping as **VERY COM
 3. Number of lookups > 15
 4. Number of job components >= 50
 
-<div data-theme-toc="true"> </div>
+<div data-theme-toc="true"> </div>

docs/lakebridge/docs/assessment/analyzer/export_metadata.mdx

Lines changed: 3 additions & 43 deletions
@@ -9,7 +9,7 @@ Analyzer expects all the legacy code to be exported into a folder accessible by
 All the major ETL platforms provide some kind of export of their code repositories. Typically this is done into XML or JSON formats which can be used to restore the environment. Here is a short guide for how to export metadata from various platforms:
 
 ## Microsoft SQL Server
-To extract metadata like Table, View, and Stored Procedures DDLs, you can use Microsoft SQL Server Management Studio (SSMS).
+To extract metadata like Table, View, and Stored Procedures DDLs, you can use Microsoft SQL Server Management Studio (SSMS).
 * In Object Explorer, expand the node for the instance containing the database to be scripted.
 * Right-click on the database you want to script, and select Tasks > Generate Scripts.
 <img src={useBaseUrl('img/sql-server-export-object-explorer.png')} style={{ width: 500 }} alt="sql-server-export-object-explorer" />
@@ -20,49 +20,18 @@ To extract metadata like Table, View, and Stored Procedures DDLs, you can use Mi
 <img src={useBaseUrl('img/sql-server-export-set-scripting-options.png')} style={{ width: 500 }} alt="sql-server-export-set-scripting-options" />
 See https://learn.microsoft.com/en-us/ssms/scripting/generate-and-publish-scripts-wizard for more details on how to use the Generate Scripts wizard in SSMS.
 
-## Azure Synapse (Dedicated)
+## Azure Synapse (Dedicated)
 Follow the same steps as for Microsoft SQL Server above. The only difference is that you will need to connect to the Synapse Dedicated SQL pool instead of a regular SQL Server instance.
 
 ## Azure Synapse (Serverless)
 If you use Synapse Studio and have your SQL code saved in SQL scripts, you can export the files with the [Export-AzSynapseSqlScript PowerShell cmdlet](https://learn.microsoft.com/en-us/powershell/module/az.synapse/export-azsynapsesqlscript?view=azps-14.2.0&viewFallbackFrom=azps-13.4.0). This method requires [Azure PowerShell modules](https://learn.microsoft.com/en-us/powershell/azure/install-Az-ps?view=azps-0.10.0).<br/>
 
-Otherwise, you can use Microsoft SQL Server Management Studio (SSMS) to extract metadata like Table, View, and Stored Procedures DDLs.
+Otherwise, you can use Microsoft SQL Server Management Studio (SSMS) to extract metadata like Table, View, and Stored Procedures DDLs.
 * Select “Object Explorer Details” under the View button in the toolbar
 <img src={useBaseUrl('img/synapse-objects-explorer-view.png')} alt="synapse-objects-explorer-view" />
 * For each object type, select the required objects to export and right-click on the selection to choose “Script as” > “CREATE To” > “File” as pictured below. <br/>
 <img src={useBaseUrl('img/synapse-objects-explorer-script-as.png')} style={{ width: 800}} alt="synapse-objects-explorer-script-as" />
 
-## PowerCenter
-
-* **Overview**<br/>
-To run the BladeBridge analyzer or converters on Informatica XMLs, the XML file first need to be extracted out of the PowerCenter repository. Typically, it is easier to deal with the analysis and conversion of a relatively granular level, so extracting the artifacts at the workflow level is advisable. Objects can be exported from Powercenter Repository Manager or using *pmrep* command
-
-* **Metadata Extraction**<br/>
-To extract the metadata out of PowerCenter repository, use the following commands:
-
-* **Connect to repository**<br/>
-```
-pmrep connect <list of credentials>
-```
-
-* **Get the list of folders**<br/>
-```
-pmrep listobjects -o FOLDER
-```
-
-* **For each folder, get the list of workflows**<br/>
-```
-pmrep listobjects -o WORKFLOW -f <your folder name>
-```
-
-* **Workflow extraction**<br/>
-Create a batch script with the following command template for each folder.
-
-Note: Excel can be used to create the script with the following command:
-```
-pmrep objectexport -n workflow_name -o WORKFLOW -f folder_name -b -r -m -s -u path-to-output-file
-```
-
 ## DataStage
 
 * Typically in DataStage the easiest way to export the objects is by using the GUI. However, Datastage has command line utilities to export via CLI.
@@ -92,12 +61,3 @@ Analyzer needs the .yxmd files. These can be obtained by Select File > Export to
 ## SAP Business Objects Data Services
 
 Instructions for export can be found in the following articles: https://help.sap.com/viewer/2d2abbb0fab34071a4c53b7de873241b/4.2.13/en-US/571901366d6d1014b3fc9283b0e91070.html https://help.sap.com/viewer/2d2abbb0fab34071a4c53b7de873241b/4.2.13/en-US/5718d4ba6d6d1014b3fc9283b0e91070.html
-
-## IICS / IDMC
-
-Select all the Mapping Configuration tasks you want to read the metadata from and export them as a single file.
-<img src={useBaseUrl('img/infacloud-export1.png')} alt="infacloud-export" />
-
-Note #1: Analyzer and Converter expect the metadata from InfaCloud to be preserved as zip files. Please do not change the content of these files.
-
-Note #2: InfaCloud zip files are deeply nested. Analyzer and Converter temporarily unzip the contents of the zip files into the folder locations associated with the output analyzer report and output code respectively. On Windows OS, please keep the paths specified in `-d`, `-r`, `-o` switches short, as the fully exploded path may exceed the Windows max path limitation

docs/lakebridge/docs/overview.mdx

Lines changed: 0 additions & 9 deletions
@@ -55,7 +55,6 @@ The table below summarizes the source platforms that we currently support:
 | Source Platform | BladeBridge | Morpheus | SQL | ETL/Orchestration | dbt Repointing (Experimental) |
 |:------------------------------|:-----------:|:--------:|:--------:|:-----------------:|:-----------------------------:|
 | DataStage | &#x2705; | | &#x2705; | &#x2705; | |
-| Informatica (PC) | &#x2705; | | &#x2705; | &#x2705; | |
 | Netezza | &#x2705; | | &#x2705; | | |
 | Oracle (incl. ADS & Exadata) | &#x2705; | | &#x2705; | | |
 | Snowflake | | &#x2705; | &#x2705; | | &#x2705; |
@@ -64,14 +63,6 @@ The table below summarizes the source platforms that we currently support:
 
 For more information on using the transpiler, refer to the [Transpile][3] documentation.
 
-:::danger Alert
-#### INFORMATICA CLOUD SUPPORT
-We recently discovered a bug that prevents the converter from launching correctly for Informatica Cloud sources.
-The underlying issue is understood and we are working on resolving it, but it will require significant work beyond a hotfix:
-it will probably be several weeks before this becomes available.
-Please reach out to your databricks rep.
-::::
-
 Post-migration Reconciliation
 -----------------------------

docs/lakebridge/docs/transpile/overview.mdx

Lines changed: 1 addition & 1 deletion
@@ -109,7 +109,7 @@ Finally, being built on the assumption that the input SQL is correct, Morpheus d
 
 ### BladeBridge
 
-**BladeBridge** is a flexible and extensible code conversion engine designed to accelerate modernization to Databricks. It supports a wide range of **ETL platforms** (e.g., Informatica, DataStage) and **SQL-based systems** (e.g., Oracle, Teradata, Netezza), accepting inputs in the form of exported metadata and scripts.
+**BladeBridge** is a flexible and extensible code conversion engine designed to accelerate modernization to Databricks. It supports a wide range of **ETL platforms** (e.g., DataStage) and **SQL-based systems** (e.g., Oracle, Teradata, Netezza), accepting inputs in the form of exported metadata and scripts.
 
 The converter generates **Databricks-compatible outputs** including:

docs/lakebridge/docs/transpile/pluggable_transpilers/bladebridge_configuration.mdx

Lines changed: 3 additions & 6 deletions
@@ -61,9 +61,6 @@ with the full path. Otherwise, the converter will look for the file in the same
 
 ```
 Name of the various files per source you might want to inherit from:
-- INFAPC:
-- Target SPARKSQL: "base_infapc2databricks_sparksql.json",
-- Target PYSPARK : "base_infapc2databricks_pyspark.json"
 - DATASTAGE:
 - Target SPARKSQL : "base_datastage2databricks_sparksql.json",
 - Target PYSPARK : "base_datastage2databricks_pyspark.json"
@@ -443,7 +440,7 @@ In addition to these structural and behavioral controls, ETL configuration files
 - SQL within a `SELECT` statement of a source component
 - `pre-SQL` or `post-SQL` snippets executed before or after data movement
 
-These supporting configuration files closely resemble the structure and purpose of SQL configuration files, but are scoped to **fragment-level transformations**, **function handling**, and **data manipulation tasks** commonly found in visual ETL platforms such as Informatica PowerCenter or IBM DataStage.
+These supporting configuration files closely resemble the structure and purpose of SQL configuration files, but are scoped to **fragment-level transformations**, **function handling**, and **data manipulation tasks** commonly found in visual ETL platforms such as IBM DataStage.
 
 ETL configuration thus serves as the orchestration layer that combines rule-based transformation with output formatting, system integration, and extensibility.
 
@@ -456,12 +453,12 @@ ETL configuration thus serves as the orchestration layer that combines rule-base
 | use_notebook_md | Indicates whether Databricks notebook markdown should be used. | 1 |
 | script_header | Adds a code block at the start of the generated script, often for imports or metadata. | # Databricks notebook source\n from datetime import datetime |
 | script_footer | Adds a code block at the end of the generated script | quit() |
-| rowid_expression | Specifies the expression used to compute a row ID. This is needed for InfaPC because of the way InfaPC links nodes in a mapping | xxhash64(%DELIMITED_COLUMN_LIST%) as %ROWID_COL_NAME% |
+| rowid_expression | Specifies the expression used to compute a row ID. | xxhash64(%DELIMITED_COLUMN_LIST%) as %ROWID_COL_NAME% |
 | rowid_column_name | Name of the column containing rowid | source_record_id |
 | dataset_creation_method | Indicates whether datasets are created as CTEs or tables. <br/>"TABLE" is typically used for lift and shift, but "CTE" can be used for custom dbt outputs | TABLE or CTE |
 | table_creation_statement | Template for creating a temporary table from a SQL block. | %TABLE_NAME% = spark.sql (rf\"\"\"%INNER_SQL%\"\"\"%FORMAT_SPEC%)<br/> %TABLE_NAME%. createOrReplaceTempView(\"`%TABLE_NAME%`\") |
 | ddl_statement_wrap | Wraps DDL statements in a Spark SQL invocation. | spark.sql(f"""%INNER_SQL%"""%FORMAT_SPEC%).display() |
-| etl_converter_config_file | Points to a secondary config file for ETL expression conversion. | infa2databricks.json |
+| etl_converter_config_file | Points to a secondary config file for ETL expression conversion. | base_datastage2databricks_pyspark.json |
 | commands | section on how to generate various read and write statements for different system types |
 | system_type_class | system type classifications |
 | conform_source_columns | instructs the writer to generate column-conforming statement for sources |

docs/lakebridge/docs/transpile/pluggable_transpilers/bladebridge_overview.mdx

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ import CodeBlock from '@theme/CodeBlock';
 
 It supports both:
 - **SQL-based platforms** (e.g., Oracle, Teradata, Netezza, Microsoft SQL Server)
-- **ETL platforms** (e.g., Informatica, DataStage)
+- **ETL platforms** (e.g. DataStage)
 
 ---

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -92,6 +92,7 @@ dependencies = [
 
 [project.entry-points.databricks]
 reconcile = "databricks.labs.lakebridge.reconcile.execute:main"
+profiler_dashboards = "databricks.labs.lakebridge.assessments.dashboards.execute:main"
 
 [tool.hatch.envs.default.scripts]
 test = "pytest --cov src --cov-report=xml tests/unit"
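
The commit message says the ingestion job is wheel-based and that a duckdb dependency was added to the Job Task definition, but the deployment code itself is not part of this diff. The snippet below is only a minimal sketch of how such a wheel task could be wired up with the Databricks SDK: the package name, cluster id, and parameter values are assumptions; the `profiler_dashboards` entry point name is the one declared in `pyproject.toml` above.

```python
# Minimal sketch, not the deployment code from this commit.
# Assumed: distribution name, cluster id, and example parameter values.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

ingest_task = jobs.Task(
    task_key="profiler_ingestion",
    existing_cluster_id="<cluster-id>",  # placeholder
    python_wheel_task=jobs.PythonWheelTask(
        package_name="databricks_labs_lakebridge",  # assumed wheel name
        entry_point="profiler_dashboards",          # declared in pyproject.toml
        # positional args consumed by main(): catalog, schema, extract path, source tech
        parameters=["main_catalog", "profiler_schema", "/Volumes/main/profiler/extract.duckdb", "synapse"],
    ),
    # duckdb attached at the task level, per the commit message
    libraries=[compute.Library(pypi=compute.PythonPyPiLibrary(package="duckdb"))],
)

created = w.jobs.create(name="lakebridge-profiler-ingestion", tasks=[ingest_task])
print(f"Created ingestion job {created.job_id}")
```

Attaching duckdb as a task library rather than a cluster-wide library is consistent with "Add duckdb dependency in Job Task definition"; whether the real job uses an existing cluster or a job cluster is not visible from this diff.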
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
+import logging
+import os
+import sys
+
+import duckdb
+from pyspark.sql import SparkSession
+
+from databricks.labs.lakebridge.assessments.profiler_validator import EmptyTableValidationCheck, build_validation_report
+
+logger = logging.getLogger(__name__)
+
+
+def main(*argv) -> None:
+    logger.debug(f"Arguments received: {argv}")
+    assert len(sys.argv) == 4, f"Invalid number of arguments: {len(sys.argv)}"
+    catalog_name = sys.argv[0]
+    schema_name = sys.argv[1]
+    extract_location = sys.argv[2]
+    source_tech = sys.argv[3]
+    logger.info(f"Validating {source_tech} profiler extract located at '{extract_location}'.")
+    valid_extract = _validate_profiler_extract(extract_location)
+    if valid_extract:
+        _ingest_profiler_tables(catalog_name, schema_name, extract_location)
+    else:
+        raise ValueError("Corrupt or invalid profiler extract.")
+
+
+def _validate_profiler_extract(extract_location: str) -> bool:
+    logger.info("Validating the profiler extract file.")
+    validation_checks = []
+    try:
+        with duckdb.connect(database=extract_location) as duck_conn:
+            tables = duck_conn.execute("SHOW ALL TABLES").fetchall()
+            for table in tables:
+                fq_table_name = f"{table[0]}.{table[1]}.{table[2]}"
+                empty_check = EmptyTableValidationCheck(fq_table_name)
+                validation_checks.append(empty_check)
+            report = build_validation_report(validation_checks, duck_conn)
+    except duckdb.IOException as e:
+        logger.exception(f"Could not access the profiler extract: '{extract_location}'.")
+        raise e
+    except Exception as e:
+        logger.exception(f"Unable to validate the profiler extract: '{extract_location}'.")
+        raise e
+
+    if len(report) > 0:
+        report_errors = list(filter(lambda x: x.outcome == "FAIL" and x.severity == "ERROR", report))
+        num_errors = len(report_errors)
+        logger.info(f"There are {num_errors} validation errors in the profiler extract.")
+        for error in report_errors:
+            logger.info(error)
+    else:
+        raise ValueError("Profiler extract validation report is empty.")
+    return num_errors == 0
+
+
+def _ingest_profiler_tables(catalog_name: str, schema_name: str, extract_location: str) -> None:
+    try:
+        with duckdb.connect(database=extract_location) as duck_conn:
+            tables_to_ingest = duck_conn.execute("SHOW ALL TABLES").fetchall()
+    except duckdb.IOException as e:
+        logger.error(f"Could not access the profiler extract: '{extract_location}': {e}")
+        raise duckdb.IOException(f"Could not access the profiler extract: '{extract_location}'.") from e
+    except Exception as e:
+        logger.error(f"Unable to read tables from profiler extract: '{extract_location}': {e}")
+        raise e
+
+    if len(tables_to_ingest) == 0:
+        raise ValueError("Profiler extract contains no tables.")
+
+    successful_tables = []
+    unsuccessful_tables = []
+    for source_table in tables_to_ingest:
+        try:
+            fq_source_table_name = f"{source_table[0]}.{source_table[1]}.{source_table[2]}"
+            fq_delta_table_name = f"{catalog_name}.{schema_name}.{source_table[2]}"
+            logger.info(f"Ingesting profiler table: '{fq_source_table_name}'")
+            _ingest_table(extract_location, fq_source_table_name, fq_delta_table_name)
+            successful_tables.append(fq_source_table_name)
+        except (ValueError, IndexError, TypeError) as e:
+            logger.error(f"Failed to construct source and destination table names: {e}")
+            unsuccessful_tables.append(source_table)
+        except duckdb.Error as e:
+            logger.error(f"Failed to ingest table from profiler database: {e}")
+            unsuccessful_tables.append(source_table)
+    logger.info(f"Ingested {len(successful_tables)} tables from profiler extract.")
+    logger.info(",".join(successful_tables))
+    logger.info(f"Failed to ingest {len(unsuccessful_tables)} tables from profiler extract.")
+    logger.info(",".join(str(table) for table in unsuccessful_tables))
+
+
+def _ingest_table(extract_location: str, source_table_name: str, target_table_name: str) -> None:
+    """
+    Ingest a table from a DuckDB profiler extract into a managed Delta table in Unity Catalog.
+    """
+    try:
+        with duckdb.connect(database=extract_location, read_only=True) as duck_conn:
+            query = f"SELECT * FROM {source_table_name}"
+            pdf = duck_conn.execute(query).df()
+            # Save table as a managed Delta table in Unity Catalog
+            logger.info(f"Saving profiler table '{target_table_name}' to Unity Catalog.")
+            spark = SparkSession.builder.getOrCreate()
+            df = spark.createDataFrame(pdf)
+            df.write.format("delta").mode("overwrite").saveAsTable(target_table_name)
+    except duckdb.CatalogException as e:
+        logger.error(f"Could not find source table '{source_table_name}' in profiler extract: {e}")
+        raise duckdb.CatalogException(f"Could not find source table '{source_table_name}' in profiler extract.") from e
+    except duckdb.IOException as e:
+        logger.error(f"Could not access the profiler extract: '{extract_location}': {e}")
+        raise duckdb.IOException(f"Could not access the profiler extract: '{extract_location}'.") from e
+    except Exception as e:
+        logger.error(f"Unable to ingest table '{source_table_name}' from profiler extract: {e}")
+        raise e
+
+
+if __name__ == "__main__":
+    # Ensure that the ingestion job is being run on a Databricks cluster
+    if "DATABRICKS_RUNTIME_VERSION" not in os.environ:
+        raise SystemExit("The Lakebridge profiler ingestion job is only intended to run in a Databricks Runtime.")
+    main(*sys.argv)
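
The unit tests mentioned in the commit message are not shown in this view. As a rough illustration of how the validation path could be exercised locally, here is a hypothetical pytest sketch; it assumes the new module lives at the path implied by the `pyproject.toml` entry point, and that `EmptyTableValidationCheck` reports an empty table as outcome "FAIL" with severity "ERROR" (which is how `_validate_profiler_extract` filters the report).

```python
# Hypothetical test sketch, not the unit tests added by this commit.
import duckdb

from databricks.labs.lakebridge.assessments.dashboards import execute  # assumed module path


def _make_extract(path: str, with_rows: bool) -> None:
    # Stand-in for a profiler extract: a DuckDB file with a single table.
    with duckdb.connect(database=path) as conn:
        conn.execute("CREATE TABLE profile_summary (object_name VARCHAR, object_count INTEGER)")
        if with_rows:
            conn.execute("INSERT INTO profile_summary VALUES ('tables', 42)")


def test_validation_flags_empty_tables(tmp_path):
    extract = str(tmp_path / "empty_extract.duckdb")
    _make_extract(extract, with_rows=False)
    assert execute._validate_profiler_extract(extract) is False


def test_validation_passes_populated_extract(tmp_path):
    extract = str(tmp_path / "populated_extract.duckdb")
    _make_extract(extract, with_rows=True)
    assert execute._validate_profiler_extract(extract) is True
```

The `__main__` guard on `DATABRICKS_RUNTIME_VERSION` and the Spark write in `_ingest_table` mean the ingestion path itself still needs a Databricks cluster; only the DuckDB validation step lends itself to plain local tests like these.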
