Add initial ingestion job.

goodwillpunning · goodwillpunning · commit f1204f3c9734 · 2025-10-28T12:29:57.000-04:00
Add unit test for profiler ingestion job deployment.
Update profiler ingestion job to be wheel-based.
Update profiler ingestion job unit tests.
Update docs (#2110)
Update docs
Update docs (#2111)
Update docs again
Add table ingestion logic to profiler ingest job.
Update table ingestion exception handling.
Add duckdb dependency in Job Task definition.
Correct library dependencies.
Update entry point name to .
Narrow exception handling for single table ingestion.
Parse args in execute main method.
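
Note on table naming: the ingestion step takes each row returned by DuckDB's `SHOW ALL TABLES` (database, schema and table name in the first three columns) and writes it to a Unity Catalog target that keeps only the table name. The sketch below illustrates that mapping under those assumptions. `map_table_names` and the sample values are hypothetical and are not part of this change.

```python
from typing import Any, Sequence, Tuple


def map_table_names(row: Sequence[Any], catalog_name: str, schema_name: str) -> Tuple[str, str]:
    """Return (source, target) names for one `SHOW ALL TABLES` row.

    Hypothetical helper that mirrors the f-strings used in
    _ingest_profiler_tables; it is not defined in the module itself.
    """
    database, schema, table = row[0], row[1], row[2]
    # Fully qualified name inside the DuckDB extract.
    source = f"{database}.{schema}.{table}"
    # The target drops the DuckDB database and schema and keeps only the table name.
    target = f"{catalog_name}.{schema_name}.{table}"
    return source, target


# Sample values are made up for illustration.
print(map_table_names(("profiler_extract", "main", "usage"), "lakebridge", "profiler"))
```

Because only the table name survives the mapping, two extract tables that share a name across different DuckDB schemas would resolve to the same Delta table, and the later write (which uses `mode("overwrite")`) would replace the earlier one.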
diff --git a/docs/lakebridge/docs/assessment/analyzer/complexity_scoring.mdx b/docs/lakebridge/docs/assessment/analyzer/complexity_scoring.mdx
@@ -36,32 +36,6 @@ If the analyzer encounters a SQL procedure or function body inside a SQL file, i
 
 Teradata MLOAD and FLOAD scripts follow the same rules as above.
 
-## Informatica Code Analysis
-At the beginning of mapping analysis, mark mapping with complexity level of **LOW**
-
-If any of the following conditions are true, then mark the mapping as **MEDIUM** complexity:
-1. Number of expressions with 5+ function calls between 2 and 4
-2. Number of sources > 1
-3. Number of joins >= 1
-4. Number of lookups between 4 and 6
-5. Number of targets > 1
-6. Overall function call count >= 10
-7. Number of components (transformations) >= 10
-
-If any of the following conditions are true, then mark the mapping as **COMPLEX** complexity:
-1. Three MEDIUM breaks from the list above
-2. Number of expressions with 5+ function calls between 5 and 7
-3. Number of mapping components >= 20
-4. Overall function call count >= 20
-5. Complex or Unstructured nodes are being used (e.g. Normalizer)
-6. Number of lookups between 7 and 14
-
-If any of the following conditions are true, then mark the mapping as **VERY COMPLEX** complexity:
-1. Three COMPLEX breaks from the list above
-2. Number of expressions with 5+ function calls > 7
-3. Number of lookups > 15
-4. Number of job components >= 50
-
 ## DataStage Analysis
 At the beginning of job analysis, mark job with complexity level of **LOW**
 
@@ -229,4 +203,4 @@ If any of the following conditions are true, then mark the mapping as **VERY COM
 3. Number of lookups > 15
 4. Number of job components >= 50
 
-<div data-theme-toc="true"> </div>
+<div data-theme-toc="true"> </div>
diff --git a/docs/lakebridge/docs/assessment/analyzer/export_metadata.mdx b/docs/lakebridge/docs/assessment/analyzer/export_metadata.mdx
@@ -9,7 +9,7 @@ Analyzer expects all the legacy code to be exported into a folder accessible by
 All the major ETL platforms provide some kind of export of their code repositories. Typically this is done into XML or JSON formats which can be used to restore the environment. Here is a short guide for how to export metadata from various platforms:
 
 ## Microsoft SQL Server
-To extract metadata like Table, View, and Stored Procedures DDLs, you can use Microsoft SQL Server Management Studio (SSMS). 
+To extract metadata like Table, View, and Stored Procedures DDLs, you can use Microsoft SQL Server Management Studio (SSMS).
 * In Object Explorer, expand the node for the instance containing the database to be scripted.
 * Right-click on the database you want to script, and select Tasks > Generate Scripts.
   <img src={useBaseUrl('img/sql-server-export-object-explorer.png')} style={{ width: 500 }} alt="sql-server-export-object-explorer" />
@@ -20,49 +20,18 @@ To extract metadata like Table, View, and Stored Procedures DDLs, you can use Mi
   <img src={useBaseUrl('img/sql-server-export-set-scripting-options.png')} style={{ width: 500 }} alt="sql-server-export-set-scripting-options" />
 See https://learn.microsoft.com/en-us/ssms/scripting/generate-and-publish-scripts-wizard for more details on how to use the Generate Scripts wizard in SSMS.
 
-## Azure Synapse (Dedicated) 
+## Azure Synapse (Dedicated)
 Follow the same steps as for Microsoft SQL Server above.  The only difference is that you will need to connect to the Synapse Dedicated SQL pool instead of a regular SQL Server instance.
 
 ## Azure Synapse (Serverless)
 If you use Synapse Studio and have your SQL code saved in SQL scripts, you can export the files with the [Export-AzSynapseSqlScript PowerShell cmdlet](https://learn.microsoft.com/en-us/powershell/module/az.synapse/export-azsynapsesqlscript?view=azps-14.2.0&viewFallbackFrom=azps-13.4.0). This method requires [Azure PowerShell modules](https://learn.microsoft.com/en-us/powershell/azure/install-Az-ps?view=azps-0.10.0).<br/>
 
-Otherwise, you can use Microsoft SQL Server Management Studio (SSMS) to extract metadata like Table, View, and Stored Procedures DDLs. 
+Otherwise, you can use Microsoft SQL Server Management Studio (SSMS) to extract metadata like Table, View, and Stored Procedures DDLs.
 * Select “Object Explorer Details” under the View button in the toolbar
   <img src={useBaseUrl('img/synapse-objects-explorer-view.png')} alt="synapse-objects-explorer-view" />
 * For each object type, select the required objects to export and right-click on the selection to choose “Script as” > “CREATE To” > “File” as pictured below. <br/>
   <img src={useBaseUrl('img/synapse-objects-explorer-script-as.png')} style={{ width: 800}} alt="synapse-objects-explorer-script-as" />
 
-## PowerCenter
-
-* **Overview**<br/>
-  To run the BladeBridge analyzer or converters on Informatica XMLs, the XML file first need to be extracted out of the PowerCenter repository. Typically, it is easier to deal with the analysis and conversion of a relatively granular level, so extracting the artifacts at the workflow level is advisable.  Objects can be exported from Powercenter Repository Manager or using *pmrep* command
-
-* **Metadata Extraction**<br/>
-  To extract the metadata out of PowerCenter repository, use the following commands:
-
-* **Connect to repository**<br/>
-  ```
-  pmrep connect <list of credentials>
-  ```
-
-* **Get the list of folders**<br/>
-  ```
-  pmrep listobjects -o FOLDER
-  ```
-
-* **For each folder, get the list of workflows**<br/>
-  ```
-  pmrep listobjects -o WORKFLOW -f <your folder name>
-  ```
-
-* **Workflow extraction**<br/>
-  Create a batch script with the following command template for each folder.
-
-  Note: Excel can be used to create the script with the following command:
-  ```
-  pmrep objectexport -n workflow_name -o WORKFLOW -f folder_name -b -r -m -s -u path-to-output-file
-  ```
-
 ## DataStage
 
 * Typically in DataStage the easiest way to export the objects is by using the GUI.  However, Datastage has command line utilities to export via CLI.
@@ -92,12 +61,3 @@ Analyzer needs the .yxmd files. These can be obtained by Select File > Export to
 ## SAP Business Objects Data Services
 
 Instructions for export can be found in the following articles: https://help.sap.com/viewer/2d2abbb0fab34071a4c53b7de873241b/4.2.13/en-US/571901366d6d1014b3fc9283b0e91070.html https://help.sap.com/viewer/2d2abbb0fab34071a4c53b7de873241b/4.2.13/en-US/5718d4ba6d6d1014b3fc9283b0e91070.html
-
-## IICS / IDMC
-
-Select all the Mapping Configuration tasks you want to read the metadata from and export them as a single file.
-<img src={useBaseUrl('img/infacloud-export1.png')} alt="infacloud-export" />
-
-Note #1: Analyzer and Converter expect the metadata from InfaCloud to be preserved as zip files.  Please do not change the content of these files.
-
-Note #2: InfaCloud zip files are deeply nested.  Analyzer and Converter temporarily unzip the contents of the zip files into the folder locations associated with the output analyzer report and output code respectively.  On Windows OS, please keep the paths specified in `-d`, `-r`, `-o` switches short, as the fully exploded path may exceed the Windows max path limitation
diff --git a/docs/lakebridge/docs/overview.mdx b/docs/lakebridge/docs/overview.mdx
@@ -55,7 +55,6 @@ The table below summarizes the source platforms that we currently support:
 | Source Platform               | BladeBridge | Morpheus |   SQL    | ETL/Orchestration | dbt Repointing (Experimental) |
 |:------------------------------|:-----------:|:--------:|:--------:|:-----------------:|:-----------------------------:|
 | DataStage                     |  &#x2705;   |          | &#x2705; |     &#x2705;      |                               |
-| Informatica (PC)              |  &#x2705;   |          | &#x2705; |     &#x2705;      |                               |
 | Netezza                       |  &#x2705;   |          | &#x2705; |                   |                               |
 | Oracle (incl. ADS & Exadata)  |  &#x2705;   |          | &#x2705; |                   |                               |
 | Snowflake                     |             | &#x2705; | &#x2705; |                   |           &#x2705;            |
@@ -64,14 +63,6 @@ The table below summarizes the source platforms that we currently support:
 
 For more information on using the transpiler, refer to the [Transpile][3] documentation.
 
-:::danger Alert
-#### INFORMATICA CLOUD SUPPORT
-We recently discovered a bug that prevents the converter from launching correctly for Informatica Cloud sources.
-The underlying issue is understood and we are working on resolving it, but it will require significant work beyond a hotfix:
-it will probably be several weeks before this becomes available.
-Please reach out to your databricks rep.
-::::
-
 Post-migration Reconciliation
 -----------------------------
 
diff --git a/docs/lakebridge/docs/transpile/overview.mdx b/docs/lakebridge/docs/transpile/overview.mdx
@@ -109,7 +109,7 @@ Finally, being built on the assumption that the input SQL is correct, Morpheus d
 
 ### BladeBridge
 
-**BladeBridge** is a flexible and extensible code conversion engine designed to accelerate modernization to Databricks. It supports a wide range of **ETL platforms** (e.g., Informatica, DataStage) and **SQL-based systems** (e.g., Oracle, Teradata, Netezza), accepting inputs in the form of exported metadata and scripts.
+**BladeBridge** is a flexible and extensible code conversion engine designed to accelerate modernization to Databricks. It supports a wide range of **ETL platforms** (e.g., DataStage) and **SQL-based systems** (e.g., Oracle, Teradata, Netezza), accepting inputs in the form of exported metadata and scripts.
 
 The converter generates **Databricks-compatible outputs** including:
 
diff --git a/docs/lakebridge/docs/transpile/pluggable_transpilers/bladebridge_configuration.mdx b/docs/lakebridge/docs/transpile/pluggable_transpilers/bladebridge_configuration.mdx
@@ -61,9 +61,6 @@ with the full path. Otherwise, the converter will look for the file in the same
 
 ```
     Name of the various files per source you might want to inherit from:
-      - INFAPC:
-          - Target SPARKSQL: "base_infapc2databricks_sparksql.json",
-          - Target PYSPARK : "base_infapc2databricks_pyspark.json"
       - DATASTAGE:
           - Target SPARKSQL : "base_datastage2databricks_sparksql.json",
           - Target PYSPARK : "base_datastage2databricks_pyspark.json"
@@ -443,7 +440,7 @@ In addition to these structural and behavioral controls, ETL configuration files
   - SQL within a `SELECT` statement of a source component
   - `pre-SQL` or `post-SQL` snippets executed before or after data movement
 
-These supporting configuration files closely resemble the structure and purpose of SQL configuration files, but are scoped to **fragment-level transformations**, **function handling**, and **data manipulation tasks** commonly found in visual ETL platforms such as Informatica PowerCenter or IBM DataStage.
+These supporting configuration files closely resemble the structure and purpose of SQL configuration files, but are scoped to **fragment-level transformations**, **function handling**, and **data manipulation tasks** commonly found in visual ETL platforms such as IBM DataStage.
 
 ETL configuration thus serves as the orchestration layer that combines rule-based transformation with output formatting, system integration, and extensibility.
 
@@ -456,12 +453,12 @@ ETL configuration thus serves as the orchestration layer that combines rule-base
 | use_notebook_md                               | Indicates whether Databricks notebook markdown should be used.                         | 1                                                                      |
 | script_header                                 | Adds a code block at the start of the generated script, often for imports or metadata. | # Databricks notebook source\n from datetime import datetime       |
 | script_footer                                 | Adds a code block at the end of the generated script | quit()      |
-| rowid_expression                              | Specifies the expression used to compute a row ID.  This is needed for InfaPC because of the way InfaPC links nodes in a mapping                                     | xxhash64(%DELIMITED_COLUMN_LIST%) as %ROWID_COL_NAME%                  |
+| rowid_expression                              | Specifies the expression used to compute a row ID.                                       | xxhash64(%DELIMITED_COLUMN_LIST%) as %ROWID_COL_NAME%                  |
 | rowid_column_name | Name of the column containing rowid | source_record_id |
 | dataset_creation_method                       | Indicates whether datasets are created as CTEs or tables.  <br/>"TABLE" is typically used for lift and shift, but "CTE" can be used for custom dbt outputs                            | TABLE or CTE                                                                 |
 | table_creation_statement                      | Template for creating a temporary table from a SQL block.                              | %TABLE_NAME% = spark.sql (rf\"\"\"%INNER_SQL%\"\"\"%FORMAT_SPEC%)<br/> %TABLE_NAME%. createOrReplaceTempView(\"`%TABLE_NAME%`\")  |
 | ddl_statement_wrap                            | Wraps DDL statements in a Spark SQL invocation.                                        | spark.sql(f"""%INNER_SQL%"""%FORMAT_SPEC%).display()                   |
-| etl_converter_config_file                     | Points to a secondary config file for ETL expression conversion.                       | infa2databricks.json                                                   |
+| etl_converter_config_file                     | Points to a secondary config file for ETL expression conversion.                       | base_datastage2databricks_pyspark.json                                                   |
 | commands | section on how to generate various read and write statements for different system types |
 | system_type_class | system type classifications |
 | conform_source_columns | instructs the writer to generate column-conforming statement for sources |
diff --git a/docs/lakebridge/docs/transpile/pluggable_transpilers/bladebridge_overview.mdx b/docs/lakebridge/docs/transpile/pluggable_transpilers/bladebridge_overview.mdx
@@ -9,7 +9,7 @@ import CodeBlock from '@theme/CodeBlock';
 
 It supports both:
 - **SQL-based platforms** (e.g., Oracle, Teradata, Netezza, Microsoft SQL Server)
-- **ETL platforms** (e.g., Informatica, DataStage)
+- **ETL platforms** (e.g., DataStage)
 
 ---
 
diff --git a/pyproject.toml b/pyproject.toml
@@ -92,6 +92,7 @@ dependencies = [
 
 [project.entry-points.databricks]
 reconcile = "databricks.labs.lakebridge.reconcile.execute:main"
+profiler_dashboards = "databricks.labs.lakebridge.assessments.dashboards.execute:main"
 
 [tool.hatch.envs.default.scripts]
 test         = "pytest --cov src --cov-report=xml tests/unit"
diff --git a/src/databricks/labs/lakebridge/assessments/dashboards/execute.py b/src/databricks/labs/lakebridge/assessments/dashboards/execute.py
@@ -0,0 +1,120 @@
+import logging
+import os
+import sys
+
+import duckdb
+from pyspark.sql import SparkSession
+
+from databricks.labs.lakebridge.assessments.profiler_validator import EmptyTableValidationCheck, build_validation_report
+
+logger = logging.getLogger(__name__)
+
+
+def main(*argv) -> None:
+    logger.debug(f"Arguments received: {argv}")
+    assert len(sys.argv) == 4, f"Invalid number of arguments: {len(sys.argv)}"
+    catalog_name = sys.argv[0]
+    schema_name = sys.argv[1]
+    extract_location = sys.argv[2]
+    source_tech = sys.argv[3]
+    logger.info(f"Validating {source_tech} profiler extract located at '{extract_location}'.")
+    valid_extract = _validate_profiler_extract(extract_location)
+    if valid_extract:
+        _ingest_profiler_tables(catalog_name, schema_name, extract_location)
+    else:
+        raise ValueError("Corrupt or invalid profiler extract.")
+
+
+def _validate_profiler_extract(extract_location: str) -> bool:
+    logger.info("Validating the profiler extract file.")
+    validation_checks = []
+    try:
+        with duckdb.connect(database=extract_location) as duck_conn:
+            tables = duck_conn.execute("SHOW ALL TABLES").fetchall()
+            for table in tables:
+                fq_table_name = f"{table[0]}.{table[1]}.{table[2]}"
+                empty_check = EmptyTableValidationCheck(fq_table_name)
+                validation_checks.append(empty_check)
+            report = build_validation_report(validation_checks, duck_conn)
+    except duckdb.IOException as e:
+        logger.exception(f"Could not access the profiler extract: '{extract_location}'.")
+        raise e
+    except Exception as e:
+        logger.exception(f"Unable to validate the profiler extract: '{extract_location}'.")
+        raise e
+
+    if len(report) > 0:
+        report_errors = list(filter(lambda x: x.outcome == "FAIL" and x.severity == "ERROR", report))
+        num_errors = len(report_errors)
+        logger.info(f"There are {num_errors} validation errors in the profiler extract.")
+        for error in report_errors:
+            logger.info(error)
+    else:
+        raise ValueError("Profiler extract validation report is empty.")
+    return num_errors == 0
+
+
+def _ingest_profiler_tables(catalog_name: str, schema_name: str, extract_location: str) -> None:
+    try:
+        with duckdb.connect(database=extract_location) as duck_conn:
+            tables_to_ingest = duck_conn.execute("SHOW ALL TABLES").fetchall()
+    except duckdb.IOException as e:
+        logger.error(f"Could not access the profiler extract: '{extract_location}': {e}")
+        raise duckdb.IOException(f"Could not access the profiler extract: '{extract_location}'.") from e
+    except Exception as e:
+        logger.error(f"Unable to read tables from profiler extract: '{extract_location}': {e}")
+        raise e
+
+    if len(tables_to_ingest) == 0:
+        raise ValueError("Profiler extract contains no tables.")
+
+    successful_tables = []
+    unsuccessful_tables = []
+    for source_table in tables_to_ingest:
+        try:
+            fq_source_table_name = f"{source_table[0]}.{source_table[1]}.{source_table[2]}"
+            fq_delta_table_name = f"{catalog_name}.{schema_name}.{source_table[2]}"
+            logger.info(f"Ingesting profiler table: '{fq_source_table_name}'")
+            _ingest_table(extract_location, fq_source_table_name, fq_delta_table_name)
+            successful_tables.append(fq_source_table_name)
+        except (ValueError, IndexError, TypeError) as e:
+            logger.error(f"Failed to construct source and destination table names: {e}")
+            unsuccessful_tables.append(source_table)
+        except duckdb.Error as e:
+            logger.error(f"Failed to ingest table from profiler database: {e}")
+            unsuccessful_tables.append(source_table)
+    logger.info(f"Ingested {len(successful_tables)} tables from profiler extract.")
+    logger.info(",".join(successful_tables))
+    logger.info(f"Failed to ingest {len(unsuccessful_tables)} tables from profiler extract.")
+    logger.info(",".join(str(table) for table in unsuccessful_tables))
+
+
+def _ingest_table(extract_location: str, source_table_name: str, target_table_name: str) -> None:
+    """
+    Ingest a table from a DuckDB profiler extract into a managed Delta table in Unity Catalog.
+    """
+    try:
+        with duckdb.connect(database=extract_location, read_only=True) as duck_conn:
+            query = f"SELECT * FROM {source_table_name}"
+            pdf = duck_conn.execute(query).df()
+            # Save table as a managed Delta table in Unity Catalog
+            logger.info(f"Saving profiler table '{target_table_name}' to Unity Catalog.")
+            spark = SparkSession.builder.getOrCreate()
+            df = spark.createDataFrame(pdf)
+            df.write.format("delta").mode("overwrite").saveAsTable(target_table_name)
+    except duckdb.CatalogException as e:
+        logger.error(f"Could not find source table '{source_table_name}' in profiler extract: {e}")
+        raise duckdb.CatalogException(f"Could not find source table '{source_table_name}' in profiler extract.") from e
+    except duckdb.IOException as e:
+        logger.error(f"Could not access the profiler extract: '{extract_location}': {e}")
+        raise duckdb.IOException(f"Could not access the profiler extract: '{extract_location}'.") from e
+    except Exception as e:
+        logger.error(f"Unable to ingest table '{source_table_name}' from profiler extract: {e}")
+        raise e
+
+
+if __name__ == "__main__":
+    # Ensure that the ingestion job is being run on a Databricks cluster
+    if "DATABRICKS_RUNTIME_VERSION" not in os.environ:
+        raise SystemExit("The Lakebridge profiler ingestion job is only intended to run in a Databricks Runtime.")
+    main(*sys.argv)
diff --git a/src/databricks/labs/lakebridge/assessments/profiler_validator.py b/src/databricks/labs/lakebridge/assessments/profiler_validator.py
diff --git a/src/databricks/labs/lakebridge/deployment/job.py b/src/databricks/labs/lakebridge/deployment/job.py
diff --git a/tests/unit/deployment/test_job.py b/tests/unit/deployment/test_job.py