Add Profiler Extract Ingestion Job for Local Dashboards #2101
base: main
Conversation
        )
    ],
    "tasks": [
        NotebookTask(
@sundarshankar89's comment from PR#2000: "There are 2 ways we can implement this: have the ingestion job as a python package and use a wheel task, or have the notebook uploaded and then run the jobs. I prefer option 1."
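For context, a wheel-based task along the lines of option 1 can be defined with the Databricks SDK roughly as below; the package name, entry point, and parameters are illustrative, not the values this PR ends up using:

```python
# Sketch of option 1: run the ingestion job as a python_wheel_task.
# Package name, entry point, and parameters are assumptions for illustration.
from databricks.sdk.service import compute, jobs

def profiler_ingestion_task(task_key: str, wheel_path: str) -> jobs.Task:
    return jobs.Task(
        task_key=task_key,
        description="Ingest profiler extracts into Unity Catalog",
        python_wheel_task=jobs.PythonWheelTask(
            package_name="databricks_labs_lakebridge",  # assumed package name
            entry_point="profiler-ingest",              # hypothetical entry point
            parameters=["my_catalog", "my_schema", "/path/to/extract.db"],
        ),
        libraries=[compute.Library(whl=wheel_path)],
    )
```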
✅ 46/46 passed, 5 flaky, 2m50s total
Flaky tests:
Running from acceptance #2793
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

| Coverage Diff | main   | #2101  | +/-    |
|---------------|--------|--------|--------|
| Coverage      | 64.87% | 65.03% | +0.15% |
| Files         | 96     | 96     |        |
| Lines         | 7895   | 7933   | +38    |
| Branches      | 821    | 823    | +2     |
| Hits          | 5122   | 5159   | +37    |
| Misses        | 2593   | 2594   | +1     |
| Partials      | 180    | 180    |        |

☔ View full report in Codecov by Sentry.
def _job_profiler_ingestion_task(self, task_key: str, description: str, lakebridge_wheel_path: str) -> Task:
    libraries = [
        compute.Library(whl=lakebridge_wheel_path),
        compute.PythonPyPiLibrary(package="duckdb")
The ingestion job depends on the duckdb library to read the profiler extract tables.
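For readers unfamiliar with the extract format, reading it with duckdb looks roughly like this; the file path is an assumption for illustration:

```python
# Sketch: a profiler extract is a duckdb database file; list its tables and
# read each one into pandas. The path below is illustrative.
import duckdb

with duckdb.connect(database="/tmp/profiler_extract.db", read_only=True) as conn:
    # Each SHOW ALL TABLES row is (database, schema, name, column_names, column_types, temporary)
    for row in conn.execute("SHOW ALL TABLES").fetchall():
        table_name = row[2]
        df = conn.execute(f'SELECT * FROM "{table_name}"').fetch_df()  # pandas DataFrame
        print(f"{table_name}: {len(df)} rows")
```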
def main(*argv) -> None:
    logger.debug(f"Arguments received: {argv}")
    assert len(sys.argv) == 4, f"Invalid number of arguments: {len(sys.argv)}"
"Manually" testing this main() function outside of the wheel file, there appeared to be 3 additional arguments pertaining to the Python notebook session: 1) Interpreter, 2) -f flag, 3) env settings as a JSON file. Please review that the assumption that they will not be present in a wheel based job task is correct.
- Add unit test for profiler ingestion job deployment.
- Update profiler ingestion job to be wheel-based.
- Update profiler ingestion job unit tests.
- Update docs (#2110)
- Update docs
- Update docs (#2111)
- Update docs again
- Add table ingestion logic to profiler ingest job.
- Update table ingestion exception handling.
- Add duckdb dependency in Job Task definition.
- Correct library dependencies.
- Update entry point name to .
- Narrow exception handling for single table ingestion.
- Parse args in execute main method.
LGTM
Small nit: I'm thinking about how to make this more generic for all sources, so maintaining a static list will help here.
def _ingest_profiler_tables(catalog_name: str, schema_name: str, extract_location: str) -> None:
    try:
        with duckdb.connect(database=extract_location) as duck_conn:
            tables_to_ingest = duck_conn.execute("SHOW ALL TABLES").fetchall()
I would rather have a static list of tables and validate that they are all present; otherwise raise a warning or log them in an audit table.
Also, I would like to maintain a run summary table stating which tables and how many records were ingested, for better reconciliation and observability.
On the second part, do you mean a physical table stored in a UC catalog? I keep track of the successfully ingested and skipped tables on lines 71-72 and log them on lines 86-89.
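To make the suggestion above concrete, validation against a static table list plus a run summary could look roughly like this; the expected table names and the summary shape are hypothetical:

```python
# Sketch of the reviewer's suggestion: validate the extract against a static
# list of expected tables and build a per-table run summary. Table names and
# the summary shape are hypothetical.
import duckdb

EXPECTED_TABLES = {"databases", "schemas", "objects", "columns"}  # hypothetical

def ingest_with_summary(extract_location: str) -> list[dict]:
    summary = []
    with duckdb.connect(database=extract_location, read_only=True) as conn:
        found = {row[2] for row in conn.execute("SHOW ALL TABLES").fetchall()}
        for missing in sorted(EXPECTED_TABLES - found):
            summary.append({"table": missing, "status": "missing", "rows": 0})
        for table in sorted(EXPECTED_TABLES & found):
            rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            summary.append({"table": table, "status": "ingested", "rows": rows})
    # summary could then be persisted to an audit/run-summary table in UC
    return summary
```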
Changes
What does this PR do?
Adds a new function to the common job deployer to install the local ingestion job. The job transforms profiler extracts into Unity Catalog–managed tables in the user’s local Databricks workspace, enabling the profiler summary (“local”) dashboards.
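The core transformation can be pictured roughly as below: copy each table out of the duckdb extract and save it as a Unity Catalog managed table. This is a sketch, not the PR's exact implementation; pandas as the interchange format and all names are illustrative:

```python
# Rough sketch of the ingestion step: one duckdb extract table -> one UC
# managed table. Uses pandas as the interchange format; names are illustrative.
import duckdb
from pyspark.sql import SparkSession

def ingest_table(spark: SparkSession, extract_location: str,
                 catalog: str, schema: str, table: str) -> None:
    with duckdb.connect(database=extract_location, read_only=True) as conn:
        pdf = conn.execute(f'SELECT * FROM "{table}"').fetch_df()
    (spark.createDataFrame(pdf)
        .write.mode("overwrite")
        .saveAsTable(f"{catalog}.{schema}.{table}"))
```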
Relevant implementation details
install_stateisn’t lost between create/update and save, especially if an exception is raised before the save.Caveats/things to watch out for when reviewing:
Linked issues
This PR complements PR#2000.
Functionality
- `databricks labs lakebridge ...`

Tests