Skip to content

Crawler: support for object ownership #2774

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 66 commits into from
Oct 9, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
66 commits
Select commit Hold shift + click to select a range
573fca2
Crawler support for object ownership.
asnare Oct 1, 2024
b8f7e69
Move the ownership code into its own module, and stub unit tests.
asnare Oct 3, 2024
07fa875
Skip users that don't have a user-name.
asnare Oct 3, 2024
28daa7e
Sort by the user-name attribute, not name.
asnare Oct 3, 2024
f4e247e
Materialize list earlier, to aid debugging.
asnare Oct 3, 2024
d0c22db
Documentation references for how administrators are marked for worksp…
asnare Oct 3, 2024
467f912
Ensure that unit tests reset the (class-level) cache before they start.
asnare Oct 3, 2024
33cb841
Fix mock workspace identifier to have the correct type.
asnare Oct 3, 2024
3a1868c
Trivial integration test for locating an administrator.
asnare Oct 3, 2024
ec23bb0
Start implementing unit tests for the Ownership component.
asnare Oct 3, 2024
b9dd2a3
Refactor fixture code for mocking accounts and groups.
asnare Oct 3, 2024
a6b46c1
Merge branch 'main' into inventory-object-owners
asnare Oct 3, 2024
57bf8c3
Revert plumbing the workspace client into CrawlerBase
asnare Oct 3, 2024
7db7aa0
More reverting.
asnare Oct 3, 2024
7676f7c
Whitespace.
asnare Oct 3, 2024
7010237
Implement more unit tests.
asnare Oct 3, 2024
d1e24eb
Refactor workspace/account admin lookup into separate components.
asnare Oct 3, 2024
9155e19
Update integration test for locating a workspace admin to test the lo…
asnare Oct 4, 2024
9980c20
Refactor unit tests for the ownership-related classes.
asnare Oct 4, 2024
fded489
Merge branch 'main' into inventory-object-owners
asnare Oct 4, 2024
53da23d
Deal with some comprehension issues.
asnare Oct 4, 2024
83044e8
Implement ownership for the ClusterInfo inventory class.
asnare Oct 4, 2024
ed38942
Docstring for cluster ownership.
asnare Oct 4, 2024
5d4c994
Check some mock interactions.
asnare Oct 4, 2024
348d9b0
Improve docstring clarity.
asnare Oct 4, 2024
5cf6f30
Fix unit test.
asnare Oct 4, 2024
8d4de1f
Suppress pylint false positive.
asnare Oct 4, 2024
4a597a2
Implement ownership for cluster policies.
asnare Oct 4, 2024
1818d46
Fix linting suppression.
asnare Oct 4, 2024
c13b4eb
Use runtime context for integration tests instead of installation con…
asnare Oct 4, 2024
7e66e70
Fix xfail marker for integration test.
asnare Oct 4, 2024
f7942aa
Implement ownership for grants.
asnare Oct 4, 2024
3cb9abf
Use a longer variable name.
asnare Oct 4, 2024
2120322
Whitespace.
asnare Oct 4, 2024
e2189f4
Ownership implementation for tables.
asnare Oct 4, 2024
0d8e48b
Ownership implementation of UDFs.
asnare Oct 4, 2024
28ab56a
Ensure fewer unnecessary mock interactions.
asnare Oct 4, 2024
cc7db1c
Ownership implementation for pipelines.
asnare Oct 4, 2024
0dd3f11
Merge branch 'main' into inventory-object-owners
asnare Oct 4, 2024
64048e1
Merge branch 'main' into inventory-object-owners
asnare Oct 7, 2024
8f0265f
Ownership implementation for Jobs.
asnare Oct 7, 2024
8b944b3
Remove the workspace client from the ownership initializer.
asnare Oct 7, 2024
86582ad
Ownership implementation for the table migration status records.
asnare Oct 7, 2024
b2e66f2
Integration test for table migration ownership.
asnare Oct 7, 2024
953ff62
Stubbed ownership implementation for direct filesystem access records.
asnare Oct 7, 2024
7037b6a
Remove unintentional comment.
asnare Oct 7, 2024
8282051
Type hint.
asnare Oct 7, 2024
3d769c6
Rename: admin_groups -> admin_group_ids
asnare Oct 7, 2024
3c0a5b4
Fix failing integration test.
asnare Oct 7, 2024
9b39e30
Revert a change from this PR.
asnare Oct 7, 2024
7029bf9
Merge branch 'main' into inventory-object-owners
asnare Oct 8, 2024
8d8191d
Simplify creator normalisation.
asnare Oct 8, 2024
94a601d
Rename method: _get_owner() -> _maybe_direct_owner()
asnare Oct 8, 2024
b627890
Simplify the code a bit for locating members of the 'admins' workspac…
asnare Oct 8, 2024
ae8d194
Replace a property (with expensive side-effects) with a getter method.
asnare Oct 8, 2024
a6b5da0
Avoid exposing the admin-finder on the ownership interface.
asnare Oct 8, 2024
33d9c13
Refactor a sequence of generator comprehensions into a for-loop for r…
asnare Oct 8, 2024
d677179
Merge branch 'main' into inventory-object-owners
asnare Oct 8, 2024
b68a6c4
Merge branch 'main' into inventory-object-owners
asnare Oct 9, 2024
c6de109
Remove documentation that the ownership classes report an admin user …
asnare Oct 9, 2024
4deaf93
Docstring improvements.
asnare Oct 9, 2024
47d5343
Fix incorrect term: result -> record
asnare Oct 9, 2024
d74c241
Remove property.
asnare Oct 9, 2024
736ecfb
Merge branch 'main' into inventory-object-owners
asnare Oct 9, 2024
409bdf9
Update some ownership integration tests to verify the complete admin …
asnare Oct 9, 2024
215a1be
Update integration test for cluster ownership.
asnare Oct 9, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 27 additions & 3 deletions src/databricks/labs/ucx/assessment/clusters.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@
)
from databricks.labs.ucx.assessment.init_scripts import CheckInitScriptMixin
from databricks.labs.ucx.framework.crawlers import CrawlerBase
from databricks.labs.ucx.framework.owners import Ownership
from databricks.labs.ucx.framework.utils import escape_sql_identifier

logger = logging.getLogger(__name__)
Expand All @@ -43,6 +44,7 @@ class ClusterInfo:
policy_id: str | None = None
cluster_name: str | None = None
creator: str | None = None
"""User-name of the creator of the cluster, if known."""


class CheckClusterMixin(CheckInitScriptMixin):
Expand Down Expand Up @@ -154,7 +156,8 @@ def _assess_clusters(self, all_clusters):
for cluster in all_clusters:
if cluster.cluster_source == ClusterSource.JOB:
continue
if not cluster.creator_user_name:
creator = cluster.creator_user_name or None
if not creator:
logger.warning(
f"Cluster {cluster.cluster_id} have Unknown creator, it means that the original creator "
f"has been deleted and should be re-created"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the User be recreated or the cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cluster should either have its owner changed (there is an API for this) or be recreated.

Interestingly there's a mix of terminology here: the property here is creator_user_name but the REST API for changing it is …/clusters/change-owner, and such requests have a owner_username property.

Expand All @@ -164,7 +167,7 @@ def _assess_clusters(self, all_clusters):
cluster_name=cluster.cluster_name,
policy_id=cluster.policy_id,
spark_version=cluster.spark_version,
creator=cluster.creator_user_name,
creator=creator,
success=1,
failures="[]",
)
Expand All @@ -179,6 +182,16 @@ def _try_fetch(self) -> Iterable[ClusterInfo]:
yield ClusterInfo(*row)


class ClusterOwnership(Ownership[ClusterInfo]):
"""Determine ownership of clusters in the inventory.

This is the cluster creator (if known).
"""

def _maybe_direct_owner(self, record: ClusterInfo) -> str | None:
return record.creator


@dataclass
class PolicyInfo:
policy_id: str
Expand All @@ -188,6 +201,7 @@ class PolicyInfo:
spark_version: str | None = None
policy_description: str | None = None
creator: str | None = None
"""User-name of the creator of the cluster policy, if known."""


class PoliciesCrawler(CrawlerBase[PolicyInfo], CheckClusterMixin):
Expand All @@ -210,7 +224,7 @@ def _assess_policies(self, all_policices) -> Iterable[PolicyInfo]:
except KeyError:
spark_version = None
policy_name = policy.name
creator_name = policy.creator_user_name
creator_name = policy.creator_user_name or None

policy_info = PolicyInfo(
policy_id=policy.policy_id,
Expand All @@ -229,3 +243,13 @@ def _assess_policies(self, all_policices) -> Iterable[PolicyInfo]:
def _try_fetch(self) -> Iterable[PolicyInfo]:
for row in self._fetch(f"SELECT * FROM {escape_sql_identifier(self.full_name)}"):
yield PolicyInfo(*row)


class ClusterPolicyOwnership(Ownership[PolicyInfo]):
"""Determine ownership of cluster policies in the inventory.

This is the creator of the cluster policy (if known).
"""

def _maybe_direct_owner(self, record: PolicyInfo) -> str | None:
return record.creator
17 changes: 15 additions & 2 deletions src/databricks/labs/ucx/assessment/jobs.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
from databricks.labs.ucx.assessment.clusters import CheckClusterMixin
from databricks.labs.ucx.assessment.crawlers import spark_version_compatibility
from databricks.labs.ucx.framework.crawlers import CrawlerBase
from databricks.labs.ucx.framework.owners import Ownership
from databricks.labs.ucx.framework.utils import escape_sql_identifier

logger = logging.getLogger(__name__)
Expand All @@ -37,6 +38,7 @@ class JobInfo:
failures: str
job_name: str | None = None
creator: str | None = None
"""User-name of the creator of the pipeline, if known."""


class JobsMixin:
Expand Down Expand Up @@ -106,7 +108,8 @@ def _prepare(all_jobs) -> tuple[dict[int, set[str]], dict[int, JobInfo]]:
if not job.job_id:
continue
job_assessment[job.job_id] = set()
if not job.creator_user_name:
creator_user_name = job.creator_user_name or None
if not creator_user_name:
logger.warning(
f"Job {job.job_id} have Unknown creator, it means that the original creator has been deleted "
f"and should be re-created"
Expand All @@ -122,7 +125,7 @@ def _prepare(all_jobs) -> tuple[dict[int, set[str]], dict[int, JobInfo]]:
job_details[job.job_id] = JobInfo(
job_id=str(job.job_id),
job_name=job_name,
creator=job.creator_user_name,
creator=creator_user_name,
success=1,
failures="[]",
)
Expand All @@ -140,6 +143,16 @@ def _check_jar_task(self, all_task: list[RunTask]) -> list[str]:
return task_failures


class JobOwnership(Ownership[JobInfo]):
"""Determine ownership of jobs (workflows) in the inventory.

This is the job creator (if known).
"""

def _maybe_direct_owner(self, record: JobInfo) -> str | None:
return record.creator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about run_as?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. This is not being gathered by the crawler, which means either: a) updating the crawler to include it; b) doing it here.

Wherever this is done, it looks more complicated than at first glance: the run_as property is documented on the SDK as write-only. I'll do some tests to see if this is really the case or not, but if that's correct then it needs to be found somewhere else and I'm not (yet) sure where?

Either way I'd suggest covering this as a separate PR.



@dataclass
class SubmitRunInfo:
run_ids: str # JSON-encoded list of run ids
Expand Down
17 changes: 15 additions & 2 deletions src/databricks/labs/ucx/assessment/pipelines.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@

from databricks.labs.ucx.assessment.clusters import CheckClusterMixin
from databricks.labs.ucx.framework.crawlers import CrawlerBase
from databricks.labs.ucx.framework.owners import Ownership
from databricks.labs.ucx.framework.utils import escape_sql_identifier

logger = logging.getLogger(__name__)
Expand All @@ -20,6 +21,7 @@ class PipelineInfo:
failures: str
pipeline_name: str | None = None
creator_name: str | None = None
"""User-name of the creator of the pipeline, if known."""


class PipelinesCrawler(CrawlerBase[PipelineInfo], CheckClusterMixin):
Expand All @@ -33,15 +35,16 @@ def _crawl(self) -> Iterable[PipelineInfo]:

def _assess_pipelines(self, all_pipelines) -> Iterable[PipelineInfo]:
for pipeline in all_pipelines:
if not pipeline.creator_user_name:
creator_name = pipeline.creator_user_name or None
if not creator_name:
logger.warning(
f"Pipeline {pipeline.name} have Unknown creator, it means that the original creator "
f"has been deleted and should be re-created"
)
pipeline_info = PipelineInfo(
pipeline_id=pipeline.pipeline_id,
pipeline_name=pipeline.name,
creator_name=pipeline.creator_user_name,
creator_name=creator_name,
success=1,
failures="[]",
)
Expand Down Expand Up @@ -73,3 +76,13 @@ def _pipeline_clusters(self, clusters, failures):
def _try_fetch(self) -> Iterable[PipelineInfo]:
for row in self._fetch(f"SELECT * FROM {escape_sql_identifier(self.full_name)}"):
yield PipelineInfo(*row)


class PipelineOwnership(Ownership[PipelineInfo]):
"""Determine ownership of pipelines in the inventory.

This is the pipeline creator (if known).
"""

def _maybe_direct_owner(self, record: PipelineInfo) -> str | None:
return record.creator_name
5 changes: 5 additions & 0 deletions src/databricks/labs/ucx/contexts/application.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@
from databricks.labs.ucx.assessment.export import AssessmentExporter
from databricks.labs.ucx.aws.credentials import CredentialManager
from databricks.labs.ucx.config import WorkspaceConfig
from databricks.labs.ucx.framework.owners import AdministratorLocator
from databricks.labs.ucx.hive_metastore import ExternalLocations, Mounts, TablesCrawler
from databricks.labs.ucx.hive_metastore.catalog_schema import CatalogSchema
from databricks.labs.ucx.hive_metastore.grants import (
Expand Down Expand Up @@ -514,6 +515,10 @@ def migration_recon(self) -> MigrationRecon:
self.config.recon_tolerance_percent,
)

@cached_property
def administrator_locator(self) -> AdministratorLocator:
return AdministratorLocator(self.workspace_client)


class CliContext(GlobalContext, abc.ABC):
@cached_property
Expand Down
2 changes: 1 addition & 1 deletion src/databricks/labs/ucx/contexts/workflow_task.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,7 @@ def pipelines_crawler(self) -> PipelinesCrawler:

@cached_property
def table_size_crawler(self) -> TableSizeCrawler:
return TableSizeCrawler(self.sql_backend, self.inventory_database, self.config.include_databases)
return TableSizeCrawler(self.tables_crawler)

@cached_property
def policies_crawler(self) -> PoliciesCrawler:
Expand Down
2 changes: 1 addition & 1 deletion src/databricks/labs/ucx/framework/crawlers.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ class DataclassInstance(Protocol):


class CrawlerBase(ABC, Generic[Result]):
def __init__(self, backend: SqlBackend, catalog: str, schema: str, table: str, klass: type[Result]):
def __init__(self, backend: SqlBackend, catalog: str, schema: str, table: str, klass: type[Result]) -> None:
"""
Initializes a CrawlerBase instance.

Expand Down
Loading
Loading