-
Notifications
You must be signed in to change notification settings - Fork 315
[Develop] Introduce Global Cleanup IAM Role for ParallelCluster Build-Image #6912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
[Develop] Introduce Global Cleanup IAM Role for ParallelCluster Build-Image #6912
Conversation
* Introduce ensure_cleanup_role() with revision tagging and idempotent creation / update logic. * Modify image_operations_controller to invoke ensure_cleanup_role when Build/Iam/CleanupLambdaRole is not provided and to fail fast on permission errors. * Refactor imagebuilder_stack to remove all per-stack cleanup-role logic and wire Lambda to the global role by default. * Update constants (role prefix / expected revision tag key & value).
IAM / cleanup role ------------------ * IamClient – add create_role, attach_role_policy, put_role_policy, tag_role * ensure_cleanup_role * switch from raw boto3 to AWSApi.instance().iam * 4-step safe update sequence documented, now only after the inline policy succeeds, set or bump the revision tag. * inline-policy builder now receives partition * Lambda VPCAccess managed policy is attached only when LambdaFunctionsVpcConfig exists in the config Tests & utils ------------- * deleted legacy test_imagebuilder_lambda_execution_role. * fixed the failing tests * added new unit tests to cover the new logic * update existing integration test to test the build-image stack can be fully self deleted
@@ -13,6 +13,7 @@ CHANGELOG | |||
|
|||
**BUG FIXES** | |||
- Fix an issue where Security Group validation failed when a rule contained both IPv4 ranges (IpRanges) and security group references (UserIdGroupPairs). | |||
- Fix a bug that caused build-image CloudFormation stacks fail to delete after images are successfully built. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after images are successfully built.
it fails despite the outcome of the build image. even if the build-image fails, the stack is not able to get auto-deletedfail to delete
I'd say fail to auto-delete. Otherwise it may seem that the user cannot even delete the stack manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Moved it to CHANGES.
- Introduce default global build-image cleanup IAM versioning roles to prevent build-image CloudFormation stacks from failing to be automatically deleted after images are either successfully built or fail to build.
# If LambdaFunctionsVpcConfig exists in the config, attach the AWS-managed LambdaVPCAccess policy | ||
has_lambda_functions_vpc_config = cfg_dict.get("DeploymentSettings", {}).get("LambdaFunctionsVpcConfig") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the lambda function vpc configuraitons?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See
aws-parallelcluster/cli/src/pcluster/templates/imagebuilder_stack.py
Lines 741 to 742 in 6d4e24d
if self.config.lambda_functions_vpc_config: | |
managed_lambda_policy.append(Fn.sub(LAMBDA_VPC_ACCESS_MANAGED_POLICY)) |
The cleanup role needs the AWSLambdaVPCAccessExecutionRole
when LambdaFunctionsVpcConfig exists in the config.
raw_cfg_str = build_image_request_content["imageConfiguration"] | ||
cfg_dict = yaml.safe_load(raw_cfg_str) or {} | ||
# If CleanupLambdaRole exists in the config, skip ensure_cleanup_role | ||
has_custom_cleanup_role = cfg_dict.get("Build", {}).get("Iam", {}).get("CleanupLambdaRole") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CodeStyle] Encapsulate the logic to check whether a custom role is set into a dedicated function and unit test it independently.
if not has_custom_cleanup_role: | ||
try: | ||
account_id = AWSApi.instance().sts.get_account_id() | ||
ensure_cleanup_role( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CodeStyle] ensure_cleanup_role
is a bit vague name for this function.
The name should reflect that iwe are talking about the default cleanup role and what cleanup role we are talking about
except AWSClientError as e: | ||
if e.error_code in ("AccessDenied", "AccessDeniedException", "UnauthorizedOperation"): | ||
raise BadRequestException( | ||
"Current principal lacks permissions to create or update the ParallelCluster build-image " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Design] Why do we need to update the role>? Didn't we agreed on having versioned roles specifically to not update the policy?
"Action": "sts:AssumeRole", | ||
"Condition": { | ||
"ArnLike": { | ||
"aws:SourceArn": f"arn:{partition}:lambda:*:{account_id}:function:ParallelClusterImage-*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Security] This condition allows every function in the stack to assume this role, but we want only a specific function to be able to. Can we use a more specific name?
try: | ||
resp = iam.get_role(role_name=role_name) | ||
tags = {t["Key"]: t["Value"] for t in resp["Role"].get("Tags", [])} | ||
current_revision = int(tags.get(CLEANUP_ROLE_REVISION_TAG_KEY, 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not using the revision number specified in PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my approach, create role with tag=0. And the expected revision is 1. So that it can enter the if else block if current_revision < PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION
to update the policy.
See my function description, in this approach, it can prevent tag updated but policy not match the expectation.
current_revision = int(tags.get(CLEANUP_ROLE_REVISION_TAG_KEY, 0)) | ||
except AWSClientError as e: | ||
if e.error_code == "NoSuchEntity": | ||
logging.info("Creating global cleanup role %s", role_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specify that we are creating it because it does not exists. May sound obvious but since this is a critical resource better to specify why we take some decisions
current_revision = 0 | ||
else: | ||
raise e | ||
# (re)attach AWSLambdaVPCAccessExecutionRole |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why re-attach?
iam.attach_role_policy(role_name, cleanup_role_basic_managed_policy) | ||
|
||
# Put (or overwrite) inline policy | ||
iam.put_role_policy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Design] Do we have a way to avoid policy updates? I thought that by using versioned roles we could do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we have, and it has been applied. See if current_revision < PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION:
and the function description for details.
Description of changes
Tests
References
Checklist
develop
add the branch name as prefix in the PR title (e.g.[release-3.6]
).Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.