[Develop] Introduce Global Cleanup IAM Role for ParallelCluster Build-Image #6912

Merged
Changes from 8 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ CHANGELOG
- Support DCV on Amazon Linux 2023.
- Upgrade Python runtime used by Lambda functions to python3.12 (from python3.9).
- Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
- Introduce default global build-image cleanup IAM versioning roles to prevent build-image CloudFormation stacks from failing to be automatically deleted after images are either successfully built or fail to build.
Contributor:

We should capture the following details in the changelog entry:

  1. the role is long-living (not bound to the stack lifecycle) and is meant to persist beyond the stack deletion.
  2. the role name: since the role is meant to persist, it's important to communicate its name so that users do not think it is a leftover when they see it.
  3. since this is a long-standing issue in pcluster, it may be good to link the issue we are solving.

The fact that the role is versioned is an implementation detail, more than something related to the user experience, so we can skip that detail.

The build-image command now deploys a global role that is used to automatically delete the build-image stack after images either succeed or fail the build. The role is meant to exist even after the stack has been deleted. This is to prevent the build-image stack deletion failures reported in https://github.com/aws/aws-parallelcluster/issues/5914

Note: since the changelog entry is long, use multiple lines to write it.

Contributor Author:

Thanks for the advice! Applied.

Contributor Author:

You mentioned

the role name: since the role is meant to persist it's important to communicate its name so that users do not think it is a left over when they see it.

But I don't see it in your suggested line. Should I add it?


**BUG FIXES**
- Fix an issue where Security Group validation failed when a rule contained both IPv4 ranges (IpRanges) and security group references (UserIdGroupPairs).
@@ -10,6 +10,8 @@
import logging
import os as os_lib

import yaml

from pcluster.api.controllers.common import (
    assert_supported_operation,
    configure_aws_region,
@@ -52,14 +54,15 @@
from pcluster.aws.common import AWSClientError
from pcluster.aws.ec2 import Ec2Client
from pcluster.constants import SUPPORTED_ARCHITECTURES, SUPPORTED_OSES, Operation
from pcluster.imagebuilder_utils import ensure_default_build_image_stack_cleanup_role
from pcluster.models.imagebuilder import (
    BadRequestImageBuilderActionError,
    ConfigValidationError,
    ImageBuilder,
    NonExistingImageError,
)
from pcluster.models.imagebuilder_resources import ImageBuilderStack, NonExistingStackError
from pcluster.utils import get_installed_version, to_utc_datetime
from pcluster.utils import get_installed_version, get_partition, to_utc_datetime
from pcluster.validators.common import FailureLevel

LOGGER = logging.getLogger(__name__)
@@ -105,6 +108,29 @@ def build_image(
validation_failure_level = validation_failure_level or ValidationLevel.ERROR
dryrun = dryrun or False

    raw_cfg_str = build_image_request_content["imageConfiguration"]
    cfg_dict = yaml.safe_load(raw_cfg_str) or {}
    # If CleanupLambdaRole exists in the config, skip ensure_default_build_image_stack_cleanup_role
    has_custom_cleanup_role = cfg_dict.get("Build", {}).get("Iam", {}).get("CleanupLambdaRole")
Contributor:

[CodeStyle] Encapsulate the logic to check whether a custom role is set into a dedicated function and unit test it independently.

Contributor Author:

I don't think it's necessary. It's common nested-lookup logic, and whether it digs into the config correctly is already covered by the unit test test_enable_cleanup_role_call_and_vpc_flag.
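For illustration, the encapsulation the reviewer suggests could look like the following minimal sketch. The helper name `has_custom_cleanup_role` is hypothetical; in the PR the dict comes from `yaml.safe_load` of the `imageConfiguration` string, but plain dicts are used here to keep the sketch self-contained:

```python
def has_custom_cleanup_role(cfg_dict: dict) -> bool:
    """Return True if the build-image config sets Build/Iam/CleanupLambdaRole."""
    # Each .get falls back to an empty dict so missing sections never raise.
    return bool(cfg_dict.get("Build", {}).get("Iam", {}).get("CleanupLambdaRole"))


# In practice cfg_dict would be yaml.safe_load(raw_cfg_str) or {}
cfg_with_role = {"Build": {"Iam": {"CleanupLambdaRole": "arn:aws:iam::123456789012:role/MyCleanupRole"}}}
assert has_custom_cleanup_role(cfg_with_role)
assert not has_custom_cleanup_role({})
assert not has_custom_cleanup_role({"Build": {"Iam": {}}})
```

A dedicated helper like this can be unit-tested on its own without invoking the full build_image flow.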

    # If LambdaFunctionsVpcConfig exists in the config, attach the AWS-managed LambdaVPCAccess policy
    has_lambda_functions_vpc_config = cfg_dict.get("DeploymentSettings", {}).get("LambdaFunctionsVpcConfig")
Contributor:

Why do we need the Lambda function VPC configurations?

Contributor Author (@hehe7318, Jul 15, 2025):

See

if self.config.lambda_functions_vpc_config:
    managed_lambda_policy.append(Fn.sub(LAMBDA_VPC_ACCESS_MANAGED_POLICY))

The cleanup role needs the AWSLambdaVPCAccessExecutionRole when LambdaFunctionsVpcConfig exists in the config.

Contributor:

Makes sense, thank you.
Nitpick: we need this line only if the user is using the default role, so you can move the line into the if branch below.

Contributor Author (@hehe7318, Jul 17, 2025):

You are right, thanks for catching it. Done.


    if not has_custom_cleanup_role:
        try:
            account_id = AWSApi.instance().sts.get_account_id()
Contributor:

If this sts call fails, we would return an error message saying that the principal is missing the permissions to create/update the role. Make sure to add sts permissions to our documentation.

Contributor Author:

We are using the same call in other places when doing build-image. So no need to add it.

And ref: https://docs.aws.amazon.com/STS/latest/APIReference/API_GetCallerIdentity.html

GetCallerIdentity
Returns details about the IAM user or role whose credentials are used to call the operation.
Note
No permissions are required to perform this operation. If an administrator attaches a policy to your identity that explicitly denies access to the sts:GetCallerIdentity action, you can still perform this operation. Permissions are not required because the same information is returned when access is denied. To view an example response, see I Am Not Authorized to Perform: iam:DeleteVirtualMFADevice in the IAM User Guide.

            ensure_default_build_image_stack_cleanup_role(
                account_id, get_partition(), attach_vpc_access_policy=bool(has_lambda_functions_vpc_config)
            )
        except AWSClientError as e:
            if e.error_code in ("AccessDenied", "AccessDeniedException", "UnauthorizedOperation"):
                raise BadRequestException(
                    "Current principal lacks permissions to create or update the ParallelCluster build-image "
Contributor:

[Design] Why do we need to update the role? Didn't we agree on having versioned roles specifically to not update the policy?

Contributor Author:

Reached an agreement yesterday. Done. Changed the logic to create roles with a revision in their names.

                    "cleanup IAM role. "
                    "Either pass `Build/Iam/CleanupLambdaRole` or grant the missing permissions to continue. "
                    "For detailed instructions, please refer to our public documentation."
                )
            raise

    build_image_request_content = BuildImageRequestContent.from_dict(build_image_request_content)

    try:
20 changes: 20 additions & 0 deletions cli/src/pcluster/aws/iam.py
@@ -32,3 +32,23 @@ def get_role(self, role_name):
    def get_instance_profile(self, instance_profile_name):
        """Get instance profile information."""
        return self._client.get_instance_profile(InstanceProfileName=instance_profile_name)

    @AWSExceptionHandler.handle_client_exception
    def create_role(self, **kwargs):
        """Create IAM role."""
        return self._client.create_role(**kwargs)

    @AWSExceptionHandler.handle_client_exception
    def attach_role_policy(self, role_name, policy_arn):
        """Attach a managed policy to the given role."""
        return self._client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

    @AWSExceptionHandler.handle_client_exception
    def put_role_policy(self, role_name, policy_name, policy_document):
        """Create or replace the specified inline policy on a role."""
        return self._client.put_role_policy(RoleName=role_name, PolicyName=policy_name, PolicyDocument=policy_document)

    @AWSExceptionHandler.handle_client_exception
    def tag_role(self, role_name, tags):
        """Add or overwrite one or more tags for the specified role."""
        return self._client.tag_role(RoleName=role_name, Tags=tags)
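These wrappers forward keyword arguments straight to boto3, where a common pitfall is that `put_role_policy` expects `PolicyDocument` as a JSON-serialised string rather than a dict. A self-contained sketch with a recording stand-in client (the `RecordingIamClient` class is an illustration, not part of the PR):

```python
import json


class RecordingIamClient:
    """Minimal stand-in for the boto3 IAM client, recording calls (illustration only)."""

    def __init__(self):
        self.calls = []

    def put_role_policy(self, RoleName, PolicyName, PolicyDocument):
        # boto3 expects PolicyDocument as a JSON-serialised string, not a dict;
        # json.loads raises if something non-JSON (e.g. str(dict)) sneaks through.
        json.loads(PolicyDocument)
        self.calls.append(("put_role_policy", RoleName, PolicyName))
        return {}


client = RecordingIamClient()
doc = json.dumps({"Version": "2012-10-17", "Statement": []})
client.put_role_policy(RoleName="DemoRole", PolicyName="DemoInline", PolicyDocument=doc)
assert client.calls == [("put_role_policy", "DemoRole", "DemoInline")]
```

This is why the diff below serialises the inline policy with `json.dumps` before calling the wrapper.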
5 changes: 5 additions & 0 deletions cli/src/pcluster/constants.py
@@ -335,3 +335,8 @@ class Operation(Enum):
PCLUSTER_BUCKET_PROTECTED_FOLDER = "parallelcluster"
PCLUSTER_BUCKET_PROTECTED_PREFIX = f"{PCLUSTER_BUCKET_PROTECTED_FOLDER}/"
PCLUSTER_BUCKET_REQUIRED_BOOTSTRAP_FEATURES = ["basic", "export-logs"]

PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_PREFIX = "PClusterBuildImageCleanupRole"
# Tag key & expected revision (increment when policy widens)
PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_REVISION = 1
PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_BOOTSTRAP_TAG_KEY = "parallelcluster:build-image-cleanup-role-bootstrapped"
183 changes: 182 additions & 1 deletion cli/src/pcluster/imagebuilder_utils.py
@@ -8,12 +8,21 @@
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
import os

import yaml

from pcluster.aws.aws_api import AWSApi
from pcluster.utils import get_url_scheme, yaml_load
from pcluster.aws.common import AWSClientError
from pcluster.constants import (
    IAM_ROLE_PATH,
    PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_BOOTSTRAP_TAG_KEY,
    PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_PREFIX,
    PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_REVISION,
)
from pcluster.utils import generate_string_hash, get_url_scheme, yaml_load

ROOT_VOLUME_TYPE = "gp3"
PCLUSTER_RESERVED_VOLUME_SIZE = 37
@@ -65,3 +74,175 @@ def _generate_action(action_name, commands):
"""Generate action in imagebuilder components."""
action = {"name": action_name, "action": "ExecuteBash", "inputs": {"commands": [commands]}}
return action


def get_cleanup_role_name(account_id: str) -> str:
    """Return the role name including a revision number."""
    hashed_account_id = generate_string_hash(account_id)
    return (
        f"{PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_PREFIX}-{hashed_account_id}"
Contributor:

This role is supposed not to be deleted or modified by the user.
What about adding a DO-NOT-DELETE suffix in the role name?
This is a standard approach that we also adopted for the S3 bucket, which has a similar lifecycle.

What about PClusterBuildImageCleanupRole-v{revision}-{hash}-do-not-delete?
If it is too long, we can use PClusterImageCleanupRole-v{revision}-{hash}-do-not-delete.

Contributor Author:

I think they are different. Deleting the S3 bucket impacts the running cluster because it stores files required by the cluster. But deleting this role and running build-image again creates a new one without any impact.

Also, customers can customize the role policy. We added a bootstrapped tag so that once the role is bootstrapped we will not update/override it again. It makes sense that customers who customize it take responsibility for it. If their customization breaks it, they can simply delete it and let pcluster create a new one.

Contributor (@gmarciani, Jul 18, 2025):

They are not so different: if a user deletes the role after submitting the build-image, all the ongoing build-image stacks are going to fail their deletion. The bucket as well gets recreated if the user deletes it; nevertheless we decided to include such a suffix to make sure users do not delete it.

A valid counter-argument may be: this is true in general for whatever resource is created by pcluster, so why add do-not-delete to this role and not to other things?
And the answer would be: because it is a resource decoupled from the lifecycle of the stacks, the same way the bucket is.

As an alternative approach we can avoid the do-not-delete suffix in the role name and specify it in the description of the role. With buckets we do not have such a possibility, but with roles we do. Now I'm thinking that this is even better, so that we can communicate the requirement while keeping the role name shorter.

Contributor Author:

I agree with adding it in the role description. But I mean: after the bucket is re-created, the files cannot be restored, so the cluster keeps failing. But if the role is re-created, everything runs correctly.

And the stack can be auto-deleted after the image succeeds or fails to build. So only if the cleanup role is deleted while the stack is still there will the stack fail to be deleted, and that's the customer's responsibility. I don't think customers will do that while the stack hasn't been deleted, because it's very clear from the role name that it's being used to clean up the build-image resources.

I agree with adding "do not delete" in the description, but it should say: do not delete it while a build-image is ongoing, or you will see the stack fail to be deleted.

Contributor:

Agreed: let's put a note in the description, but let's keep it simple: simply say that the role must not be deleted, without enumerating all the scenarios.

        f"-revision-{PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_REVISION}"
    )
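The resulting name can be sketched end to end. pcluster's `generate_string_hash` is internal, so the truncated SHA-256 digest below is an assumption used only to make the sketch runnable:

```python
import hashlib

PREFIX = "PClusterBuildImageCleanupRole"  # PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_PREFIX
REVISION = 1  # PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_REVISION


def cleanup_role_name(account_id: str) -> str:
    # Assumption: generate_string_hash's exact algorithm is internal to pcluster;
    # a SHA-256 digest truncated to 12 hex chars stands in for it here.
    digest = hashlib.sha256(account_id.encode()).hexdigest()[:12]
    return f"{PREFIX}-{digest}-revision-{REVISION}"


name = cleanup_role_name("123456789012")
assert name.startswith("PClusterBuildImageCleanupRole-")
assert name.endswith("-revision-1")
assert len(name) <= 64  # IAM role names are capped at 64 characters
```

Embedding the revision in the name means a policy widening ships as a brand-new role rather than an in-place update, which is the design agreed on in the thread above.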


def _expected_inline_policy(account_id: str, partition: str):
    """Return the inline policy document (JSON-serialised string)."""
    return json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": ["iam:DetachRolePolicy", "iam:DeleteRole", "iam:DeleteRolePolicy"],
                    "Resource": f"arn:{partition}:iam::{account_id}:role/parallelcluster/*",
                    "Effect": "Allow",
                },
                {
                    "Action": ["iam:DeleteInstanceProfile", "iam:RemoveRoleFromInstanceProfile"],
                    "Resource": f"arn:{partition}:iam::{account_id}:instance-profile/parallelcluster/*",
                    "Effect": "Allow",
                },
                {
                    "Action": "imagebuilder:DeleteInfrastructureConfiguration",
                    "Resource": f"arn:{partition}:imagebuilder:*:{account_id}:infrastructure-configuration/"
                    f"parallelclusterimage-*",
                    "Effect": "Allow",
                },
                {
                    "Action": ["imagebuilder:DeleteComponent"],
                    "Resource": [f"arn:{partition}:imagebuilder:*:{account_id}:component/parallelclusterimage-*/*"],
                    "Effect": "Allow",
                },
                {
                    "Action": "imagebuilder:DeleteImageRecipe",
                    "Resource": f"arn:{partition}:imagebuilder:*:{account_id}:image-recipe/parallelclusterimage-*/*",
                    "Effect": "Allow",
                },
                {
                    "Action": "imagebuilder:DeleteDistributionConfiguration",
                    "Resource": f"arn:{partition}:imagebuilder:*:{account_id}:distribution-configuration/"
                    f"parallelclusterimage-*",
                    "Effect": "Allow",
                },
                {
                    "Action": ["imagebuilder:DeleteImage", "imagebuilder:GetImage", "imagebuilder:CancelImageCreation"],
                    "Resource": f"arn:{partition}:imagebuilder:*:{account_id}:image/parallelclusterimage-*/*",
                    "Effect": "Allow",
                },
                {
                    "Action": "cloudformation:DeleteStack",
Contributor:

[Security] Can we scope this down using conditions on tags? For example, we know that every pcluster stack will have at least the pcluster version tag.

Contributor Author:

Ref: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements_condition_operators.html
I changed it to:

                {
                    "Action": "cloudformation:DeleteStack",
                    "Resource": f"arn:{partition}:cloudformation:*:{account_id}:stack/*/*",
                    "Condition": {
                        "ForAnyValue:StringLike": {
                            "cloudformation:ResourceTag/parallelcluster:image_id": "*"
                        }
                    },
                    "Effect": "Allow"
                },

Testing if it works.

                    "Resource": f"arn:{partition}:cloudformation:*:{account_id}:stack/*/*",
                    "Condition": {
                        "ForAnyValue:StringLike": {"cloudformation:ResourceTag/parallelcluster:image_id": "*"}
                    },
                    "Effect": "Allow",
                },
                # The below two permissions are required for the DeleteStackFunction Lambda to tag the
                # created AMI with 'parallelcluster:build_status' and 'parallelcluster:parent_image' tags
                {"Action": "ec2:CreateTags", "Resource": f"arn:{partition}:ec2:*::image/*", "Effect": "Allow"},
                {"Action": "tag:TagResources", "Resource": f"arn:{partition}:ec2:*::image/*", "Effect": "Allow"},
                {
                    "Action": ["lambda:DeleteFunction", "lambda:RemovePermission"],
                    "Resource": f"arn:{partition}:lambda:*:{account_id}:function:ParallelClusterImage-*",
                    "Effect": "Allow",
                },
                {
                    "Action": "logs:DeleteLogGroup",
                    "Resource": f"arn:{partition}:logs:*:{account_id}:log-group:/aws/lambda/ParallelClusterImage-*:*",
                    "Effect": "Allow",
                },
                {
                    "Action": [
                        "SNS:GetTopicAttributes",
                        "SNS:DeleteTopic",
                        "SNS:GetSubscriptionAttributes",
                        "SNS:Unsubscribe",
                    ],
                    "Resource": f"arn:{partition}:sns:*:{account_id}:ParallelClusterImage-*",
                    "Effect": "Allow",
                },
            ],
        }
    )


def ensure_default_build_image_stack_cleanup_role(
    account_id: str, partition="aws", attach_vpc_access_policy: bool = False
) -> str:
    """
    Ensure the global (account-wide) cleanup role exists and is at the expected revision.

    The function follows a safe order:
    1. If the role does not exist, create it without the bootstrapped tag.
    2. If LambdaFunctionsVpcConfig exists in the config, attach the AWS-managed LambdaVPCAccess policy.
    3. Attach the AWS-managed Lambda basic policy.
    4. Update/write the inline policy (least-privilege cleanup policy).
    5. Only after the inline policy succeeds, set the bootstrapped tag.

    This way, if step 2, 3 or 4 fails (e.g., lack of iam:PutRolePolicy permission),
    future invocations will keep retrying.
    """
    iam = AWSApi.instance().iam
    role_name = get_cleanup_role_name(account_id)
    role_arn = f"arn:{partition}:iam::{account_id}:role{IAM_ROLE_PATH}{role_name}"

    # Assume-role trust policy
    assume_doc = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "ArnLike": {
                        "aws:SourceArn": f"arn:{partition}:lambda:*:{account_id}:function:ParallelClusterImage-*"
Contributor:

[Security] This condition allows every function in the stack to assume this role, but we want only a specific function to be able to. Can we use a more specific name?

Contributor Author (@hehe7318, Jul 16, 2025):

No we can't, the delete stack lambda function is named like "ParallelClusterImage-15a09820-6274-11f0-8162-0e903a4ee48d".

                    }
                },
            }
        ],
    }
    # Check whether the role already exists
    try:
        resp = iam.get_role(role_name=role_name)
        tags = {t["Key"]: t["Value"] for t in resp["Role"].get("Tags", [])}
        already_bootstrapped = tags.get(PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_BOOTSTRAP_TAG_KEY, "").lower() == "true"
    except AWSClientError as e:
        if e.error_code == "NoSuchEntity":
            logging.info("Creating default build-image stack cleanup role %s because it does not exist.", role_name)
            iam.create_role(
                RoleName=role_name,
                Path=IAM_ROLE_PATH,
                AssumeRolePolicyDocument=json.dumps(assume_doc),
                Description="AWS ParallelCluster build-image cleanup Lambda execution role",
            )
            already_bootstrapped = False
        else:
            raise

    # Attach AWSLambdaVPCAccessExecutionRole
    if attach_vpc_access_policy:
        iam.attach_role_policy(
            role_name,
            f"arn:{partition}:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole",
        )

    if already_bootstrapped:
        return role_arn

    # Attach AWSLambdaBasicExecutionRole
    cleanup_role_basic_managed_policy = f"arn:{partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
    iam.attach_role_policy(role_name, cleanup_role_basic_managed_policy)

    # Put inline policy
    iam.put_role_policy(
        role_name=role_name,
        policy_name="ParallelClusterCleanupInline",
        policy_document=_expected_inline_policy(account_id, partition),
    )

    # Set bootstrapped tag after policy write succeeds
    iam.tag_role(
        role_name=role_name,
        tags=[{"Key": PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_BOOTSTRAP_TAG_KEY, "Value": "true"}],
    )
    return role_arn
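The safe ordering described in the docstring (create if missing, write policies, tag last) can be exercised with an in-memory fake IAM client. The class and method names below are simplified stand-ins rather than the PR's actual wrappers:

```python
BOOTSTRAP_TAG = "parallelcluster:build-image-cleanup-role-bootstrapped"


class NoSuchEntityError(Exception):
    """Stand-in for the AWSClientError with error_code 'NoSuchEntity'."""


class FakeIam:
    """In-memory IAM double recording roles, policies, and tags."""

    def __init__(self):
        self.roles = {}

    def get_role(self, role_name):
        if role_name not in self.roles:
            raise NoSuchEntityError(role_name)
        return {"Role": {"Tags": self.roles[role_name]["tags"]}}

    def create_role(self, role_name):
        self.roles[role_name] = {"tags": [], "policies": []}

    def put_role_policy(self, role_name, policy_name):
        self.roles[role_name]["policies"].append(policy_name)

    def tag_role(self, role_name, tags):
        self.roles[role_name]["tags"].extend(tags)


def ensure_role(iam, role_name):
    try:
        resp = iam.get_role(role_name)
        tags = {t["Key"]: t["Value"] for t in resp["Role"]["Tags"]}
        if tags.get(BOOTSTRAP_TAG, "").lower() == "true":
            return "already-bootstrapped"
    except NoSuchEntityError:
        iam.create_role(role_name)
    # Policy first, tag last: a failure here leaves the tag unset, so the
    # next invocation retries instead of wrongly assuming a healthy role.
    iam.put_role_policy(role_name, "ParallelClusterCleanupInline")
    iam.tag_role(role_name, [{"Key": BOOTSTRAP_TAG, "Value": "true"}])
    return "bootstrapped"


iam = FakeIam()
assert ensure_role(iam, "DemoCleanupRole") == "bootstrapped"
assert ensure_role(iam, "DemoCleanupRole") == "already-bootstrapped"
```

The second call short-circuits on the bootstrapped tag, which is what lets users customize the role's policy without pcluster overwriting it on every build-image invocation.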