Skip to content

[Develop] Introduce Global Cleanup IAM Role for ParallelCluster Build-Image #6912

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ CHANGELOG

**BUG FIXES**
- Fix an issue where Security Group validation failed when a rule contained both IPv4 ranges (IpRanges) and security group references (UserIdGroupPairs).
- Fix a bug that caused build-image CloudFormation stacks fail to delete after images are successfully built.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. after images are successfully built. it fails despite the outcome of the build image. even if the build-image fails, the stack is not able to get auto-deleted
  2. fail to delete I'd say fail to auto-delete. Otherwise it may seem that the user cannot even delete the stack manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Moved it to CHANGES.
- Introduce default global build-image cleanup IAM versioning roles to prevent build-image CloudFormation stacks from failing to be automatically deleted after images are either successfully built or fail to build.


3.13.2
------
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@
import logging
import os as os_lib

import yaml

from pcluster.api.controllers.common import (
assert_supported_operation,
configure_aws_region,
Expand Down Expand Up @@ -52,14 +54,15 @@
from pcluster.aws.common import AWSClientError
from pcluster.aws.ec2 import Ec2Client
from pcluster.constants import SUPPORTED_ARCHITECTURES, SUPPORTED_OSES, Operation
from pcluster.imagebuilder_utils import ensure_cleanup_role
from pcluster.models.imagebuilder import (
BadRequestImageBuilderActionError,
ConfigValidationError,
ImageBuilder,
NonExistingImageError,
)
from pcluster.models.imagebuilder_resources import ImageBuilderStack, NonExistingStackError
from pcluster.utils import get_installed_version, to_utc_datetime
from pcluster.utils import get_installed_version, get_partition, to_utc_datetime
from pcluster.validators.common import FailureLevel

LOGGER = logging.getLogger(__name__)
Expand Down Expand Up @@ -105,6 +108,31 @@ def build_image(
validation_failure_level = validation_failure_level or ValidationLevel.ERROR
dryrun = dryrun or False

raw_cfg_str = build_image_request_content["imageConfiguration"]
cfg_dict = yaml.safe_load(raw_cfg_str) or {}
# If CleanupLambdaRole exists in the config, skip ensure_cleanup_role
has_custom_cleanup_role = cfg_dict.get("Build", {}).get("Iam", {}).get("CleanupLambdaRole")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CodeStyle] Encapsulate the logic to check whether a custom role is set into a dedicated function and unit test it independently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary. It's a common dig logic and the logic to test if it can correctly dig is already include in the unit test test_enable_cleanup_role_call_and_vpc_flag.

# If LambdaFunctionsVpcConfig exists in the config, attach the AWS-managed LambdaVPCAccess policy
has_lambda_functions_vpc_config = cfg_dict.get("DeploymentSettings", {}).get("LambdaFunctionsVpcConfig")
Comment on lines +115 to +116
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the lambda function vpc configuraitons?

Copy link
Contributor Author

@hehe7318 hehe7318 Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See

if self.config.lambda_functions_vpc_config:
managed_lambda_policy.append(Fn.sub(LAMBDA_VPC_ACCESS_MANAGED_POLICY))

The cleanup role needs the AWSLambdaVPCAccessExecutionRole when LambdaFunctionsVpcConfig exists in the config.


if not has_custom_cleanup_role:
try:
account_id = AWSApi.instance().sts.get_account_id()
ensure_cleanup_role(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CodeStyle] ensure_cleanup_role is a bit vague name for this function.
The name should reflect that iwe are talking about the default cleanup role and what cleanup role we are talking about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, modified to ensure_default_build_image_stack_cleanup_role

account_id, get_partition(), attach_vpc_access_policy=True if has_lambda_functions_vpc_config else False
)
except AWSClientError as e:
if e.error_code in ("AccessDenied", "AccessDeniedException", "UnauthorizedOperation"):
raise BadRequestException(
"Current principal lacks permissions to create or update the ParallelCluster build-image "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Design] Why do we need to update the role>? Didn't we agreed on having versioned roles specifically to not update the policy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reached an agreement yesterday. Done. Changed the logic to create roles with revision in their name.

"cleanup IAM role. "
"Either pass `Build/Iam/CleanupLambdaRole` or grant the missing permissions to continue. "
"For detailed instructions, please refer to our documentation. "
"https://docs.aws.amazon.com/parallelcluster/latest/ug/iam-roles-in-parallelcluster-v3.html#iam-rol"
"es-in-parallelcluster-v3-user-policy-build-image"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[UX] We should not add links to specific pages in our public documentation. Those links may change overtime and we would end up with CLI messages containing broken links.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted.

)
raise

build_image_request_content = BuildImageRequestContent.from_dict(build_image_request_content)

try:
Expand Down
20 changes: 20 additions & 0 deletions cli/src/pcluster/aws/iam.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,3 +32,23 @@ def get_role(self, role_name):
def get_instance_profile(self, instance_profile_name):
"""Get instance profile information."""
return self._client.get_instance_profile(InstanceProfileName=instance_profile_name)

@AWSExceptionHandler.handle_client_exception
def create_role(self, **kwargs):
"""Create IAM role."""
return self._client.create_role(**kwargs)

@AWSExceptionHandler.handle_client_exception
def attach_role_policy(self, role_name, policy_arn):
"""Attach a managed policy to the given role."""
return self._client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

@AWSExceptionHandler.handle_client_exception
def put_role_policy(self, role_name, policy_name, policy_document):
"""Create or replace the specified inline policy on a role."""
return self._client.put_role_policy(RoleName=role_name, PolicyName=policy_name, PolicyDocument=policy_document)

@AWSExceptionHandler.handle_client_exception
def tag_role(self, role_name, tags):
"""Add or overwrite one or more tags for the specified role."""
return self._client.tag_role(RoleName=role_name, Tags=tags)
5 changes: 5 additions & 0 deletions cli/src/pcluster/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -335,3 +335,8 @@ class Operation(Enum):
PCLUSTER_BUCKET_PROTECTED_FOLDER = "parallelcluster"
PCLUSTER_BUCKET_PROTECTED_PREFIX = f"{PCLUSTER_BUCKET_PROTECTED_FOLDER}/"
PCLUSTER_BUCKET_REQUIRED_BOOTSTRAP_FEATURES = ["basic", "export-logs"]

PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_PREFIX = "PClusterBuildImageCleanupRole"
# Tag key & expected revision (increment when policy widens)
CLEANUP_ROLE_REVISION_TAG_KEY = "parallelcluster:cleanup-role-revision"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to be more specific about what cleanup role we are referring to.
What if tomorrow we need to add other cleanup roles?
Example: parallelcluster:build-image-cleanup-role-revision

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted. No longer applicable.

PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] Why EXPECTED? I think we can name it PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_REVISION

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

180 changes: 180 additions & 0 deletions cli/src/pcluster/imagebuilder_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,11 +8,21 @@
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
import os
from hashlib import sha256

import yaml

from pcluster.aws.aws_api import AWSApi
from pcluster.aws.common import AWSClientError
from pcluster.constants import (
CLEANUP_ROLE_REVISION_TAG_KEY,
IAM_ROLE_PATH,
PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION,
PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_PREFIX,
)
from pcluster.utils import get_url_scheme, yaml_load

ROOT_VOLUME_TYPE = "gp3"
Expand Down Expand Up @@ -65,3 +75,173 @@ def _generate_action(action_name, commands):
"""Generate action in imagebuilder components."""
action = {"name": action_name, "action": "ExecuteBash", "inputs": {"commands": [commands]}}
return action


def _hash(account_id: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A function to return an hash may be useful even elsewhere, not only for build image.
What about adding it to a general utils module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"""Return first 12 hex chars of sha256(account_id)."""
return sha256(account_id.encode()).hexdigest()[:12]


def get_cleanup_role_name(account_id: str) -> str:
return f"{PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_PREFIX}-{_hash(account_id)}"


def _expected_inline_policy(account_id: str, partition: str):
"""Return the inline policy document (JSON-serialised string)."""
return json.dumps(
{
"Version": "2012-10-17",
"Statement": [
{
"Action": ["iam:DetachRolePolicy", "iam:DeleteRole", "iam:DeleteRolePolicy"],
"Resource": f"arn:{partition}:iam::{account_id}:role/parallelcluster/*",
"Effect": "Allow",
},
{
"Action": ["iam:DeleteInstanceProfile", "iam:RemoveRoleFromInstanceProfile"],
"Resource": f"arn:{partition}:iam::{account_id}:instance-profile/parallelcluster/*",
"Effect": "Allow",
},
{
"Action": "imagebuilder:DeleteInfrastructureConfiguration",
"Resource": f"arn:{partition}:imagebuilder:*:{account_id}:infrastructure-configuration/"
f"parallelclusterimage-*",
"Effect": "Allow",
},
{
"Action": ["imagebuilder:DeleteComponent"],
"Resource": [f"arn:{partition}:imagebuilder:*:{account_id}:component/parallelclusterimage-*/*"],
"Effect": "Allow",
},
{
"Action": "imagebuilder:DeleteImageRecipe",
"Resource": f"arn:{partition}:imagebuilder:*:{account_id}:image-recipe/parallelclusterimage-*/*",
"Effect": "Allow",
},
{
"Action": "imagebuilder:DeleteDistributionConfiguration",
"Resource": f"arn:{partition}:imagebuilder:*:{account_id}:distribution-configuration/"
f"parallelclusterimage-*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f-string not needed here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed. It's the same string. But due to length restriction so have to split to two lines.

                    "Resource": f"arn:{partition}:imagebuilder:*:{account_id}:distribution-configuration/"
                    f"parallelclusterimage-*",

"Effect": "Allow",
},
{
"Action": ["imagebuilder:DeleteImage", "imagebuilder:GetImage", "imagebuilder:CancelImageCreation"],
"Resource": f"arn:{partition}:imagebuilder:*:{account_id}:image/parallelclusterimage-*/*",
"Effect": "Allow",
},
{
"Action": "cloudformation:DeleteStack",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Security] Can we scope this down using conditions on tags? For example, we known that every pcluster stack will have at least pcluster version tag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ref: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements_condition_operators.html
I changed it to:

                {
                    "Action": "cloudformation:DeleteStack",
                    "Resource": f"arn:{partition}:cloudformation:*:{account_id}:stack/*/*",
                    "Condition": {
                        "ForAnyValue:StringLike": {
                            "cloudformation:ResourceTag/parallelcluster:image_id": "*"
                        }
                    },
                    "Effect": "Allow"
                },

Testing if it works.

"Resource": f"arn:{partition}:cloudformation:*:{account_id}:stack/*/*",
"Effect": "Allow",
},
# The below two permissions are required for the DeleteStackFunction Lambda to tag the
# created AMI with 'parallelcluster:build_status' and 'parallelcluster:parent_image' tags
{"Action": "ec2:CreateTags", "Resource": f"arn:{partition}:ec2:*::image/*", "Effect": "Allow"},
{"Action": "tag:TagResources", "Resource": "*", "Effect": "Allow"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Security] Why on all resources? Can't we scope this down to specific resources?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked Lambda function codes. Done.

{
"Action": ["lambda:DeleteFunction", "lambda:RemovePermission"],
"Resource": f"arn:{partition}:lambda:*:{account_id}:function:ParallelClusterImage-*",
"Effect": "Allow",
},
{
"Action": "logs:DeleteLogGroup",
"Resource": f"arn:{partition}:logs:*:{account_id}:log-group:/aws/lambda/ParallelClusterImage-*:*",
"Effect": "Allow",
},
{
"Action": [
"SNS:GetTopicAttributes",
"SNS:DeleteTopic",
"SNS:GetSubscriptionAttributes",
"SNS:Unsubscribe",
],
"Resource": f"arn:{partition}:sns:*:{account_id}:ParallelClusterImage-*",
"Effect": "Allow",
},
],
}
)


def ensure_cleanup_role(account_id: str, partition: str = "aws", attach_vpc_access_policy: bool = False):
"""
Ensure the global (account-wide) cleanup role exists and is at the expected revision.
The function follows a safe order:
1. If the role does not exist, create it without the revision tag.
2. If LambdaFunctionsVpcConfig exists in the config, attach the AWS-managed LambdaVPCAccess policy
3. Attach the AWS-managed Lambda basic policy.
4. Update/write the inline policy (least-privilege cleanup policy).
5. Only after the inline policy succeeds, set or bump the revision tag.
This way, if step 2 or 3 fails (e.g., lack of iam:PutRolePolicy permission) the tag
stays at an older revision and future invocations will keep retrying.
"""
iam = AWSApi.instance().iam
role_name = get_cleanup_role_name(account_id)
role_arn = f"arn:{partition}:iam::{account_id}:role{IAM_ROLE_PATH}{role_name}"

# Assume-role trust policy
assume_doc = {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {"Service": "lambda.amazonaws.com"},
"Action": "sts:AssumeRole",
"Condition": {
"ArnLike": {
"aws:SourceArn": f"arn:{partition}:lambda:*:{account_id}:function:ParallelClusterImage-*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Security] This condition allows every function in the stack to assume this role, but we want only a specific function to be able to. Can we use a more specific name?

Copy link
Contributor Author

@hehe7318 hehe7318 Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we can't, the delete stack lambda function is named like "ParallelClusterImage-15a09820-6274-11f0-8162-0e903a4ee48d".

}
},
}
],
}

# Check whether the role already exists
try:
resp = iam.get_role(role_name=role_name)
tags = {t["Key"]: t["Value"] for t in resp["Role"].get("Tags", [])}
current_revision = int(tags.get(CLEANUP_ROLE_REVISION_TAG_KEY, 0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not using the revision number specified in PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my approach, create role with tag=0. And the expected revision is 1. So that it can enter the if else block if current_revision < PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION to update the policy.
See my function description, in this approach, it can prevent tag updated but policy not match the expectation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer applicable. But the logic didn't change.

except AWSClientError as e:
if e.error_code == "NoSuchEntity":
logging.info("Creating global cleanup role %s", role_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specify that we are creating it because it does not exists. May sound obvious but since this is a critical resource better to specify why we take some decisions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

iam.create_role(
RoleName=role_name,
Path=IAM_ROLE_PATH,
AssumeRolePolicyDocument=json.dumps(assume_doc),
Description="AWS ParallelCluster build-image cleanup Lambda execution role",
)
current_revision = 0
else:
raise e
# (re)attach AWSLambdaVPCAccessExecutionRole
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why re-attach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced yesterday. It's idempotent

if attach_vpc_access_policy:
iam.attach_role_policy(
role_name, f"arn:{partition}:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
)

if current_revision < PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION:
# (re)attach AWSLambdaBasicExecutionRole
cleanup_role_basic_managed_policy = f"arn:{partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
iam.attach_role_policy(role_name, cleanup_role_basic_managed_policy)

# Put (or overwrite) inline policy
iam.put_role_policy(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Design] Do we have a way to avoid policy updates? I thought that by using versioned roles we could do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we have, and it has been applied. See if current_revision < PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION: and the function description for details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to have revision in the role name and do not update the role.

role_name=role_name,
policy_name="ParallelClusterCleanupInline",
policy_document=_expected_inline_policy(account_id, partition),
)

# Set/update revision tag after policy write succeeds
iam.tag_role(
role_name=role_name,
tags=[
{
"Key": CLEANUP_ROLE_REVISION_TAG_KEY,
"Value": str(PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION),
}
],
)

return role_arn
Loading
Loading