Jira: https://asfdaac.atlassian.net/browse/TOOL-3771
Note: The above link is accessible only to members of ASF.
When deploying HyP3 to update an existing CloudFormation stack, CloudFormation may attempt and fail to delete a resource, leaving the resource removed from the stack but not actually deleted. From the "Resource removed from stack but not deleted" section of the CloudFormation troubleshooting guide:
During a stack update, CloudFormation has removed a resource from a stack but not deleted the resource. The resource still exists, but is no longer accessible through CloudFormation. This may occur during stack updates where:
- CloudFormation needs to replace an existing resource, so it first creates a new resource, then attempts to delete the old resource.
- You have removed the resource from the stack template, so CloudFormation attempts to delete the resource from the stack.
However, there may be cases where CloudFormation can't delete the resource. For example, if the user doesn't have permissions to delete a resource of a given type.
CloudFormation attempts to delete the old resource three times. If CloudFormation can't delete the old resource, it removes the old resource from the stack and continues updating the stack. When the stack update is complete, CloudFormation issues an UPDATE_COMPLETE stack event, but includes a StatusReason that states that one or more resources couldn't be deleted. CloudFormation also issues a DELETE_FAILED event for the specific resource, with a corresponding StatusReason providing more detail on why CloudFormation failed to delete the resource.
To resolve this situation, delete the resource directly using the console or API for the underlying service.
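For reference, the DELETE_FAILED events described above are visible through the DescribeStackEvents API. A minimal boto3 sketch (the `hyp3` stack name is a placeholder, and this scans the stack's full event history, not just the latest update):

```python
# Minimal sketch: list DELETE_FAILED events for a stack via DescribeStackEvents.
# The stack name is a placeholder; this is not existing HyP3 code.
import boto3

cloudformation = boto3.client('cloudformation')


def get_delete_failed_events(stack_name: str) -> list[dict]:
    """Return every DELETE_FAILED event recorded for the given stack."""
    events = []
    paginator = cloudformation.get_paginator('describe_stack_events')
    for page in paginator.paginate(StackName=stack_name):
        events.extend(
            event for event in page['StackEvents']
            if event['ResourceStatus'] == 'DELETE_FAILED'
        )
    return events


for event in get_delete_failed_events('hyp3'):
    print(event['LogicalResourceId'], event.get('ResourceStatusReason', ''))
```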
We ran into an issue when deploying HyP3 v9.5.4, which removed the `start_execution_worker` and `start_execution_manager` Lambda functions and replaced them with a single `start_execution` Lambda. However, the JPL service users lacked the `cloudformation:DeleteStack` permission and were therefore unable to delete the StartExecutionWorker and StartExecutionManager stacks, which resulted in those stacks being removed from the parent stack but not deleted. This caused jobs to fail unexpectedly because the manager was still pulling jobs and submitting them to the worker.
The important part from the above docs is:
If CloudFormation can't delete the old resource, it removes the old resource from the stack and continues updating the stack. When the stack update is complete, CloudFormation issues an UPDATE_COMPLETE stack event, but includes a StatusReason that states that one or more resources couldn't be deleted.
We didn't notice the issue when updating the stack because the update succeeded with `UPDATE_COMPLETE`, but we should be able to check the StatusReason and have our GitHub Actions deploy workflow fail if there are orphaned resources (or any other error messages in StatusReason?). For orphaned resources, the StatusReason is: "Update successful. One or more resources could not be deleted."
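A rough sketch of what that post-deploy check could look like, using boto3 for illustration (the stack name, function name, and exit behavior are all assumptions, not existing HyP3 code):

```python
# Sketch of a post-deploy check: fail the workflow if the most recent UPDATE_COMPLETE
# event for the stack itself carries a StatusReason (e.g. "Update successful. One or
# more resources could not be deleted."). Names and exit behavior are assumptions.
import sys

import boto3

cloudformation = boto3.client('cloudformation')


def check_update_status_reason(stack_name: str) -> None:
    paginator = cloudformation.get_paginator('describe_stack_events')
    for page in paginator.paginate(StackName=stack_name):
        for event in page['StackEvents']:  # newest events first
            # The first stack-level UPDATE_COMPLETE event is the update that just finished.
            if event['LogicalResourceId'] == stack_name and event['ResourceStatus'] == 'UPDATE_COMPLETE':
                reason = event.get('ResourceStatusReason')
                if reason:
                    print(f'{stack_name} updated with warnings: {reason}')
                    sys.exit(1)
                return


check_update_status_reason('hyp3')
```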
Although, from the CloudFormation logs for the 2025-03-11 deployment, it looks like the StatusReason only appears for the StepFunction stack (which was the parent of the StartExecutionWorker and StartExecutionManager stacks), not for the top-level hyp3 stack. So, it could be tricky to detect orphaned resources in child stacks.
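If we do want to catch this, the check would probably need to walk the nested stacks as well; a hypothetical sketch (the helper names and the `hyp3` stack name are placeholders) that recurses through `AWS::CloudFormation::Stack` resources and reports any DELETE_FAILED events:

```python
# Hypothetical sketch: recurse through nested stacks and report DELETE_FAILED events,
# since the StatusReason may only appear on a child stack (e.g. StepFunction) rather
# than the top-level stack. Helper names and the stack name are placeholders.
from typing import Iterator

import boto3

cloudformation = boto3.client('cloudformation')


def iter_stack_ids(stack_name: str) -> Iterator[str]:
    """Yield the given stack and, recursively, all of its nested stacks."""
    yield stack_name
    paginator = cloudformation.get_paginator('list_stack_resources')
    for page in paginator.paginate(StackName=stack_name):
        for resource in page['StackResourceSummaries']:
            if resource['ResourceType'] == 'AWS::CloudFormation::Stack' and resource.get('PhysicalResourceId'):
                yield from iter_stack_ids(resource['PhysicalResourceId'])


def find_orphaned_resources(root_stack_name: str) -> list[tuple[str, str, str]]:
    """Return (stack, resource, reason) for every DELETE_FAILED event across all nested stacks."""
    orphans = []
    events_paginator = cloudformation.get_paginator('describe_stack_events')
    for stack_id in iter_stack_ids(root_stack_name):
        for page in events_paginator.paginate(StackName=stack_id):
            for event in page['StackEvents']:
                if event['ResourceStatus'] == 'DELETE_FAILED':
                    orphans.append(
                        (event['StackName'], event['LogicalResourceId'], event.get('ResourceStatusReason', ''))
                    )
    return orphans


for stack, resource, reason in find_orphaned_resources('hyp3'):
    print(f'{stack}: {resource}: {reason}')
```

A real check would presumably also limit itself to events from the latest update (e.g. by timestamp), since DescribeStackEvents returns the stack's full event history.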
Also see: