Jira: https://asfdaac.atlassian.net/browse/TOOL-3771
Note: The above link is accessible only to members of ASF.
When deploying HyP3 to update an existing CloudFormation stack, CloudFormation may attempt and fail to delete a resource, leaving the resource removed from the stack but not actually deleted. From the "Resource removed from stack but not deleted" section of the CloudFormation troubleshooting guide:
During a stack update, CloudFormation has removed a resource from a stack but not deleted the resource. The resource still exists, but is no longer accessible through CloudFormation. This may occur during stack updates where:
- CloudFormation needs to replace an existing resource, so it first creates a new resource, then attempts to delete the old resource.
- You have removed the resource from the stack template, so CloudFormation attempts to delete the resource from the stack.
However, there may be cases where CloudFormation can't delete the resource. For example, if the user doesn't have permissions to delete a resource of a given type.
CloudFormation attempts to delete the old resource three times. If CloudFormation can't delete the old resource, it removes the old resource from the stack and continues updating the stack. When the stack update is complete, CloudFormation issues an UPDATE_COMPLETE stack event, but includes a StatusReason that states that one or more resources couldn't be deleted. CloudFormation also issues a DELETE_FAILED event for the specific resource, with a corresponding StatusReason providing more detail on why CloudFormation failed to delete the resource.
To resolve this situation, delete the resource directly using the console or API for the underlying service.
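For reference, the DELETE_FAILED events described above are visible through the DescribeStackEvents API. A minimal boto3 sketch (the `hyp3` stack name is a placeholder, and this scans the stack's full event history, not just the latest update):

```python
# Minimal sketch: list DELETE_FAILED events for a stack via DescribeStackEvents.
# The stack name is a placeholder; this is not existing HyP3 code.
import boto3

cloudformation = boto3.client('cloudformation')


def get_delete_failed_events(stack_name: str) -> list[dict]:
    """Return every DELETE_FAILED event recorded for the given stack."""
    events = []
    paginator = cloudformation.get_paginator('describe_stack_events')
    for page in paginator.paginate(StackName=stack_name):
        events.extend(
            event for event in page['StackEvents']
            if event['ResourceStatus'] == 'DELETE_FAILED'
        )
    return events


for event in get_delete_failed_events('hyp3'):
    print(event['LogicalResourceId'], event.get('ResourceStatusReason', ''))
```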
We ran into an issue when deploying HyP3 v9.5.4, which removed the `start_execution_worker` and `start_execution_manager` Lambda functions and replaced them with a single `start_execution` Lambda. However, the JPL service users lacked the `cloudformation:DeleteStack` permission and were therefore unable to delete the StartExecutionWorker and StartExecutionManager stacks, which resulted in those stacks being removed from the parent stack but not deleted. This caused jobs to fail unexpectedly because the manager was still pulling jobs and submitting them to the worker.
The important part from the above docs is:
If CloudFormation can't delete the old resource, it removes the old resource from the stack and continues updating the stack. When the stack update is complete, CloudFormation issues an UPDATE_COMPLETE stack event, but includes a StatusReason that states that one or more resources couldn't be deleted.
We didn't notice the issue when updating the stack because the update succeeded with `UPDATE_COMPLETE`, but we should be able to check the StatusReason and have our GitHub Actions deploy workflow fail if there are orphaned resources (or any other error messages in StatusReason?). For orphaned resources, the StatusReason is: "Update successful. One or more resources could not be deleted."
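A rough sketch of what that post-deploy check could look like, using boto3 for illustration (the stack name, function name, and exit behavior are all assumptions, not existing HyP3 code):

```python
# Sketch of a post-deploy check: fail the workflow if the most recent UPDATE_COMPLETE
# event for the stack itself carries a StatusReason (e.g. "Update successful. One or
# more resources could not be deleted."). Names and exit behavior are assumptions.
import sys

import boto3

cloudformation = boto3.client('cloudformation')


def check_update_status_reason(stack_name: str) -> None:
    paginator = cloudformation.get_paginator('describe_stack_events')
    for page in paginator.paginate(StackName=stack_name):
        for event in page['StackEvents']:  # newest events first
            # The first stack-level UPDATE_COMPLETE event is the update that just finished.
            if event['LogicalResourceId'] == stack_name and event['ResourceStatus'] == 'UPDATE_COMPLETE':
                reason = event.get('ResourceStatusReason')
                if reason:
                    print(f'{stack_name} updated with warnings: {reason}')
                    sys.exit(1)
                return


check_update_status_reason('hyp3')
```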
Although, from the CloudFormation logs for the 2025-03-11 deployment, it looks like the StatusReason only appears for the StepFunction stack (which was the parent of the StartExecutionWorker and StartExecutionManager stacks), not for the top-level hyp3 stack. So, it could be tricky to detect orphaned resources in child stacks.
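If we do want to catch this, the check would probably need to walk the nested stacks as well; a hypothetical sketch (the helper names and the `hyp3` stack name are placeholders) that recurses through `AWS::CloudFormation::Stack` resources and reports any DELETE_FAILED events:

```python
# Hypothetical sketch: recurse through nested stacks and report DELETE_FAILED events,
# since the StatusReason may only appear on a child stack (e.g. StepFunction) rather
# than the top-level stack. Helper names and the stack name are placeholders.
from typing import Iterator

import boto3

cloudformation = boto3.client('cloudformation')


def iter_stack_ids(stack_name: str) -> Iterator[str]:
    """Yield the given stack and, recursively, all of its nested stacks."""
    yield stack_name
    paginator = cloudformation.get_paginator('list_stack_resources')
    for page in paginator.paginate(StackName=stack_name):
        for resource in page['StackResourceSummaries']:
            if resource['ResourceType'] == 'AWS::CloudFormation::Stack' and resource.get('PhysicalResourceId'):
                yield from iter_stack_ids(resource['PhysicalResourceId'])


def find_orphaned_resources(root_stack_name: str) -> list[tuple[str, str, str]]:
    """Return (stack, resource, reason) for every DELETE_FAILED event across all nested stacks."""
    orphans = []
    events_paginator = cloudformation.get_paginator('describe_stack_events')
    for stack_id in iter_stack_ids(root_stack_name):
        for page in events_paginator.paginate(StackName=stack_id):
            for event in page['StackEvents']:
                if event['ResourceStatus'] == 'DELETE_FAILED':
                    orphans.append(
                        (event['StackName'], event['LogicalResourceId'], event.get('ResourceStatusReason', ''))
                    )
    return orphans


for stack, resource, reason in find_orphaned_resources('hyp3'):
    print(f'{stack}: {resource}: {reason}')
```

A real check would presumably also limit itself to events from the latest update (e.g. by timestamp), since DescribeStackEvents returns the stack's full event history.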
Also see: