Description
ParallelCluster version: v3.11.0
aws-parallelcluster (Terraform provider) version: v1.1.0
When building an image via Terraform, the creation fails on the Terraform side after exactly 1 hour. I have reproduced this 3 times. According to the relevant CloudWatch logs, the build is still running at the moment Terraform fails, and it later finishes successfully. However, the resource is left in a tainted state on the Terraform side, which prevents managing it through IaC.
Relevant Terraform output:
module.images.module.images["hpc6a.48xlarge"].aws-parallelcluster_image.main: Still creating... [1h0m1s elapsed]
╷
│ Error: Image create failed to complete.
│
│ with module.images.module.images["hpc6a.48xlarge"].aws-parallelcluster_image.main,
│ on ../../../modules/common/images/image/image.tf line 26, in resource "aws-parallelcluster_image" "main":
│ 26: resource "aws-parallelcluster_image" "main" {
│
│ Error: 403 Forbidden
My first thought was that the Terraform resource code had a 1-hour timeout, so I created #6891. However, while inspecting the code, I realized it SHOULD have a 3-hour timeout.
Hence, my only remaining suspicion is that an assumed role is reaching its maximum session duration, or something similar. How can I debug this problem?
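One rough way to test the expired-credentials hypothesis is to compare the expiration time of the temporary credentials with the time the build is allowed to take. The sketch below is only an illustration under assumptions: it assumes the Terraform run uses temporary STS credentials whose `Expiration` timestamp is known (for example, from the output of `aws sts assume-role`), and the function names and timestamps are hypothetical. The 3600-second value is the STS default for `AssumeRole` when no `DurationSeconds` is passed.

```python
from datetime import datetime, timedelta, timezone

# STS AssumeRole issues credentials valid for 3600 seconds by default
# when DurationSeconds is not specified.
DEFAULT_ASSUME_ROLE_SECONDS = 3600


def parse_expiration(value):
    """Parse an ISO 8601 timestamp such as '2024-11-05T11:00:00Z'."""
    # Normalize a trailing 'Z' so datetime.fromisoformat accepts it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def expires_before_timeout(start, expiration, timeout):
    """Return True if the credentials expire before start + timeout."""
    return expiration < start + timeout


if __name__ == "__main__":
    # Hypothetical start time of `terraform apply`.
    start = datetime(2024, 11, 5, 10, 0, 0, tzinfo=timezone.utc)
    # Assumption: credentials were obtained at the start with the default lifetime.
    expiration = start + timedelta(seconds=DEFAULT_ASSUME_ROLE_SECONDS)
    # Compare against the 3-hour timeout the resource is supposed to have.
    print(expires_before_timeout(start, expiration, timedelta(hours=3)))
```

If this check returns true for the credentials actually in use, a failure at the 1-hour mark with a 403 would be consistent with the session expiring rather than with a Terraform-side timeout.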
Best regards,