[3.13.0] Address cluster update failure when old capacity reservation has been deleted #6869

Merged

Conversation

hanwen-cluster
Contributor

Cherry picked from #6867

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners June 11, 2025 21:58
@hanwen-cluster hanwen-cluster added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Jun 11, 2025
    capacity_reservations = AWSApi.instance().ec2.describe_capacity_reservations([capacity_reservation_id])
    if capacity_reservations:
        instance_type = capacity_reservations[0].instance_type()
except AWSClientError:
Contributor

Nitpick: Can we check whether AWSClientError can also be raised for any other reason?

If it is raised for some other reason and we set InstanceType to None, will that affect the ComputeFleet during scaling, where the launch template defines the capacity reservation but doesn't define the instance type?

NOTE: A missing InstanceType can affect the Slurm configuration files to some extent, since we won't be able to use Feature constraints based on the instance type.
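
A minimal sketch of what narrowing the handler to the "reservation deleted" case could look like, so that other failures (throttling, permissions) still surface. This is not the code in this PR. `ClientError`, `Reservation`, and `resolve_instance_type` are hypothetical stand-ins for illustration, and the error code string is an assumption that should be checked against the EC2 error code list.

```python
from typing import Callable, List, Optional

# Assumed EC2 error code for a deleted or unknown reservation; verify before relying on it.
NOT_FOUND_CODE = "InvalidCapacityReservationId.NotFound"


class ClientError(Exception):
    """Hypothetical stand-in for AWSClientError that carries an error code."""

    def __init__(self, message: str, error_code: str):
        super().__init__(message)
        self.error_code = error_code


class Reservation:
    """Hypothetical stand-in for the objects returned by describe_capacity_reservations."""

    def __init__(self, instance_type: str):
        self._instance_type = instance_type

    def instance_type(self) -> str:
        return self._instance_type


def resolve_instance_type(
    describe: Callable[[List[str]], List[Reservation]], reservation_id: str
) -> Optional[str]:
    """Return the reservation's instance type, or None only if the reservation is gone."""
    try:
        reservations = describe([reservation_id])
    except ClientError as e:
        if e.error_code == NOT_FOUND_CODE:
            return None  # reservation was deleted: tolerate it
        raise  # any other client error is re-raised unchanged
    return reservations[0].instance_type() if reservations else None


# Stub "describe" callables standing in for the real API call.
def present(ids: List[str]) -> List[Reservation]:
    return [Reservation("p4d.24xlarge")]


def deleted(ids: List[str]) -> List[Reservation]:
    raise ClientError("reservation not found", NOT_FOUND_CODE)


def denied(ids: List[str]) -> List[Reservation]:
    raise ClientError("not authorized", "UnauthorizedOperation")
```

With these stubs, `resolve_instance_type(present, "cr-1")` returns the instance type, `resolve_instance_type(deleted, "cr-1")` returns `None`, and `resolve_instance_type(denied, "cr-1")` re-raises the error instead of silently yielding `None`.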

Contributor Author

Before FDM's fix:

pcluster create-cluster --cluster-configuration /Users/hanwenli/.parallelcluster/basicconfig --cluster-name cluster-name-dcv --region us-east-1
{
  "message": "Invalid cluster configuration: An error occurred when calling the DescribeCapacityReservations operation: The capacity reservation ID 'cr-0c05c40da26fd1111' was not found"
}

After the fix:

pcluster create-cluster --cluster-configuration /Users/hanwenli/.parallelcluster/basicconfig --cluster-name cluster-name-dcv --region us-east-1
{
  "message": "Invalid cluster configuration: Error validating parameter. Failed with exception: Parameter validation failed:\nInvalid type for parameter InstanceTypes[0], value: None, type: <class 'NoneType'>, valid types: <class 'str'>"
}

Either way, cluster creation still fails. This fix really only affects update-cluster and update-compute-fleet.

@hanwen-cluster hanwen-cluster merged commit e90b85c into aws:integ-tests-3.13.0 Jun 11, 2025
25 of 27 checks passed
2 participants