
Cluster Update Failure When Adding a New Slurm Queue #4286

Open
@joehellmersNOAA

Description


Required Info:

  • AWS ParallelCluster version: 3.1.4
  • Full cluster configuration without any credentials or personal data.
Region: us-east-1

Image:
  Os: alinux2

HeadNode:
  InstanceType: m5.xlarge
  Networking:
    SubnetId: subnet-0aa9d3bd709f86d50
    SecurityGroups:
    - sg-045290f659c3be158
  Ssh:
    KeyName: pcluster
  LocalStorage:
    RootVolume:
      Size: 600
      VolumeType: gp3
  Iam:
    AdditionalIamPolicies:
    - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
  SlurmQueues:
  - Name: oar
    ComputeResources:
    - Name: m5xlarge
      InstanceType: m5.xlarge
      MinCount: 2
      MaxCount: 10
    Networking:
      SubnetIds:
      - subnet-0aa9d3bd709f86d50
      SecurityGroups:
      - sg-045290f659c3be158
    Iam:
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  - Name: oarm6id24xlarge
    ComputeSettings:
      LocalStorage:
        RootVolume:
          Size: 500
    ComputeResources:
    - Name: m6id24xlarge
      InstanceType: m6id.24xlarge
      MinCount: 0
      MaxCount: 2
    Networking:
      SubnetIds:
      - subnet-0aa9d3bd709f86d50
      SecurityGroups:
      - sg-045290f659c3be158
    Iam:
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

SharedStorage:
- MountDir: /share
  Name: efs-covid
  StorageType: Efs
  EfsSettings:
    FileSystemId: fs-7fce8efd

#AdditionalPackages:
#  IntelSoftware:
#    IntelHpcPlatform: true

DirectoryService:
  DomainName: dc=ncisdev,dc=noaa
  DomainAddr: ldap://10.101.14.78,ldap://10.101.9.221
  PasswordSecretArn: arn:aws:secretsmanager:us-east-1:716453263077:secret:MicrosoftAD.Admin.Password-gDGZv6
  DomainReadOnlyUser: cn=adjoin,ou=service,ou=NCISDEV,dc=ncisdev,dc=noaa
  AdditionalSssdConfigs:
    ldap_auth_disable_tls_never_use_in_production: True

Tags:
  - Key: noaa:environment
    Value: dev
  - Key: noaa:fismaid
    Value: noaa5006
  - Key: noaa:lineoffice
    Value: nesdis
  - Key: noaa:programoffice
    Value: 40-00
  - Key: noaa:taskerorderid
    Value: 13051420fneea0147
  - Key: noaa:application
    Value: OAR WRF-Chem

  • Cluster name: oar-pcluster

  • Output of pcluster describe-cluster command.

{
  "creationTime": "2022-06-30T19:35:47.455Z",
  "headNode": {
    "launchTime": "2022-06-30T19:38:44.000Z",
    "instanceId": "i-00c4711259d394ae3",
    "instanceType": "m5.xlarge",
    "state": "running",
    "privateIpAddress": "10.102.8.85"
  },
  "version": "3.1.4",
  "clusterConfiguration": {
    "url": "https://parallelcluster-e2ca1557272da85b-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.1.4/clusters/oar-pcluster-wtisscfpi6pepclv/configs/cluster-config.yaml?versionId=ZkTmb6RC_GHCxLLymscOGWLSeh0womE4&AWSAccessKeyId=AKIA2NT7RG3SULLIW55M&Signature=EPFIQnc8YB%2BsP3CunBz3WBPvLfY%3D&Expires=1661212485"
  },
  "tags": [
    {
      "value": "13051420fneea0147",
      "key": "noaa:taskerorderid"
    },
    {
      "value": "3.1.4",
      "key": "parallelcluster:version"
    },
    {
      "value": "noaa5006",
      "key": "noaa:fismaid"
    },
    {
      "value": "dev",
      "key": "noaa:environment"
    },
    {
      "value": "OAR WRF-Chem",
      "key": "noaa:application"
    },
    {
      "value": "nesdis",
      "key": "noaa:lineoffice"
    },
    {
      "value": "40-00",
      "key": "noaa:programoffice"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "clusterName": "oar-pcluster",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:716453263077:stack/oar-pcluster/d7077910-f8ab-11ec-aa98-0e38e0433449",
  "lastUpdatedTime": "2022-08-22T22:34:59.022Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_FAILED"
}

  • [Optional] Arn of the cluster CloudFormation main stack:
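The describe-cluster output above shows the cluster stuck in `UPDATE_FAILED` with the CloudFormation stack in `UPDATE_ROLLBACK_COMPLETE`, i.e. the stack rolled back after the update error. A minimal sketch of detecting this state from the saved JSON (the helper name and reading the output from a string are assumptions for illustration, not part of the pcluster CLI):

```python
import json

def update_failed(describe_output: str) -> bool:
    """Return True if the describe-cluster JSON shows a rolled-back, failed update."""
    info = json.loads(describe_output)
    # UPDATE_FAILED plus UPDATE_ROLLBACK_COMPLETE means CloudFormation
    # rejected the change set and restored the previous stack state.
    return (info.get("clusterStatus") == "UPDATE_FAILED"
            and info.get("cloudFormationStackStatus") == "UPDATE_ROLLBACK_COMPLETE")

sample = ('{"clusterStatus": "UPDATE_FAILED", '
          '"cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE"}')
print(update_failed(sample))  # True
```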

Bug description and how to reproduce:
When trying to update the already existing cluster with the configuration above, I get an error.

Interface: [eni-0966644c5bb6b0347] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 557b2408-e32b-4d37-adfd-a0332a8c0ce9; Proxy: null)

The only thing that is changing from the original configuration is the addition of the Slurm queue oarm6id24xlarge.
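The `InvalidNetworkInterface.InUse` error names the blocking ENI. A small, illustrative helper (not part of ParallelCluster) to pull the interface id out of such a message, so it can then be inspected, e.g. with `aws ec2 describe-network-interfaces`, to see which resource still holds it:

```python
import re

def blocked_eni(error_message: str):
    """Extract the first EC2 network interface id from an error message, if any."""
    # EC2 ENI ids look like "eni-" followed by lowercase hex characters.
    match = re.search(r"eni-[0-9a-f]+", error_message)
    return match.group(0) if match else None

msg = ("Interface: [eni-0966644c5bb6b0347] in use. "
       "(Service: AmazonEC2; Status Code: 400; "
       "Error Code: InvalidNetworkInterface.InUse)")
print(blocked_eni(msg))  # eni-0966644c5bb6b0347
```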

If you are reporting issues about scaling or job failure:
We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attaching the complete cluster log archive to the ticket.

For issues with Slurm scheduler, please attach the following logs:

  • From Head node: /var/log/parallelcluster/clustermgtd, /var/log/parallelcluster/clusterstatusmgtd (if version >= 3.2.0), /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, /var/log/parallelcluster/slurm_fleet_status_manager.log (if version >= 3.2.0) and /var/log/slurmctld.log.
  • From Compute node: /var/log/parallelcluster/computemgtd.log and /var/log/slurmd.log.

If you are reporting issues about cluster creation failure or node failure:

If the cluster fails creation, please re-execute create-cluster action using --rollback-on-failure false option.

We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attaching the complete cluster log archive to the ticket.

Please be sure to attach the following logs:

  • From Head node: /var/log/cloud-init.log, /var/log/cfn-init.log and /var/log/chef-client.log
  • From Compute node: /var/log/cloud-init-output.log.

Additional context:
Any other context about the problem. E.g.:

  • CLI logs: ~/.parallelcluster/pcluster-cli.log
  • Custom bootstrap scripts, if any
  • Screenshots, if useful.
