
Cluster Update Failure When Adding a New Slurm Queue #4286

Open
@joehellmersNOAA

Description


Required Info:

  • AWS ParallelCluster version: 3.1.4
  • Full cluster configuration without any credentials or personal data.
Region: us-east-1

Image:
  Os: alinux2

HeadNode:
  InstanceType: m5.xlarge
  Networking:
    SubnetId: subnet-0aa9d3bd709f86d50
    SecurityGroups:
    - sg-045290f659c3be158
  Ssh:
    KeyName: pcluster
  LocalStorage:
    RootVolume:
      Size: 600
      VolumeType: gp3
  Iam:
    AdditionalIamPolicies:
    - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
  SlurmQueues:
  - Name: oar
    ComputeResources:
    - Name: m5xlarge
      InstanceType: m5.xlarge
      MinCount: 2
      MaxCount: 10
    Networking:
      SubnetIds:
      - subnet-0aa9d3bd709f86d50
      SecurityGroups:
      - sg-045290f659c3be158
    Iam:
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  - Name: oarm6id24xlarge
    ComputeSettings:
      LocalStorage:
        RootVolume:
          Size: 500
    ComputeResources:
    - Name: m6id24xlarge
      InstanceType: m6id.24xlarge
      MinCount: 0
      MaxCount: 2
    Networking:
      SubnetIds:
      - subnet-0aa9d3bd709f86d50
      SecurityGroups:
      - sg-045290f659c3be158
    Iam:
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

SharedStorage:
- MountDir: /share
  Name: efs-covid
  StorageType: Efs
  EfsSettings:
    FileSystemId: fs-7fce8efd

#AdditionalPackages:
#  IntelSoftware:
#    IntelHpcPlatform: true

DirectoryService:
  DomainName: dc=ncisdev,dc=noaa
  DomainAddr: ldap://10.101.14.78,ldap://10.101.9.221
  PasswordSecretArn: arn:aws:secretsmanager:us-east-1:716453263077:secret:MicrosoftAD.Admin.Password-gDGZv6
  DomainReadOnlyUser: cn=adjoin,ou=service,ou=NCISDEV,dc=ncisdev,dc=noaa
  AdditionalSssdConfigs:
    ldap_auth_disable_tls_never_use_in_production: True

Tags:
  - Key: noaa:environment
    Value: dev
  - Key: noaa:fismaid
    Value: noaa5006
  - Key: noaa:lineoffice
    Value: nesdis
  - Key: noaa:programoffice
    Value: 40-00
  - Key: noaa:taskerorderid
    Value: 13051420fneea0147
  - Key: noaa:application
    Value: OAR WRF-Chem

  • Cluster name: oar-pcluster

  • Output of pcluster describe-cluster command.

{
  "creationTime": "2022-06-30T19:35:47.455Z",
  "headNode": {
    "launchTime": "2022-06-30T19:38:44.000Z",
    "instanceId": "i-00c4711259d394ae3",
    "instanceType": "m5.xlarge",
    "state": "running",
    "privateIpAddress": "10.102.8.85"
  },
  "version": "3.1.4",
  "clusterConfiguration": {
    "url": "https://parallelcluster-e2ca1557272da85b-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.1.4/clusters/oar-pcluster-wtisscfpi6pepclv/configs/cluster-config.yaml?versionId=ZkTmb6RC_GHCxLLymscOGWLSeh0womE4&AWSAccessKeyId=AKIA2NT7RG3SULLIW55M&Signature=EPFIQnc8YB%2BsP3CunBz3WBPvLfY%3D&Expires=1661212485"
  },
  "tags": [
    {
      "value": "13051420fneea0147",
      "key": "noaa:taskerorderid"
    },
    {
      "value": "3.1.4",
      "key": "parallelcluster:version"
    },
    {
      "value": "noaa5006",
      "key": "noaa:fismaid"
    },
    {
      "value": "dev",
      "key": "noaa:environment"
    },
    {
      "value": "OAR WRF-Chem",
      "key": "noaa:application"
    },
    {
      "value": "nesdis",
      "key": "noaa:lineoffice"
    },
    {
      "value": "40-00",
      "key": "noaa:programoffice"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "clusterName": "oar-pcluster",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:716453263077:stack/oar-pcluster/d7077910-f8ab-11ec-aa98-0e38e0433449",
  "lastUpdatedTime": "2022-08-22T22:34:59.022Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_FAILED"
}

  • [Optional] Arn of the cluster CloudFormation main stack:
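The describe-cluster output above shows the cluster stuck in `UPDATE_FAILED` with the CloudFormation stack in `UPDATE_ROLLBACK_COMPLETE`, i.e. the stack rolled back after the update error. A minimal sketch of detecting this state from the saved JSON (the helper name and reading the output from a string are assumptions for illustration, not part of the pcluster CLI):

```python
import json

def update_failed(describe_output: str) -> bool:
    """Return True if the describe-cluster JSON shows a rolled-back, failed update."""
    info = json.loads(describe_output)
    # UPDATE_FAILED plus UPDATE_ROLLBACK_COMPLETE means CloudFormation
    # rejected the change set and restored the previous stack state.
    return (info.get("clusterStatus") == "UPDATE_FAILED"
            and info.get("cloudFormationStackStatus") == "UPDATE_ROLLBACK_COMPLETE")

sample = ('{"clusterStatus": "UPDATE_FAILED", '
          '"cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE"}')
print(update_failed(sample))  # True
```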

Bug description and how to reproduce:
When trying to update the already existing cluster with the configuration above, I get an error.

Interface: [eni-0966644c5bb6b0347] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 557b2408-e32b-4d37-adfd-a0332a8c0ce9; Proxy: null)

The only thing that is changing from the original configuration is the addition of the Slurm queue oarm6id24xlarge.
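The `InvalidNetworkInterface.InUse` error names the blocking ENI. A small, illustrative helper (not part of ParallelCluster) to pull the interface id out of such a message, so it can then be inspected, e.g. with `aws ec2 describe-network-interfaces`, to see which resource still holds it:

```python
import re

def blocked_eni(error_message: str):
    """Extract the first EC2 network interface id from an error message, if any."""
    # EC2 ENI ids look like "eni-" followed by lowercase hex characters.
    match = re.search(r"eni-[0-9a-f]+", error_message)
    return match.group(0) if match else None

msg = ("Interface: [eni-0966644c5bb6b0347] in use. "
       "(Service: AmazonEC2; Status Code: 400; "
       "Error Code: InvalidNetworkInterface.InUse)")
print(blocked_eni(msg))  # eni-0966644c5bb6b0347
```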

If you are reporting issues about scaling or job failure:
We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attaching the complete cluster log archive to the ticket.

For issues with Slurm scheduler, please attach the following logs:

  • From Head node: /var/log/parallelcluster/clustermgtd, /var/log/parallelcluster/clusterstatusmgtd (if version >= 3.2.0), /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, /var/log/parallelcluster/slurm_fleet_status_manager.log (if version >= 3.2.0) and /var/log/slurmctld.log.
  • From Compute node: /var/log/parallelcluster/computemgtd.log and /var/log/slurmd.log.

If you are reporting issues about cluster creation failure or node failure:

If the cluster fails creation, please re-execute create-cluster action using --rollback-on-failure false option.

We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attaching the complete cluster log archive to the ticket.

Please be sure to attach the following logs:

  • From Head node: /var/log/cloud-init.log, /var/log/cfn-init.log and /var/log/chef-client.log
  • From Compute node: /var/log/cloud-init-output.log.

Additional context:
Any other context about the problem. E.g.:

  • CLI logs: ~/.parallelcluster/pcluster-cli.log
  • Custom bootstrap scripts, if any
  • Screenshots, if useful.
