Description
Required Info:
- AWS ParallelCluster version: 3.1.4
- Full cluster configuration (without any credentials or personal data):
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: m5.xlarge
  Networking:
    SubnetId: subnet-0aa9d3bd709f86d50
    SecurityGroups:
      - sg-045290f659c3be158
  Ssh:
    KeyName: pcluster
  LocalStorage:
    RootVolume:
      Size: 600
      VolumeType: gp3
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
  SlurmQueues:
    - Name: oar
      ComputeResources:
        - Name: m5xlarge
          InstanceType: m5.xlarge
          MinCount: 2
          MaxCount: 10
      Networking:
        SubnetIds:
          - subnet-0aa9d3bd709f86d50
        SecurityGroups:
          - sg-045290f659c3be158
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - Name: oarm6id24xlarge
      ComputeSettings:
        LocalStorage:
          RootVolume:
            Size: 500
      ComputeResources:
        - Name: m6id24xlarge
          InstanceType: m6id.24xlarge
          MinCount: 0
          MaxCount: 2
      Networking:
        SubnetIds:
          - subnet-0aa9d3bd709f86d50
        SecurityGroups:
          - sg-045290f659c3be158
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
SharedStorage:
  - MountDir: /share
    Name: efs-covid
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-7fce8efd
#AdditionalPackages:
#  IntelSoftware:
#    IntelHpcPlatform: true
DirectoryService:
  DomainName: dc=ncisdev,dc=noaa
  DomainAddr: ldap://10.101.14.78,ldap://10.101.9.221
  PasswordSecretArn: arn:aws:secretsmanager:us-east-1:716453263077:secret:MicrosoftAD.Admin.Password-gDGZv6
  DomainReadOnlyUser: cn=adjoin,ou=service,ou=NCISDEV,dc=ncisdev,dc=noaa
  AdditionalSssdConfigs:
    ldap_auth_disable_tls_never_use_in_production: True
Tags:
  - Key: noaa:environment
    Value: dev
  - Key: noaa:fismaid
    Value: noaa5006
  - Key: noaa:lineoffice
    Value: nesdis
  - Key: noaa:programoffice
    Value: 40-00
  - Key: noaa:taskerorderid
    Value: 13051420fneea0147
  - Key: noaa:application
    Value: OAR WRF-Chem
- Cluster name: oar-pcluster
- Output of the pcluster describe-cluster command:
{
  "creationTime": "2022-06-30T19:35:47.455Z",
  "headNode": {
    "launchTime": "2022-06-30T19:38:44.000Z",
    "instanceId": "i-00c4711259d394ae3",
    "instanceType": "m5.xlarge",
    "state": "running",
    "privateIpAddress": "10.102.8.85"
  },
  "version": "3.1.4",
  "clusterConfiguration": {
    "url": "https://parallelcluster-e2ca1557272da85b-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.1.4/clusters/oar-pcluster-wtisscfpi6pepclv/configs/cluster-config.yaml?versionId=ZkTmb6RC_GHCxLLymscOGWLSeh0womE4&AWSAccessKeyId=AKIA2NT7RG3SULLIW55M&Signature=EPFIQnc8YB%2BsP3CunBz3WBPvLfY%3D&Expires=1661212485"
  },
  "tags": [
    {
      "value": "13051420fneea0147",
      "key": "noaa:taskerorderid"
    },
    {
      "value": "3.1.4",
      "key": "parallelcluster:version"
    },
    {
      "value": "noaa5006",
      "key": "noaa:fismaid"
    },
    {
      "value": "dev",
      "key": "noaa:environment"
    },
    {
      "value": "OAR WRF-Chem",
      "key": "noaa:application"
    },
    {
      "value": "nesdis",
      "key": "noaa:lineoffice"
    },
    {
      "value": "40-00",
      "key": "noaa:programoffice"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "clusterName": "oar-pcluster",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:716453263077:stack/oar-pcluster/d7077910-f8ab-11ec-aa98-0e38e0433449",
  "lastUpdatedTime": "2022-08-22T22:34:59.022Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_FAILED"
}
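For reference, the output above can be regenerated with the standard describe-cluster call (region and cluster name taken from the report above):

  # Regenerates the describe-cluster output shown above.
  pcluster describe-cluster --cluster-name oar-pcluster --region us-east-1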
- [Optional] ARN of the cluster CloudFormation main stack: arn:aws:cloudformation:us-east-1:716453263077:stack/oar-pcluster/d7077910-f8ab-11ec-aa98-0e38e0433449
Bug description and how to reproduce:
When trying to update the already existing cluster with the configuration above, I get the following error:
Interface: [eni-0966644c5bb6b0347] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 557b2408-e32b-4d37-adfd-a0332a8c0ce9; Proxy: null)
The only change from the original configuration is the addition of the Slurm queue oarm6id24xlarge.
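A minimal sketch of the update call that triggers the error, assuming the configuration above is saved as cluster-config.yaml (the file path is an assumption; substitute the actual one):

  # Assumed config file name; this is the plain update-cluster invocation that fails.
  pcluster update-cluster \
    --cluster-name oar-pcluster \
    --cluster-configuration cluster-config.yaml \
    --region us-east-1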
If you are reporting issues about scaling or job failure:
We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attaching the complete cluster log archive with the ticket.
For issues with the Slurm scheduler, please attach the following logs:
- From the head node: /var/log/parallelcluster/clustermgtd, /var/log/parallelcluster/clusterstatusmgtd (if version >= 3.2.0), /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, /var/log/parallelcluster/slurm_fleet_status_manager.log (if version >= 3.2.0), and /var/log/slurmctld.log.
- From compute nodes: /var/log/parallelcluster/computemgtd.log and /var/log/slurmd.log.
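One way to produce the complete cluster log archive mentioned above is the export-cluster-logs command; a sketch follows, where the bucket name is a placeholder for an existing S3 bucket you own:

  # The bucket name is a placeholder; use an existing S3 bucket in your account.
  pcluster export-cluster-logs \
    --cluster-name oar-pcluster \
    --bucket my-log-bucket \
    --region us-east-1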
If you are reporting issues about cluster creation failure or node failure:
If the cluster fails creation, please re-execute the create-cluster action using the --rollback-on-failure false option.
We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attaching the complete cluster log archive with the ticket.
Please be sure to attach the following logs:
- From the head node: /var/log/cloud-init.log, /var/log/cfn-init.log, and /var/log/chef-client.log
- From compute nodes: /var/log/cloud-init-output.log
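A rough sketch of re-creating the cluster with rollback disabled so that failed resources remain available for log collection; the configuration file name is an assumption:

  # --rollback-on-failure false keeps failed resources around for debugging.
  # cluster-config.yaml is an assumed file name.
  pcluster create-cluster \
    --cluster-name oar-pcluster \
    --cluster-configuration cluster-config.yaml \
    --region us-east-1 \
    --rollback-on-failure false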
Additional context:
Any other context about the problem, e.g.:
- CLI logs: ~/.parallelcluster/pcluster-cli.log
- Custom bootstrap scripts, if any
- Screenshots, if useful