Description
Required Info:
- AWS ParallelCluster version: 3.13.0, with the base PCluster AL2023 AMI
- FSx for Lustre configuration from the PCluster YAML:

```yaml
- Name: fs
  StorageType: FsxLustre
  MountDir: [redacted]
  FsxLustreSettings:
    StorageCapacity: 28800
    DeploymentType: PERSISTENT_1
    PerUnitStorageThroughput: 200
    ImportedFileChunkSize: 1024
    ImportPath: [redacted]
    WeeklyMaintenanceStartTime: "6:00:00"
```
Bug description and how to reproduce:
In our cluster we spin up a PERSISTENT_1 FSx for Lustre file system linked to an S3 bucket containing ~345 GB across ~37,000 objects. During our post-install procedures we warm the file system with `find [...] -type f -print0 | xargs -0 --max-args=1 --max-procs=16 sudo lfs hsm_restore`, executed from the head node, a c7a.8xlarge instance.
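For reference, here is a minimal sketch of that warming step. The mount point `/fsx` is a placeholder standing in for our redacted MountDir, and the `lfs hsm_action` polling loop at the end is an optional illustration of how completion can be checked, not part of the original failure case:

```sh
#!/bin/bash
set -eu

# Placeholder for our redacted MountDir.
FSX_MOUNT=/fsx

# Issue an hsm_restore for every file, 16 requests in flight at a time
# (same shape as the pipeline described above).
find "$FSX_MOUNT" -type f -print0 \
  | xargs -0 --max-args=1 --max-procs=16 sudo lfs hsm_restore

# Optional illustration: poll until no file still reports an in-progress
# RESTORE action.
while find "$FSX_MOUNT" -type f -print0 \
        | xargs -0 sudo lfs hsm_action \
        | grep -q RESTORE; do
  echo "hsm_restore still in progress..."
  sleep 30
done
echo "warming complete"
```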
This hsm_restore pass typically takes 5-6 minutes to warm all ~345 GB, and that timing has been consistent from PCluster 3.8.0 through 3.12.0. With PCluster 3.13.0, however, the hsm_restore pass never completes (we have let it run for over an hour), and the PCluster CloudFormation stack fails with "Create Failed" due to the HeadNode timeout.

Looking at the FSx metrics, the file system sits effectively idle, with very little IOPS or network throughput, which suggests the hsm_restore requests are not actually being serviced by the file system. On the HeadNode, the hsm_restore processes go into "sleep" state almost immediately. We also noticed that the Lustre client version changed from 2.15.3 in PCluster 3.12.0 to 2.15.6 in PCluster 3.13.0.
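In case it helps with reproducing, this is roughly how we inspected the head node while the warm-up was stuck; a sketch assuming the stock PCluster AL2023 AMI tooling (package naming may differ):

```sh
# Lustre client version as reported by the client (where we saw 2.15.3 on
# 3.12.0 vs 2.15.6 on 3.13.0).
sudo lctl get_param version

# Installed Lustre client packages (package naming is an assumption and may
# differ per AMI).
rpm -qa | grep -i lustre

# State of the restore workers while the warm-up is "running"; on 3.13.0 they
# go to sleep ("S" in STAT) almost immediately.
ps -o pid,stat,wchan:32,cmd -C lfs
```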
We are hoping you can provide some insight here, as we currently cannot spin up clusters on PCluster 3.13.0 without removing the hsm_restore commands, due to this severely degraded performance. Thanks! - Stefan