Description
Required Info:
- AWS ParallelCluster version: 3.13.0, with the base PCluster AL2023 AMI
- FSx for Lustre configuration from the PCluster YAML:

```yaml
- Name: fs
  StorageType: FsxLustre
  MountDir: [redacted]
  FsxLustreSettings:
    StorageCapacity: 28800
    DeploymentType: PERSISTENT_1
    PerUnitStorageThroughput: 200
    ImportedFileChunkSize: 1024
    ImportPath: [redacted]
    WeeklyMaintenanceStartTime: "6:00:00"
```
Bug description and how to reproduce:
In our cluster we spin up a PERSISTENT_1 FSx for Lustre file system linked to an S3 bucket containing ~345 GB across ~37,000 objects. During our post-install procedures we warm the file system with `find [...] -type f -print0 | xargs -0 --max-args=1 --max-procs=16 sudo lfs hsm_restore`, executed from the head node, a c7a.8xlarge instance.
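For reference, here is a minimal sketch of that warming step. The mount point `/fsx` is a placeholder standing in for our redacted MountDir, and the `lfs hsm_action` polling loop at the end is an optional illustration of how completion can be checked, not part of the original failure case:

```sh
#!/bin/bash
set -eu

# Placeholder for our redacted MountDir.
FSX_MOUNT=/fsx

# Issue an hsm_restore for every file, 16 requests in flight at a time
# (same shape as the pipeline described above).
find "$FSX_MOUNT" -type f -print0 \
  | xargs -0 --max-args=1 --max-procs=16 sudo lfs hsm_restore

# Optional illustration: poll until no file still reports an in-progress
# RESTORE action.
while find "$FSX_MOUNT" -type f -print0 \
        | xargs -0 sudo lfs hsm_action \
        | grep -q RESTORE; do
  echo "hsm_restore still in progress..."
  sleep 30
done
echo "warming complete"
```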
This hsm_restore pass typically takes 5-6 minutes to warm all ~345 GB, and that timing has been consistent from PCluster 3.8.0 through 3.12.0. With PCluster 3.13.0, however, the hsm_restore pass never completes (we have let it run for over an hour), and the PCluster CloudFormation stack fails with "Create Failed" due to the HeadNode timeout.

Looking at the FSx metrics, the file system sits effectively idle, with very little IOPS or network throughput, which suggests the hsm_restore requests are not actually being serviced by the file system. On the HeadNode, the hsm_restore processes go into "sleep" state almost immediately. We also noticed that the Lustre client version changed from 2.15.3 in PCluster 3.12.0 to 2.15.6 in PCluster 3.13.0.
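In case it helps with reproducing, this is roughly how we inspected the head node while the warm-up was stuck; a sketch assuming the stock PCluster AL2023 AMI tooling (package naming may differ):

```sh
# Lustre client version as reported by the client (where we saw 2.15.3 on
# 3.12.0 vs 2.15.6 on 3.13.0).
sudo lctl get_param version

# Installed Lustre client packages (package naming is an assumption and may
# differ per AMI).
rpm -qa | grep -i lustre

# State of the restore workers while the warm-up is "running"; on 3.13.0 they
# go to sleep ("S" in STAT) almost immediately.
ps -o pid,stat,wchan:32,cmd -C lfs
```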
We are hoping you can provide some insight here, as we currently cannot spin up clusters on PCluster 3.13.0 without removing the hsm_restore commands, due to this severely degraded performance. Thanks! - Stefan