PCluster 3.13.0 lfs hsm_restore degraded performance #6799

Open
@stefan-vaisala

Required Info:

  • AWS ParallelCluster version: 3.13.0 with base PCluster AL2023 AMI
  • FSx lustre configuration from PCluster YAML:
  - Name: fs
    StorageType: FsxLustre
    MountDir: [redacted]
    FsxLustreSettings:
      StorageCapacity: 28800
      DeploymentType: PERSISTENT_1
      PerUnitStorageThroughput: 200
      ImportedFileChunkSize: 1024
      ImportPath: [redacted]
      WeeklyMaintenanceStartTime: "6:00:00"

Bug description and how to reproduce:
In our cluster, we spin up a Persistent 1 FSx for Lustre file system that is linked to an S3 bucket holding ~345 GB across ~37,000 objects. During our post-install procedures, we warm the FSx for Lustre file system by running find [...] -type f -print0 | xargs -0 --max-args=1 --max-procs=16 sudo lfs hsm_restore from the head node, which is a c7a.8xlarge instance.
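To measure warm-up progress independently of how long the restore commands themselves take, one option is to count how many files lfs hsm_state still reports as released. The following is a minimal sketch, not part of our actual post-install script: the helper name count_released and the /fsx mount point are hypothetical, and the pattern assumes the usual hsm_state output of the form "path: (0x0000000d) released exists archived, archive_id:1".

```shell
#!/usr/bin/env bash
# Count files that `lfs hsm_state` still reports as released (data not yet
# loaded from S3). Reads hsm_state output on stdin, so it can be exercised
# without a Lustre mount.
# Assumption: "released" is printed directly after the hex flag word.
count_released() {
    # grep -c prints 0 but exits non-zero when nothing matches; `|| true`
    # keeps the helper usable under `set -e`.
    grep -cE '\(0x[0-9a-fA-F]+\) released' || true
}

# Hypothetical usage (/fsx is an assumed mount point, not from this report):
#   find /fsx -type f -print0 | xargs -0 lfs hsm_state | count_released
```

A count that stays flat while the restore commands are running would point to requests that are queued but not being serviced.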

This hsm_restore process typically takes 5-6 minutes to warm all 345 GB, and the timing has been consistent from PCluster 3.8.0 through 3.12.0. With PCluster 3.13.0, however, the hsm_restore process never completes (we have let it run for over an hour), and the PCluster CloudFormation stack ends in "Create Failed" because of the HeadNode timeout. The FSx metrics suggest that the hsm_restore requests are not actually being executed on the file system: IOPS and network throughput stay at effectively idle levels. In addition, the hsm_restore processes appear to go into the "sleep" state on the HeadNode immediately. We also noticed that the Lustre client version went from 2.15.3 in PCluster 3.12.0 to 2.15.6 in PCluster 3.13.0.
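For anyone trying to reproduce the "sleep" observation, a rough way to tally the states of the lfs hsm_restore processes on the head node is sketched below. The helper name summarize_states is hypothetical. It assumes input lines of the form "STAT ARGS" as produced by ps -e -o stat= -o args=, and it counts only processes whose command is lfs itself, so the sudo and xargs wrappers are ignored.

```shell
#!/usr/bin/env bash
# Tally process states (first letter of the STAT column, e.g. S = sleeping,
# D = uninterruptible sleep, R = running) for `lfs hsm_restore` processes.
# Reads "STAT ARGS" lines on stdin.
summarize_states() {
    awk '$2 ~ /(^|\/)lfs$/ && $3 == "hsm_restore" { n[substr($1, 1, 1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Hypothetical usage on the head node:
#   ps -e -o stat= -o args= | summarize_states
```

Output dominated by S or D while the FSx metrics stay idle would match the behavior described above.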

We are hoping you can provide some insight into this, as we currently can't spin up clusters with PCluster 3.13.0 unless we remove the hsm_restore commands, because restore performance is so severely degraded. Thanks! - Stefan
