parallelcluster 3.11.1: "srun failure: node error not ready" #6842

Open
@gwolski

parallelcluster 3.11.1
Deploying with https://github.com/aws-samples/aws-eda-slurm-cluster

Every so often, srun fails with an error message:

srun: error: Node failure on od-r7i-xl-dy-od-32-gb-2-cores-3
srun: error: Nodes od-r7i-xl-dy-od-32-gb-2-cores-3 are still not ready
srun: error: Something is wrong with the boot of the nodes.

Here is what I found in the attached slurmctld.log.
Job 1338386 is allocated the on-demand node od-r7i-xl-dy-od-32-gb-2-cores-3 on 2025-05-20 at 23:30:41, and the node is killed slightly less than four minutes later with the reason "Scheduler health check failed":

[2025-05-20T23:30:41.412] sched: _slurm_rpc_allocate_resources JobId=1338386 NodeList=od-r7i-xl-dy-od-32-gb-2-cores-3 usec=3682
[2025-05-20T23:30:52.005] POWER: no more nodes to resume for job JobId=1338386
[2025-05-20T23:30:52.005] POWER: power_save: waking nodes od-r7a-4xl-dy-od-128-gb-16-cores-2,od-r7i-xl-dy-od-32-gb-2-cores-3
...
[2025-05-20T23:34:40.465] update_node: node od-r7i-xl-dy-od-32-gb-2-cores-3 reason set to: Scheduler health check failed
[2025-05-20T23:34:40.465] Killing JobId=1338386 on failed node od-r7i-xl-dy-od-32-gb-2-cores-3
[2025-05-20T23:34:40.465] powering down node od-r7i-xl-dy-od-32-gb-2-cores-3

These machines take anywhere from 4 to 7 minutes to boot. Most of my srun commands work just fine; my users see this error only occasionally.
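To put numbers on the excerpt above, here is a minimal Python sketch that measures how long the node had before the health check failed it. The log lines are copied from the excerpt; the helper names `parse_timestamp` and `seconds_between` are illustrative and not part of Slurm or ParallelCluster.

```python
from datetime import datetime

# Lines copied verbatim from the slurmctld.log excerpt in this issue.
LOG_EXCERPT = """\
[2025-05-20T23:30:41.412] sched: _slurm_rpc_allocate_resources JobId=1338386 NodeList=od-r7i-xl-dy-od-32-gb-2-cores-3 usec=3682
[2025-05-20T23:30:52.005] POWER: power_save: waking nodes od-r7a-4xl-dy-od-128-gb-16-cores-2,od-r7i-xl-dy-od-32-gb-2-cores-3
[2025-05-20T23:34:40.465] update_node: node od-r7i-xl-dy-od-32-gb-2-cores-3 reason set to: Scheduler health check failed
"""


def parse_timestamp(line):
    """Return the datetime found in the leading [...] of a slurmctld.log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")


def seconds_between(log_text, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the next
    line containing end_marker (hypothetical helper for this sketch)."""
    start = None
    for line in log_text.splitlines():
        if not line.startswith("["):
            continue  # skip elided or continuation lines such as "..."
        if start is None:
            if start_marker in line:
                start = parse_timestamp(line)
        elif end_marker in line:
            return (parse_timestamp(line) - start).total_seconds()
    raise ValueError("start or end marker not found in log text")


FAILED = "Scheduler health check failed"
elapsed_from_alloc = seconds_between(LOG_EXCERPT, "_slurm_rpc_allocate_resources", FAILED)
elapsed_from_wake = seconds_between(LOG_EXCERPT, "power_save: waking nodes", FAILED)

print(f"allocation -> health check failed: {elapsed_from_alloc:.1f} s")
print(f"wake-up    -> health check failed: {elapsed_from_wake:.1f} s")
```

Both intervals are under 240 seconds, so in this instance the node was marked failed before even the low end of the 4 to 7 minute boot window had passed.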

slurmctld.log

I do see the following set in my slurm.conf:

SlurmctldTimeout=300
SlurmdTimeout=180

though I'm not sure how those settings relate to the health check failing.

Now that I am looking for them, I do see lots of "Scheduler health check failed" messages. We don't normally notice them because most of my interaction is with sbatch, and those jobs just get requeued. With srun, however, the command simply fails.
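For anyone who wants to quantify how often this happens, the following is a small Python sketch that counts these messages per node. It assumes every occurrence has the same `update_node: ... reason set to: Scheduler health check failed` shape as the line in the excerpt above; the function name and the sample list are illustrative only.

```python
import re
from collections import Counter

# Pattern based on the single example line in this issue; other message
# variants, if any exist, would not be matched (assumption).
HEALTH_FAIL = re.compile(
    r"update_node: node (\S+) reason set to: Scheduler health check failed"
)


def count_health_check_failures(lines):
    """Count 'Scheduler health check failed' events per node name."""
    counts = Counter()
    for line in lines:
        match = HEALTH_FAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: lines copied from the excerpt in this issue.
SAMPLE = [
    "[2025-05-20T23:30:52.005] POWER: no more nodes to resume for job JobId=1338386",
    "[2025-05-20T23:34:40.465] update_node: node od-r7i-xl-dy-od-32-gb-2-cores-3 reason set to: Scheduler health check failed",
    "[2025-05-20T23:34:40.465] powering down node od-r7i-xl-dy-od-32-gb-2-cores-3",
]

failures = count_health_check_failures(SAMPLE)
for node, n in failures.most_common():
    print(f"{n:4d}  {node}")
```

Because the function accepts any iterable of lines, an open file handle for the real slurmctld.log can be passed in place of `SAMPLE`.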

Why is Slurm checking the health of the node so soon? Is there any way to change this timeout? I could not find a configuration variable for it.
