Description
ParallelCluster 3.11.1
Deployed with https://github.com/aws-samples/aws-eda-slurm-cluster
Every so often, srun fails with an error message:
srun: error: Node failure on od-r7i-xl-dy-od-32-gb-2-cores-3
srun: error: Nodes od-r7i-xl-dy-od-32-gb-2-cores-3 are still not ready
srun: error: Something is wrong with the boot of the nodes.
Here is what I've found in slurmctld.log (attached).
In the log I can see job 1338386 is allocated the on-demand node od-r7i-xl-dy-od-32-gb-2-cores-3 on 2025-05-20 at 23:30:41, and then the job is killed slightly less than four minutes later because the node fails its scheduler health check:
[2025-05-20T23:30:41.412] sched: _slurm_rpc_allocate_resources JobId=1338386 NodeList=od-r7i-xl-dy-od-32-gb-2-cores-3 usec=3682
[2025-05-20T23:30:52.005] POWER: no more nodes to resume for job JobId=1338386
[2025-05-20T23:30:52.005] POWER: power_save: waking nodes od-r7a-4xl-dy-od-128-gb-16-cores-2,od-r7i-xl-dy-od-32-gb-2-cores-3
...
[2025-05-20T23:34:40.465] update_node: node od-r7i-xl-dy-od-32-gb-2-cores-3 reason set to: Scheduler health check failed
[2025-05-20T23:34:40.465] Killing JobId=1338386 on failed node od-r7i-xl-dy-od-32-gb-2-cores-3
[2025-05-20T23:34:40.465] powering down node od-r7i-xl-dy-od-32-gb-2-cores-3
These machines take anywhere from 4 to 7 minutes to boot, so a node that is marked failed less than four minutes after the wake request may simply not have finished booting yet. Many of my srun commands work just fine; only occasionally do my users see this error.
I do see the following set in my slurm.conf:
SlurmctldTimeout=300
SlurmdTimeout=180
though I'm not sure how those tie into the health check failing.
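For reference, the full set of timeout-related parameters the controller is actually running with can be dumped on the head node with something like the following (the grep pattern is just illustrative):

# List every timeout/suspend/resume-related setting slurmctld sees
scontrol show config | grep -iE 'timeout|resume|suspend'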
Now that I am looking for them, I do see lots of "Scheduler health check failed" messages in the log. We don't usually "notice" them because most of my interaction is with sbatch, and those jobs just get requeued. With srun, though, you end up with a failure.
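A quick way to get a sense of how often this is happening (adjust the path if your slurmctld.log lives somewhere else):

# Count how many times the controller has flagged a node this way
grep -c 'Scheduler health check failed' /var/log/slurmctld.log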
Why is Slurm checking the health of the node so quickly? Is there any way to change this timeout? I could not find a configuration variable for it.