Description
ParallelCluster 3.11.1
Deployed with https://github.com/aws-samples/aws-eda-slurm-cluster
Every so often, srun fails with an error message:
srun: error: Node failure on od-r7i-xl-dy-od-32-gb-2-cores-3
srun: error: Nodes od-r7i-xl-dy-od-32-gb-2-cores-3 are still not ready
srun: error: Something is wrong with the boot of the nodes.
Here is what I've found in slurmctld.log (attached).
In the log I can see job 1338386 is allocated the on-demand node od-r7i-xl-dy-od-32-gb-2-cores-3 on 2025-05-20 at 23:30:41, and then the job is killed slightly less than four minutes later because the node fails its scheduler health check:
[2025-05-20T23:30:41.412] sched: _slurm_rpc_allocate_resources JobId=1338386 NodeList=od-r7i-xl-dy-od-32-gb-2-cores-3 usec=3682
[2025-05-20T23:30:52.005] POWER: no more nodes to resume for job JobId=1338386
[2025-05-20T23:30:52.005] POWER: power_save: waking nodes od-r7a-4xl-dy-od-128-gb-16-cores-2,od-r7i-xl-dy-od-32-gb-2-cores-3
...
[2025-05-20T23:34:40.465] update_node: node od-r7i-xl-dy-od-32-gb-2-cores-3 reason set to: Scheduler health check failed
[2025-05-20T23:34:40.465] Killing JobId=1338386 on failed node od-r7i-xl-dy-od-32-gb-2-cores-3
[2025-05-20T23:34:40.465] powering down node od-r7i-xl-dy-od-32-gb-2-cores-3
These machines take anywhere from 4 to 7 minutes to boot, so a node that is marked failed less than four minutes after the wake request may simply not have finished booting yet. Many of my srun commands work just fine; only occasionally do my users see this error.
I do see the following set in my slurm.conf:
SlurmctldTimeout=300
SlurmdTimeout=180
though I'm not sure how those tie into the health check failing.
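For reference, the full set of timeout-related parameters the controller is actually running with can be dumped on the head node with something like the following (the grep pattern is just illustrative):

# List every timeout/suspend/resume-related setting slurmctld sees
scontrol show config | grep -iE 'timeout|resume|suspend'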
Now that I am looking for them, I do see lots of "Scheduler health check failed" messages in the log. We don't usually "notice" them because most of my interaction is with sbatch, and those jobs just get requeued. With srun, though, you end up with a failure.
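A quick way to get a sense of how often this is happening (adjust the path if your slurmctld.log lives somewhere else):

# Count how many times the controller has flagged a node this way
grep -c 'Scheduler health check failed' /var/log/slurmctld.log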
Why is Slurm checking the health of the node so quickly? Is there any way to change this timeout? I could not find a configuration variable for it.