
(2.8.1 and earlier) Slurm node list misconfiguration breaks termination of idle compute nodes


The issue

When using the Slurm scheduler with ParallelCluster versions earlier than 2.9.0, if the Slurm node list configuration becomes corrupted, the compute fleet management daemons may be unable to terminate idle compute nodes, resulting in unexpected costs. Such a misconfiguration can be caused either by a user modifying the scheduler configuration or by a system error that breaks the ParallelCluster management logic; for example, this can happen when the file system is full and the sqswatcher daemon is unable to properly update the scheduler configuration.

Affected ParallelCluster versions

This issue affects all versions of ParallelCluster from 2.3.1 through 2.8.1. ParallelCluster 2.9.0, released in September 2020, introduced an improved approach to Slurm scaling that addresses this corner case.

Error details

To determine whether a node is idle and can self-terminate, the nodewatcher daemon (which runs on Slurm compute nodes in ParallelCluster versions earlier than 2.9.0) reads the output of the squeue command.

A broken Slurm configuration causes squeue to return an error message in its output, which the nodewatcher does not parse correctly. As a result, the node is marked as active even when it has no running or pending jobs.

For example, a "no space left on device" system error may interfere with the commands executed to scale down the cluster, causing an incomplete cleanup of the hostnames associated with terminated instances. When this happens, the output of the squeue command contains the following message: squeue: error: Duplicated NodeHostName ip-10-31-18-157 in the config file.
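The sketch below is a minimal illustration (not the actual nodewatcher source) of why such an error message can keep an idle node alive: a check that naively treats any non-empty squeue output as a running job will also count an error line such as the one above. The exact command invocation and parsing shown here are assumptions for illustration only.

```python
import socket
import subprocess


def has_jobs(hostname: str) -> bool:
    """Illustrative only: naively treat any non-empty squeue output as active jobs.

    The real nodewatcher logic differs; this sketch only shows how an error
    printed by squeue (e.g. 'squeue: error: Duplicated NodeHostName ...') can
    make an idle node look busy.
    """
    # List jobs assigned to this node; '-h' suppresses the header, '-o %i' prints job IDs.
    result = subprocess.run(
        ["squeue", "-w", hostname, "-h", "-o", "%i"],
        capture_output=True,
        text=True,
    )
    # Brittle check: any output line counts as a job, so an squeue error string
    # is indistinguishable from a real job and the node is marked as active.
    output = (result.stdout + result.stderr).strip()
    return bool(output)


if __name__ == "__main__":
    if has_jobs(socket.gethostname()):
        print("Instance has active jobs.")
    else:
        print("Instance is idle.")
```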

The faulty behavior can be identified by inspecting the /var/log/nodewatcher log file on the compute nodes. When the problem occurs, the log contains an Instance has active jobs message even though the preceding line lists no running jobs, showing instead the error message from the squeue command:

```
2021-12-01 09:39:45,389 INFO [slurm:has_jobs] Found the following running jobs:
squeue: error: Duplicated NodeHostName ip-10-31-18-157 in the config file
2021-12-01 09:39:45,389 INFO [nodewatcher:_poll_instance_status] Instance has active jobs.
```
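As a quick check, a sketch like the following can scan the nodewatcher log for the tell-tale combination of an squeue error followed by the "Instance has active jobs." message. The log path and message patterns are taken from the excerpt above and are assumptions; adjust them to match what you see on your nodes.

```python
import re
from pathlib import Path

# Assumed location of the nodewatcher log on the compute node (see the excerpt above).
LOG_FILE = Path("/var/log/nodewatcher")

# Patterns based on the log excerpt above.
SQUEUE_ERROR = re.compile(r"squeue: error: .* in the config file")
ACTIVE_JOBS = re.compile(r"Instance has active jobs")


def node_looks_affected(log_path: Path = LOG_FILE) -> bool:
    """Return True if an squeue error line is followed by an 'active jobs' message."""
    saw_squeue_error = False
    for line in log_path.read_text(errors="replace").splitlines():
        if "Found the following running jobs" in line:
            # A new polling cycle starts: reset the flag.
            saw_squeue_error = False
        elif SQUEUE_ERROR.search(line):
            saw_squeue_error = True
        elif ACTIVE_JOBS.search(line) and saw_squeue_error:
            return True
    return False


if __name__ == "__main__":
    if node_looks_affected():
        print("This node appears affected: squeue errors are keeping it marked as active.")
    else:
        print("No sign of the squeue misconfiguration symptom in the nodewatcher log.")
```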

The solution

We strongly recommend deleting the cluster and creating a new one with an updated ParallelCluster version as soon as possible. See the AWS ParallelCluster Support Policy for the end-of-support dates of older versions and the deprecation strategy.
