# Prometheus Slurm Exporter

Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.

## Features

- Exports a wide range of metrics from Slurm, including nodes, partitions, jobs, CPUs, and GPUs.
- All metric collectors are optional and can be enabled/disabled via flags.
- Supports TLS and Basic Authentication for secure connections.
- Ready-to-use Grafana dashboard.
## Installation

There are two recommended ways to install the Slurm Exporter.

### From a release binary

This is the easiest method for most users.

1. Download the latest release for your OS and architecture from the GitHub Releases page.

2. Place the `slurm_exporter` binary in a suitable location on a node with Slurm CLI access, such as `/usr/local/bin/`.

3. Ensure the binary is executable:

   ```bash
   chmod +x /usr/local/bin/slurm_exporter
   ```

4. (Optional) To run the exporter as a service, adapt the example systemd unit file provided in this repository at `systemd/slurm_exporter.service`. Copy it to `/etc/systemd/system/slurm_exporter.service` and customize it for your environment (especially the `ExecStart` path); a minimal sketch is shown after this list.

5. Reload the systemd daemon, then enable and start the service:

   ```bash
   sudo systemctl daemon-reload
   sudo systemctl enable slurm_exporter
   sudo systemctl start slurm_exporter
   ```
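For reference, a minimal unit file might look like the following. This is a sketch, not the file shipped in the repository: the `User` account and the `ExecStart` flags are assumptions to adapt to your site.

```ini
[Unit]
Description=Prometheus Slurm Exporter
After=network-online.target

[Service]
# Hypothetical unprivileged service account; create it or substitute an existing one.
User=slurm_exporter
ExecStart=/usr/local/bin/slurm_exporter --web.listen-address=:9341
Restart=on-failure

[Install]
WantedBy=multi-user.target
```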
### Building from source

If you want to build the exporter yourself, you can do so using the provided Makefile.

1. Clone the repository:

   ```bash
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the binary:

   ```bash
   make build
   ```

The new binary will be available at `bin/slurm_exporter`. You can then copy it to a location like `/usr/local/bin/` and set up the systemd service as described in the section above.
## Configuration

The exporter is configured using command-line flags.

Basic execution:

```bash
./slurm_exporter --web.listen-address=":9341"
```

Using a configuration file for web settings (TLS/Basic Auth):

```bash
./slurm_exporter --web.config.file=/path/to/web-config.yml
```

For details on the `web-config.yml` format, see the Exporter Toolkit documentation.
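A minimal `web-config.yml` sketch in the Exporter Toolkit format, assuming you have a certificate/key pair and want a single Basic Auth user (the file paths and the bcrypt hash are placeholders):

```yaml
tls_server_config:
  cert_file: /etc/slurm_exporter/server.crt
  key_file: /etc/slurm_exporter/server.key
basic_auth_users:
  # Value must be a bcrypt hash of the password
  # (e.g. generated with: htpasswd -nBC 10 prometheus).
  prometheus: $2y$10$REPLACE_WITH_A_REAL_BCRYPT_HASH
```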
View help and all available options:

```bash
./slurm_exporter --help
```
| Flag | Description | Default |
|---|---|---|
| `--web.listen-address` | Address to listen on for web interface and telemetry | `:9341` |
| `--web.config.file` | Path to configuration file for TLS/Basic Auth | (none) |
| `--command.timeout` | Timeout for executing Slurm commands | `5s` |
| `--log.level` | Log level: `debug`, `info`, `warn`, `error` | `info` |
| `--log.format` | Log format: `json`, `text` | `text` |
| `--collector.<name>` | Enable the specified collector | `true` (all enabled by default) |
| `--no-collector.<name>` | Disable the specified collector | (none) |
## Collectors

Available collectors: `accounts`, `cpus`, `fairshare`, `gpus`, `info`, `node`, `nodes`, `partitions`, `queue`, `reservations`, `scheduler`, `users`.

By default, all collectors are enabled. You can control which collectors are active using the `--collector.<name>` and `--no-collector.<name>` flags.
Example: disable the `scheduler` and `partitions` collectors:

```bash
./slurm_exporter --no-collector.scheduler --no-collector.partitions
```

Example: disable the `gpus` collector:

```bash
./slurm_exporter --no-collector.gpus
```

Example: run only the `nodes` and `cpus` collectors. This requires disabling all other collectors individually:

```bash
./slurm_exporter \
  --no-collector.accounts \
  --no-collector.fairshare \
  --no-collector.gpus \
  --no-collector.node \
  --no-collector.partitions \
  --no-collector.queue \
  --no-collector.reservations \
  --no-collector.scheduler \
  --no-collector.info \
  --no-collector.users
```
Example: custom timeout and logging:

```bash
./slurm_exporter \
  --command.timeout=10s \
  --log.level=debug \
  --log.format=json
```
## Development

This project requires access to a node with the Slurm CLI (`sinfo`, `squeue`, `sdiag`, etc.).

Requirements:

- Go (version 1.22 or higher recommended)
- Slurm CLI tools available in your `$PATH`

1. Clone this repository:

   ```bash
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the exporter binary:

   ```bash
   make build
   ```

The binary will be available in `bin/slurm_exporter`.
To run all tests:

```bash
make test
```

Clean build artifacts:

```bash
make clean
```

Run the exporter locally:

```bash
bin/slurm_exporter --web.listen-address=:8080
```

Query metrics:

```bash
curl http://localhost:8080/metrics
```
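The response uses the standard Prometheus exposition format. As an illustration only (the values are hypothetical), the `cpus` collector output looks roughly like:

```text
# HELP slurm_cpus_total Total CPUs
# TYPE slurm_cpus_total gauge
slurm_cpus_total 1024
# HELP slurm_cpus_alloc Allocated CPUs
# TYPE slurm_cpus_alloc gauge
slurm_cpus_alloc 768
```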
Advanced build options: you can override the Go version and architecture via environment variables:

```bash
make build GO_VERSION=1.22.2 OS=linux ARCH=amd64
```
## Metrics

The exporter provides a wide range of metrics, each collected by a specific, toggleable collector.

### `accounts`

Provides job statistics aggregated by Slurm account.

- Command: `squeue -a -r -h -o "%A|%a|%T|%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_account_jobs_pending` | Pending jobs for account | `account` |
| `slurm_account_jobs_running` | Running jobs for account | `account` |
| `slurm_account_cpus_running` | Running CPUs for account | `account` |
| `slurm_account_jobs_suspended` | Suspended jobs for account | `account` |
### `cpus`

Provides global statistics on CPU states for the entire cluster.

- Command: `sinfo -h -o "%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_cpus_alloc` | Allocated CPUs | (none) |
| `slurm_cpus_idle` | Idle CPUs | (none) |
| `slurm_cpus_other` | Mix CPUs | (none) |
| `slurm_cpus_total` | Total CPUs | (none) |
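These gauges make cluster-wide utilization a one-line PromQL expression. An illustrative sketch:

```promql
# Fraction of cluster CPUs currently allocated
slurm_cpus_alloc / slurm_cpus_total
```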
### `fairshare`

Reports the calculated fairshare factor for each account.

- Command: `sshare -n -P -o "account,fairshare"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_account_fairshare` | FairShare for account | `account` |
### `gpus`

Provides global statistics on GPU states for the entire cluster.

> ⚠️ Note: this collector is enabled by default. Disable it with `--no-collector.gpus` if not needed.

- Command: `sinfo` (with various output formats)

| Metric | Description | Labels |
|---|---|---|
| `slurm_gpus_alloc` | Allocated GPUs | (none) |
| `slurm_gpus_idle` | Idle GPUs | (none) |
| `slurm_gpus_other` | Other GPUs | (none) |
| `slurm_gpus_total` | Total GPUs | (none) |
| `slurm_gpus_utilization` | Total GPU utilization | (none) |
### `info`

Exposes the version of Slurm and the availability of the different Slurm binaries.

- Command: `<binary> --version`

| Metric | Description | Labels |
|---|---|---|
| `slurm_info` | Information on Slurm version and binaries | `type`, `binary`, `version` |
### `node`

Provides detailed, per-node metrics for CPU and memory usage.

- Command: `sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong,Partition"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_node_cpu_alloc` | Allocated CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_idle` | Idle CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_other` | Other CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_total` | Total CPUs per node | `node`, `status`, `partition` |
| `slurm_node_mem_alloc` | Allocated memory per node | `node`, `status`, `partition` |
| `slurm_node_mem_total` | Total memory per node | `node`, `status`, `partition` |
| `slurm_node_status` | Node status with partition (1 if up) | `node`, `status`, `partition` |
### `nodes`

Provides aggregated metrics on node states for the cluster.

- Commands: `sinfo -h -o "%D|%T|%b"`, `scontrol show nodes -o`

| Metric | Description | Labels |
|---|---|---|
| `slurm_nodes_alloc` | Allocated nodes | `partition`, `active_feature_set` |
| `slurm_nodes_comp` | Completing nodes | `partition`, `active_feature_set` |
| `slurm_nodes_down` | Down nodes | `partition`, `active_feature_set` |
| `slurm_nodes_drain` | Draining nodes | `partition`, `active_feature_set` |
| `slurm_nodes_err` | Error nodes | `partition`, `active_feature_set` |
| `slurm_nodes_fail` | Failed nodes | `partition`, `active_feature_set` |
| `slurm_nodes_idle` | Idle nodes | `partition`, `active_feature_set` |
| `slurm_nodes_maint` | Maintenance nodes | `partition`, `active_feature_set` |
| `slurm_nodes_mix` | Mixed-state nodes | `partition`, `active_feature_set` |
| `slurm_nodes_resv` | Reserved nodes | `partition`, `active_feature_set` |
| `slurm_nodes_other` | Nodes reported with an unknown state | `partition`, `active_feature_set` |
| `slurm_nodes_planned` | Planned nodes | `partition`, `active_feature_set` |
| `slurm_nodes_total` | Total number of nodes | (none) |
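Because the state gauges share the `partition` label, per-partition health checks are straightforward. An illustrative query:

```promql
# Down nodes per partition, aggregated across feature sets
sum by (partition) (slurm_nodes_down)
```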
### `partitions`

Provides metrics on CPU usage and pending jobs for each partition.

- Commands: `sinfo -h -o "%R,%C"`, `squeue -a -r -h -o "%P" --states=PENDING`

| Metric | Description | Labels |
|---|---|---|
| `slurm_partition_cpus_allocated` | Allocated CPUs for partition | `partition` |
| `slurm_partition_cpus_idle` | Idle CPUs for partition | `partition` |
| `slurm_partition_cpus_other` | Other CPUs for partition | `partition` |
| `slurm_partition_jobs_pending` | Pending jobs for partition | `partition` |
| `slurm_partition_cpus_total` | Total CPUs for partition | `partition` |
### `queue`

Provides detailed metrics on job states and resource usage.

- Command: `squeue -h -o "%P,%T,%C,%r,%u"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_queue_pending` | Pending jobs in queue | `user`, `partition`, `reason` |
| `slurm_queue_running` | Running jobs in the cluster | `user`, `partition` |
| `slurm_queue_suspended` | Suspended jobs in the cluster | `user`, `partition` |
| `slurm_cores_pending` | Pending cores in queue | `user`, `partition`, `reason` |
| `slurm_cores_running` | Running cores in the cluster | `user`, `partition` |
| ... | (and many other states: `completed`, `failed`, etc.) | `user`, `partition` |
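The `user`, `partition`, and `reason` labels support fine-grained aggregation. Two illustrative queries:

```promql
# Pending jobs per partition
sum by (partition) (slurm_queue_pending)

# Top reasons jobs are stuck pending
topk(5, sum by (reason) (slurm_queue_pending))
```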
### `reservations`

Provides metrics about active Slurm reservations.

- Command: `scontrol show reservation`

| Metric | Description | Labels |
|---|---|---|
| `slurm_reservation_info` | A metric with a constant '1' value labeled by reservation details | `reservation_name`, `state`, `users`, `nodes`, `partition`, `flags` |
| `slurm_reservation_start_time_seconds` | Start time of the reservation in seconds since the Unix epoch | `reservation_name` |
| `slurm_reservation_end_time_seconds` | End time of the reservation in seconds since the Unix epoch | `reservation_name` |
| `slurm_reservation_node_count` | Number of nodes allocated to the reservation | `reservation_name` |
| `slurm_reservation_core_count` | Number of cores allocated to the reservation | `reservation_name` |
### `scheduler`

Provides internal performance metrics from the `slurmctld` daemon.

- Command: `sdiag`

| Metric | Description | Labels |
|---|---|---|
| `slurm_scheduler_threads` | Number of scheduler threads | (none) |
| `slurm_scheduler_queue_size` | Length of the scheduler queue | (none) |
| `slurm_scheduler_mean_cycle` | Scheduler mean cycle time (microseconds) | (none) |
| `slurm_rpc_stats` | RPC count statistic | `operation` |
| `slurm_user_rpc_stats` | RPC count statistic per user | `user` |
| ... | (and many other backfill and RPC time metrics) | `operation` or `user` |
### `users`

Provides job statistics aggregated by user.

- Command: `squeue -a -r -h -o "%A|%u|%T|%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_user_jobs_pending` | Pending jobs for user | `user` |
| `slurm_user_jobs_running` | Running jobs for user | `user` |
| `slurm_user_cpus_running` | Running CPUs for user | `user` |
| `slurm_user_jobs_suspended` | Suspended jobs for user | `user` |
## Prometheus configuration

```yaml
scrape_configs:
  - job_name: 'slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm_host.fqdn:9341']
```

- `scrape_interval`: a 30s interval is recommended to avoid overloading the Slurm master with frequent command executions.
- `scrape_timeout`: should be equal to or less than the `scrape_interval` to prevent `context_deadline_exceeded` errors.
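If the exporter is served behind TLS and Basic Auth (see the `web-config.yml` sketch above), the scrape job needs matching client settings. A sketch, with the username and password file as placeholders:

```yaml
scrape_configs:
  - job_name: 'slurm_exporter'
    scheme: https
    basic_auth:
      username: prometheus
      # Plaintext password stored out-of-band; must match the bcrypt hash on the exporter side.
      password_file: /etc/prometheus/slurm_exporter.pass
    static_configs:
      - targets: ['slurm_host.fqdn:9341']
```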
Check the configuration:

```bash
promtool check config prometheus.yml
```
## Performance tips

- Command timeout: the default timeout is 5 seconds. Increase it if Slurm commands take longer in your environment:

  ```bash
  ./slurm_exporter --command.timeout=10s
  ```

- Scrape interval: use at least 30 seconds to avoid overloading the Slurm controller with frequent command executions.

- Collector selection: disable unused collectors to reduce load and improve performance:

  ```bash
  ./slurm_exporter --no-collector.fairshare --no-collector.reservations
  ```
## Grafana dashboard

A ready-to-use Grafana dashboard is available for this exporter.
## License

This project is licensed under the GNU General Public License, version 3 or later.

## Credits

This project is a fork of cea-hpc/slurm_exporter, which is itself a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).

Feel free to contribute or open issues!