
Prometheus Slurm Exporter 🚀

Release Create Dev Release GitHub release (latest by date) Go Report Card License: GPL v3

Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system. It supports CPU and GPU resource accounting, node and partition state monitoring, job tracking, and scheduler statistics.

📋 Table of Contents

  • ✨ Features
  • 📦 Installation
  • ⚙️ Usage
  • 🛠️ Development
  • 📊 Metrics
  • 📡 Prometheus Configuration
  • 📈 Grafana Dashboard
  • 📜 License
  • 🍴 About this fork

✨ Features

  • ✅ Exports a wide range of metrics from Slurm, including nodes, partitions, jobs, CPUs, and GPUs.
  • ✅ All metric collectors are optional and can be enabled/disabled via flags.
  • ✅ Supports TLS and Basic Authentication for secure connections.
  • ✅ Ready-to-use Grafana dashboard.

📦 Installation

There are two recommended ways to install the Slurm Exporter.

1. From Pre-compiled Releases

This is the easiest method for most users.

  1. Download the latest release for your OS and architecture from the GitHub Releases page. 📥

  2. Place the slurm_exporter binary in a suitable location on a node with Slurm CLI access, such as /usr/local/bin/.

  3. Ensure the binary is executable:

    chmod +x /usr/local/bin/slurm_exporter
  4. (Optional) To run the exporter as a service, you can adapt the example Systemd unit file provided in this repository at systemd/slurm_exporter.service.

    • Copy it to /etc/systemd/system/slurm_exporter.service and customize it for your environment (especially the ExecStart path).

    • Reload the Systemd daemon, then enable and start the service:

      sudo systemctl daemon-reload
      sudo systemctl enable slurm_exporter
      sudo systemctl start slurm_exporter
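
Once the service is up, a quick sanity check is to query the metrics endpoint locally (a sketch, assuming the default listen address :9341):

  systemctl status slurm_exporter
  curl -s http://localhost:9341/metrics | head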

2. From Source

If you want to build the exporter yourself, you can do so using the provided Makefile. 👩‍💻

  1. Clone the repository:

    git clone https://github.com/sckyzo/slurm_exporter.git
    cd slurm_exporter
  2. Build the binary:

    make build
  3. The new binary will be available at bin/slurm_exporter. You can then copy it to a location like /usr/local/bin/ and set up the Systemd service as described in the section above.
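
For example, a minimal install step (assuming /usr/local/bin is the target location, as above):

  sudo install -m 0755 bin/slurm_exporter /usr/local/bin/slurm_exporter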


βš™οΈ Usage

The exporter can be configured using command-line flags.

Basic execution:

./slurm_exporter --web.listen-address=":9341"

Using a configuration file for web settings (TLS/Basic Auth):

./slurm_exporter --web.config.file=/path/to/web-config.yml

For details on the web-config.yml format, see the Exporter Toolkit documentation.
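
To verify a TLS/Basic Auth protected endpoint, a hedged check (the user name, password, and CA file below are placeholders for whatever your web-config.yml actually defines):

  curl --user myuser:mypassword --cacert /path/to/ca.crt https://localhost:9341/metrics | head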

View help and all available options:

./slurm_exporter --help

Command-Line Options

| Flag | Description | Default |
|------|-------------|---------|
| --web.listen-address | Address to listen on for web interface and telemetry | :9341 |
| --web.config.file | Path to configuration file for TLS/Basic Auth | (none) |
| --command.timeout | Timeout for executing Slurm commands | 5s |
| --log.level | Log level: debug, info, warn, error | info |
| --log.format | Log format: json, text | text |
| --collector.<name> | Enable the specified collector | true (all enabled by default) |
| --no-collector.<name> | Disable the specified collector | (none) |

Available collectors: accounts, cpus, fairshare, gpus, info, node, nodes, partitions, queue, reservations, scheduler, users

Enabling and Disabling Collectors

By default, all collectors are enabled.

You can control which collectors are active using the --collector.<name> and --no-collector.<name> flags.

Example: Disable the scheduler and partitions collectors

./slurm_exporter --no-collector.scheduler --no-collector.partitions

Example: Disable the gpus collector

./slurm_exporter --no-collector.gpus

Example: Run only the nodes and cpus collectors

This requires disabling all other collectors individually.

./slurm_exporter \
  --no-collector.accounts \
  --no-collector.fairshare \
  --no-collector.gpus \
  --no-collector.node \
  --no-collector.partitions \
  --no-collector.queue \
  --no-collector.reservations \
  --no-collector.scheduler \
  --no-collector.info \
  --no-collector.users

Example: Custom timeout and logging

./slurm_exporter \
  --command.timeout=10s \
  --log.level=debug \
  --log.format=json

πŸ› οΈ Development

This project requires access to a node with the Slurm CLI (sinfo, squeue, sdiag, etc.).

Prerequisites

  • Go (version 1.22 or higher recommended)
  • Slurm CLI tools available in your $PATH

Building from Source

  1. Clone this repository:

    git clone https://github.com/sckyzo/slurm_exporter.git
    cd slurm_exporter
  2. Build the exporter binary:

    make build

    The binary will be available in bin/slurm_exporter.

Running Tests

To run all tests:

make test

Development Commands

Clean build artifacts:

make clean

Run the exporter locally:

bin/slurm_exporter --web.listen-address=:8080

Query metrics:

curl http://localhost:8080/metrics

Advanced build options: You can override the Go version and architecture via environment variables:

make build GO_VERSION=1.22.2 OS=linux ARCH=amd64

📊 Metrics

The exporter provides a wide range of metrics, each collected by a specific, toggleable collector.

accounts Collector

Provides job statistics aggregated by Slurm account.

  • Command: squeue -a -r -h -o "%A|%a|%T|%C"

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_account_jobs_pending | Pending jobs for account | account |
| slurm_account_jobs_running | Running jobs for account | account |
| slurm_account_cpus_running | Running cpus for account | account |
| slurm_account_jobs_suspended | Suspended jobs for account | account |
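
To spot-check what a collector exposes, you can grep the metrics endpoint for its prefix (assuming the exporter is running locally on the default port), for example for the accounts collector:

  curl -s http://localhost:9341/metrics | grep '^slurm_account_'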

cpus Collector

Provides global statistics on CPU states for the entire cluster.

  • Command: sinfo -h -o "%C"

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_cpus_alloc | Allocated CPUs | (none) |
| slurm_cpus_idle | Idle CPUs | (none) |
| slurm_cpus_other | Mix CPUs | (none) |
| slurm_cpus_total | Total CPUs | (none) |

fairshare Collector

Reports the calculated fairshare factor for each account.

  • Command: sshare -n -P -o "account,fairshare"

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_account_fairshare | FairShare for account | account |

gpus Collector

Provides global statistics on GPU states for the entire cluster.

⚠️ Note: This collector is enabled by default. Disable it with --no-collector.gpus if not needed.

  • Command: sinfo (with various formats)

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_gpus_alloc | Allocated GPUs | (none) |
| slurm_gpus_idle | Idle GPUs | (none) |
| slurm_gpus_other | Other GPUs | (none) |
| slurm_gpus_total | Total GPUs | (none) |
| slurm_gpus_utilization | Total GPU utilization | (none) |

info Collector

Exposes the version of Slurm and the availability of different Slurm binaries.

  • Command: <binary> --version

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_info | Information on Slurm version and binaries | type, binary, version |

node Collector

Provides detailed, per-node metrics for CPU and memory usage.

  • Command: sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong,Partition"

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_node_cpu_alloc | Allocated CPUs per node | node, status, partition |
| slurm_node_cpu_idle | Idle CPUs per node | node, status, partition |
| slurm_node_cpu_other | Other CPUs per node | node, status, partition |
| slurm_node_cpu_total | Total CPUs per node | node, status, partition |
| slurm_node_mem_alloc | Allocated memory per node | node, status, partition |
| slurm_node_mem_total | Total memory per node | node, status, partition |
| slurm_node_status | Node status with partition (1 if up) | node, status, partition |

nodes Collector

Provides aggregated metrics on node states for the cluster.

  • Commands: sinfo -h -o "%D|%T|%b", scontrol show nodes -o

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_nodes_alloc | Allocated nodes | partition, active_feature_set |
| slurm_nodes_comp | Completing nodes | partition, active_feature_set |
| slurm_nodes_down | Down nodes | partition, active_feature_set |
| slurm_nodes_drain | Drain nodes | partition, active_feature_set |
| slurm_nodes_err | Error nodes | partition, active_feature_set |
| slurm_nodes_fail | Fail nodes | partition, active_feature_set |
| slurm_nodes_idle | Idle nodes | partition, active_feature_set |
| slurm_nodes_maint | Maint nodes | partition, active_feature_set |
| slurm_nodes_mix | Mix nodes | partition, active_feature_set |
| slurm_nodes_resv | Reserved nodes | partition, active_feature_set |
| slurm_nodes_other | Nodes reported with an unknown state | partition, active_feature_set |
| slurm_nodes_planned | Planned nodes | partition, active_feature_set |
| slurm_nodes_total | Total number of nodes | (none) |

partitions Collector

Provides metrics on CPU usage and pending jobs for each partition.

  • Commands: sinfo -h -o "%R,%C", squeue -a -r -h -o "%P" --states=PENDING

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_partition_cpus_allocated | Allocated CPUs for partition | partition |
| slurm_partition_cpus_idle | Idle CPUs for partition | partition |
| slurm_partition_cpus_other | Other CPUs for partition | partition |
| slurm_partition_jobs_pending | Pending jobs for partition | partition |
| slurm_partition_cpus_total | Total CPUs for partition | partition |

queue Collector

Provides detailed metrics on job states and resource usage.

  • Command: squeue -h -o "%P,%T,%C,%r,%u"

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_queue_pending | Pending jobs in queue | user, partition, reason |
| slurm_queue_running | Running jobs in the cluster | user, partition |
| slurm_queue_suspended | Suspended jobs in the cluster | user, partition |
| slurm_cores_pending | Pending cores in queue | user, partition, reason |
| slurm_cores_running | Running cores in the cluster | user, partition |
| ... | (and many other states: completed, failed, etc.) | user, partition |

reservations Collector

Provides metrics about active Slurm reservations.

  • Command: scontrol show reservation

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_reservation_info | A metric with a constant '1' value labeled by reservation details | reservation_name, state, users, nodes, partition, flags |
| slurm_reservation_start_time_seconds | The start time of the reservation in seconds since the Unix epoch | reservation_name |
| slurm_reservation_end_time_seconds | The end time of the reservation in seconds since the Unix epoch | reservation_name |
| slurm_reservation_node_count | The number of nodes allocated to the reservation | reservation_name |
| slurm_reservation_core_count | The number of cores allocated to the reservation | reservation_name |

scheduler Collector

Provides internal performance metrics from the slurmctld daemon.

  • Command: sdiag

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_scheduler_threads | Number of scheduler threads | (none) |
| slurm_scheduler_queue_size | Length of the scheduler queue | (none) |
| slurm_scheduler_mean_cycle | Scheduler mean cycle time (microseconds) | (none) |
| slurm_rpc_stats | RPC count statistic | operation |
| slurm_user_rpc_stats | RPC count statistic per user | user |
| ... | (and many other backfill and RPC time metrics) | operation or user |
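
If these metrics come back empty, a useful troubleshooting step is to run the underlying command directly as the user the exporter runs as, to confirm it can reach slurmctld:

  sdiag | head -n 20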

users Collector

Provides job statistics aggregated by user.

  • Command: squeue -a -r -h -o "%A|%u|%T|%C"

| Metric | Description | Labels |
|--------|-------------|--------|
| slurm_user_jobs_pending | Pending jobs for user | user |
| slurm_user_jobs_running | Running jobs for user | user |
| slurm_user_cpus_running | Running cpus for user | user |
| slurm_user_jobs_suspended | Suspended jobs for user | user |

📡 Prometheus Configuration

scrape_configs:
  - job_name: 'slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm_host.fqdn:9341']
  • scrape_interval: A 30s interval is recommended to avoid overloading the Slurm master with frequent command executions.
  • scrape_timeout: Should be equal to or less than the scrape_interval to prevent context_deadline_exceeded errors.

Check config:

promtool check config prometheus.yml
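
Once Prometheus has reloaded, you can confirm the target is being scraped via the HTTP API (a sketch, assuming Prometheus itself listens on localhost:9090):

  curl -s 'http://localhost:9090/api/v1/targets?state=active'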

Performance Considerations

  • Command Timeout: The default timeout is 5 seconds. Increase it if Slurm commands take longer in your environment:

    ./slurm_exporter --command.timeout=10s
  • Scrape Interval: Use at least 30 seconds to avoid overloading the Slurm controller with frequent command executions.

  • Collector Selection: Disable unused collectors to reduce load and improve performance:

    ./slurm_exporter --no-collector.fairshare --no-collector.reservations

📈 Grafana Dashboard

A Grafana dashboard is available, with panels covering node status, job status, and scheduler info.


📜 License

This project is licensed under the GNU General Public License, version 3 or later.

Buy Me a Coffee


🍴 About this fork

This project is a fork of cea-hpc/slurm_exporter, which itself is a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).

Feel free to contribute or open issues!
