# Prometheus Slurm Exporter

Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.

## Features

- Exports a wide range of metrics from Slurm, including nodes, partitions, jobs, CPUs, and GPUs.
- All metric collectors are optional and can be enabled/disabled via flags.
- Supports TLS and Basic Authentication for secure connections.
- Ready-to-use Grafana dashboard.
## Installation

There are two recommended ways to install the Slurm Exporter.

### From a release binary

This is the easiest method for most users.

1. Download the latest release for your OS and architecture from the GitHub Releases page.

2. Place the `slurm_exporter` binary in a suitable location on a node with Slurm CLI access, such as `/usr/local/bin/`.

3. Ensure the binary is executable:

   ```bash
   chmod +x /usr/local/bin/slurm_exporter
   ```

4. (Optional) To run the exporter as a service, adapt the example systemd unit file provided in this repository at `systemd/slurm_exporter.service`. Copy it to `/etc/systemd/system/slurm_exporter.service` and customize it for your environment (especially the `ExecStart` path); a minimal sketch is shown after this list.

5. Reload the systemd daemon, then enable and start the service:

   ```bash
   sudo systemctl daemon-reload
   sudo systemctl enable slurm_exporter
   sudo systemctl start slurm_exporter
   ```
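For reference, a minimal unit file might look like the following. This is a sketch, not the file shipped in the repository: the `User` account and the `ExecStart` flags are assumptions to adapt to your site.

```ini
[Unit]
Description=Prometheus Slurm Exporter
After=network-online.target

[Service]
# Hypothetical unprivileged service account; create it or substitute an existing one.
User=slurm_exporter
ExecStart=/usr/local/bin/slurm_exporter --web.listen-address=:9341
Restart=on-failure

[Install]
WantedBy=multi-user.target
```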
### Building from source

If you want to build the exporter yourself, you can do so using the provided Makefile.

1. Clone the repository:

   ```bash
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the binary:

   ```bash
   make build
   ```

The new binary will be available at `bin/slurm_exporter`. You can then copy it to a location like `/usr/local/bin/` and set up the systemd service as described in the section above.
## Configuration

The exporter is configured using command-line flags.

Basic execution:

```bash
./slurm_exporter --web.listen-address=":9341"
```

Using a configuration file for web settings (TLS/Basic Auth):

```bash
./slurm_exporter --web.config.file=/path/to/web-config.yml
```

For details on the `web-config.yml` format, see the Exporter Toolkit documentation.
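A minimal `web-config.yml` sketch in the Exporter Toolkit format, assuming you have a certificate/key pair and want a single Basic Auth user (the file paths and the bcrypt hash are placeholders):

```yaml
tls_server_config:
  cert_file: /etc/slurm_exporter/server.crt
  key_file: /etc/slurm_exporter/server.key
basic_auth_users:
  # Value must be a bcrypt hash of the password
  # (e.g. generated with: htpasswd -nBC 10 prometheus).
  prometheus: $2y$10$REPLACE_WITH_A_REAL_BCRYPT_HASH
```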
View help and all available options:

```bash
./slurm_exporter --help
```
| Flag | Description | Default |
|---|---|---|
| `--web.listen-address` | Address to listen on for web interface and telemetry | `:9341` |
| `--web.config.file` | Path to configuration file for TLS/Basic Auth | (none) |
| `--command.timeout` | Timeout for executing Slurm commands | `5s` |
| `--log.level` | Log level: `debug`, `info`, `warn`, `error` | `info` |
| `--log.format` | Log format: `json`, `text` | `text` |
| `--collector.<name>` | Enable the specified collector | `true` (all enabled by default) |
| `--no-collector.<name>` | Disable the specified collector | (none) |
## Collectors

Available collectors: `accounts`, `cpus`, `fairshare`, `gpus`, `info`, `node`, `nodes`, `partitions`, `queue`, `reservations`, `scheduler`, `users`.

By default, all collectors are enabled. You can control which collectors are active using the `--collector.<name>` and `--no-collector.<name>` flags.
Example: disable the `scheduler` and `partitions` collectors:

```bash
./slurm_exporter --no-collector.scheduler --no-collector.partitions
```

Example: disable the `gpus` collector:

```bash
./slurm_exporter --no-collector.gpus
```

Example: run only the `nodes` and `cpus` collectors. This requires disabling all other collectors individually:

```bash
./slurm_exporter \
  --no-collector.accounts \
  --no-collector.fairshare \
  --no-collector.gpus \
  --no-collector.node \
  --no-collector.partitions \
  --no-collector.queue \
  --no-collector.reservations \
  --no-collector.scheduler \
  --no-collector.info \
  --no-collector.users
```
Example: custom timeout and logging:

```bash
./slurm_exporter \
  --command.timeout=10s \
  --log.level=debug \
  --log.format=json
```
## Development

This project requires access to a node with the Slurm CLI (`sinfo`, `squeue`, `sdiag`, etc.).

Requirements:

- Go (version 1.22 or higher recommended)
- Slurm CLI tools available in your `$PATH`

1. Clone this repository:

   ```bash
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the exporter binary:

   ```bash
   make build
   ```

The binary will be available in `bin/slurm_exporter`.
To run all tests:

```bash
make test
```

Clean build artifacts:

```bash
make clean
```

Run the exporter locally:

```bash
bin/slurm_exporter --web.listen-address=:8080
```

Query metrics:

```bash
curl http://localhost:8080/metrics
```
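The response uses the standard Prometheus exposition format. As an illustration only (the values are hypothetical), the `cpus` collector output looks roughly like:

```text
# HELP slurm_cpus_total Total CPUs
# TYPE slurm_cpus_total gauge
slurm_cpus_total 1024
# HELP slurm_cpus_alloc Allocated CPUs
# TYPE slurm_cpus_alloc gauge
slurm_cpus_alloc 768
```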
Advanced build options: you can override the Go version and architecture via environment variables:

```bash
make build GO_VERSION=1.22.2 OS=linux ARCH=amd64
```
## Metrics

The exporter provides a wide range of metrics, each collected by a specific, toggleable collector.

### `accounts`

Provides job statistics aggregated by Slurm account.

- Command: `squeue -a -r -h -o "%A|%a|%T|%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_account_jobs_pending` | Pending jobs for account | `account` |
| `slurm_account_jobs_running` | Running jobs for account | `account` |
| `slurm_account_cpus_running` | Running CPUs for account | `account` |
| `slurm_account_jobs_suspended` | Suspended jobs for account | `account` |
### `cpus`

Provides global statistics on CPU states for the entire cluster.

- Command: `sinfo -h -o "%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_cpus_alloc` | Allocated CPUs | (none) |
| `slurm_cpus_idle` | Idle CPUs | (none) |
| `slurm_cpus_other` | Mix CPUs | (none) |
| `slurm_cpus_total` | Total CPUs | (none) |
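These gauges make cluster-wide utilization a one-line PromQL expression. An illustrative sketch:

```promql
# Fraction of cluster CPUs currently allocated
slurm_cpus_alloc / slurm_cpus_total
```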
### `fairshare`

Reports the calculated fairshare factor for each account.

- Command: `sshare -n -P -o "account,fairshare"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_account_fairshare` | FairShare for account | `account` |
### `gpus`

Provides global statistics on GPU states for the entire cluster.

> ⚠️ Note: this collector is enabled by default. Disable it with `--no-collector.gpus` if not needed.

- Command: `sinfo` (with various output formats)

| Metric | Description | Labels |
|---|---|---|
| `slurm_gpus_alloc` | Allocated GPUs | (none) |
| `slurm_gpus_idle` | Idle GPUs | (none) |
| `slurm_gpus_other` | Other GPUs | (none) |
| `slurm_gpus_total` | Total GPUs | (none) |
| `slurm_gpus_utilization` | Total GPU utilization | (none) |
### `info`

Exposes the version of Slurm and the availability of the different Slurm binaries.

- Command: `<binary> --version`

| Metric | Description | Labels |
|---|---|---|
| `slurm_info` | Information on Slurm version and binaries | `type`, `binary`, `version` |
### `node`

Provides detailed, per-node metrics for CPU and memory usage.

- Command: `sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong,Partition"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_node_cpu_alloc` | Allocated CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_idle` | Idle CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_other` | Other CPUs per node | `node`, `status`, `partition` |
| `slurm_node_cpu_total` | Total CPUs per node | `node`, `status`, `partition` |
| `slurm_node_mem_alloc` | Allocated memory per node | `node`, `status`, `partition` |
| `slurm_node_mem_total` | Total memory per node | `node`, `status`, `partition` |
| `slurm_node_status` | Node status with partition (1 if up) | `node`, `status`, `partition` |
### `nodes`

Provides aggregated metrics on node states for the cluster.

- Commands: `sinfo -h -o "%D|%T|%b"`, `scontrol show nodes -o`

| Metric | Description | Labels |
|---|---|---|
| `slurm_nodes_alloc` | Allocated nodes | `partition`, `active_feature_set` |
| `slurm_nodes_comp` | Completing nodes | `partition`, `active_feature_set` |
| `slurm_nodes_down` | Down nodes | `partition`, `active_feature_set` |
| `slurm_nodes_drain` | Draining nodes | `partition`, `active_feature_set` |
| `slurm_nodes_err` | Error nodes | `partition`, `active_feature_set` |
| `slurm_nodes_fail` | Failed nodes | `partition`, `active_feature_set` |
| `slurm_nodes_idle` | Idle nodes | `partition`, `active_feature_set` |
| `slurm_nodes_maint` | Maintenance nodes | `partition`, `active_feature_set` |
| `slurm_nodes_mix` | Mixed-state nodes | `partition`, `active_feature_set` |
| `slurm_nodes_resv` | Reserved nodes | `partition`, `active_feature_set` |
| `slurm_nodes_other` | Nodes reported with an unknown state | `partition`, `active_feature_set` |
| `slurm_nodes_planned` | Planned nodes | `partition`, `active_feature_set` |
| `slurm_nodes_total` | Total number of nodes | (none) |
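Because the state gauges share the `partition` label, per-partition health checks are straightforward. An illustrative query:

```promql
# Down nodes per partition, aggregated across feature sets
sum by (partition) (slurm_nodes_down)
```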
### `partitions`

Provides metrics on CPU usage and pending jobs for each partition.

- Commands: `sinfo -h -o "%R,%C"`, `squeue -a -r -h -o "%P" --states=PENDING`

| Metric | Description | Labels |
|---|---|---|
| `slurm_partition_cpus_allocated` | Allocated CPUs for partition | `partition` |
| `slurm_partition_cpus_idle` | Idle CPUs for partition | `partition` |
| `slurm_partition_cpus_other` | Other CPUs for partition | `partition` |
| `slurm_partition_jobs_pending` | Pending jobs for partition | `partition` |
| `slurm_partition_cpus_total` | Total CPUs for partition | `partition` |
### `queue`

Provides detailed metrics on job states and resource usage.

- Command: `squeue -h -o "%P,%T,%C,%r,%u"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_queue_pending` | Pending jobs in queue | `user`, `partition`, `reason` |
| `slurm_queue_running` | Running jobs in the cluster | `user`, `partition` |
| `slurm_queue_suspended` | Suspended jobs in the cluster | `user`, `partition` |
| `slurm_cores_pending` | Pending cores in queue | `user`, `partition`, `reason` |
| `slurm_cores_running` | Running cores in the cluster | `user`, `partition` |
| ... | (and many other states: `completed`, `failed`, etc.) | `user`, `partition` |
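The `user`, `partition`, and `reason` labels support fine-grained aggregation. Two illustrative queries:

```promql
# Pending jobs per partition
sum by (partition) (slurm_queue_pending)

# Top reasons jobs are stuck pending
topk(5, sum by (reason) (slurm_queue_pending))
```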
### `reservations`

Provides metrics about active Slurm reservations.

- Command: `scontrol show reservation`

| Metric | Description | Labels |
|---|---|---|
| `slurm_reservation_info` | A metric with a constant '1' value labeled by reservation details | `reservation_name`, `state`, `users`, `nodes`, `partition`, `flags` |
| `slurm_reservation_start_time_seconds` | Start time of the reservation in seconds since the Unix epoch | `reservation_name` |
| `slurm_reservation_end_time_seconds` | End time of the reservation in seconds since the Unix epoch | `reservation_name` |
| `slurm_reservation_node_count` | Number of nodes allocated to the reservation | `reservation_name` |
| `slurm_reservation_core_count` | Number of cores allocated to the reservation | `reservation_name` |
### `scheduler`

Provides internal performance metrics from the `slurmctld` daemon.

- Command: `sdiag`

| Metric | Description | Labels |
|---|---|---|
| `slurm_scheduler_threads` | Number of scheduler threads | (none) |
| `slurm_scheduler_queue_size` | Length of the scheduler queue | (none) |
| `slurm_scheduler_mean_cycle` | Scheduler mean cycle time (microseconds) | (none) |
| `slurm_rpc_stats` | RPC count statistic | `operation` |
| `slurm_user_rpc_stats` | RPC count statistic per user | `user` |
| ... | (and many other backfill and RPC time metrics) | `operation` or `user` |
### `users`

Provides job statistics aggregated by user.

- Command: `squeue -a -r -h -o "%A|%u|%T|%C"`

| Metric | Description | Labels |
|---|---|---|
| `slurm_user_jobs_pending` | Pending jobs for user | `user` |
| `slurm_user_jobs_running` | Running jobs for user | `user` |
| `slurm_user_cpus_running` | Running CPUs for user | `user` |
| `slurm_user_jobs_suspended` | Suspended jobs for user | `user` |
## Prometheus configuration

```yaml
scrape_configs:
  - job_name: 'slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm_host.fqdn:9341']
```

- `scrape_interval`: a 30s interval is recommended to avoid overloading the Slurm master with frequent command executions.
- `scrape_timeout`: should be equal to or less than the `scrape_interval` to prevent `context_deadline_exceeded` errors.
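If the exporter is served behind TLS and Basic Auth (see the `web-config.yml` sketch above), the scrape job needs matching client settings. A sketch, with the username and password file as placeholders:

```yaml
scrape_configs:
  - job_name: 'slurm_exporter'
    scheme: https
    basic_auth:
      username: prometheus
      # Plaintext password stored out-of-band; must match the bcrypt hash on the exporter side.
      password_file: /etc/prometheus/slurm_exporter.pass
    static_configs:
      - targets: ['slurm_host.fqdn:9341']
```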
Check the configuration:

```bash
promtool check config prometheus.yml
```
## Performance tips

- Command timeout: the default timeout is 5 seconds. Increase it if Slurm commands take longer in your environment:

  ```bash
  ./slurm_exporter --command.timeout=10s
  ```

- Scrape interval: use at least 30 seconds to avoid overloading the Slurm controller with frequent command executions.

- Collector selection: disable unused collectors to reduce load and improve performance:

  ```bash
  ./slurm_exporter --no-collector.fairshare --no-collector.reservations
  ```
## Grafana dashboard

A ready-to-use Grafana dashboard is available for this exporter.
## License

This project is licensed under the GNU General Public License, version 3 or later.

## Credits

This project is a fork of cea-hpc/slurm_exporter, which is itself a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).

Feel free to contribute or open issues!