Conversation

@SquidDev

Description

This is a start at implementing #33532. I'm trying to do this in relatively minor increments to keep it reviewable, so this only adds support for reporting units' active state.

This follows a similar approach to the httpcheck receiver: there is an attribute value for every possible state, with the data point for the current state set to 1 and the rest set to 0.

systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="active"} = 1
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="reloading"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="inactive"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="failed"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="activating"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="deactivating"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="maintenance"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="refreshing"} = 0
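
For illustration, a minimal sketch of the recording loop behind this (the names unit, record, and now are hypothetical stand-ins, not the generated mdatagen code):

    // Sketch only: `unit` and `record` stand in for the receiver's dbus unit
    // data and the generated recording function.
    possibleStates := []string{
        "active", "reloading", "inactive", "failed",
        "activating", "deactivating", "maintenance", "refreshing",
    }
    for _, state := range possibleStates {
        var value int64
        if state == unit.ActiveState {
            value = 1
        }
        // Emits systemd.unit.state{systemd.unit.active_state=state} = value.
        record(now, value, state)
    }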

Link to tracking issue

Part of #33532, but possibly not enough to mark as fixed.

Testing

I have done some manual testing and confirmed this generates the metrics I expected. Aside from the auto-generated tests, I'm afraid I've not written any unit/integration tests; I wanted some advice here first about whether it is better to mock out the dbus interface or to run the tests against a real systemd instance.

Documentation

I've updated the README of the systemdreceiver component to mention the metric exposed, and the configuration options.

@SquidDev SquidDev requested review from a team and atoulme as code owners August 17, 2025 17:48
@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 17, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@SquidDev
Author

Sorry about those bugs! I'm afraid this is very much baby's first Go code — I thought I'd been so careful about running the make tasks locally, but clearly not!

@SquidDev
Author

Have rebased on top of master, and tried to stub out the dbus connection, so we can write some tests for this. Coverage is now at the requested 80% :).

I'm aware this PR is now well over the requested 500 lines. The majority of this is generated code (or golden test data), so hopefully it shouldn't be too bad to review, but let me know if there's a way to split this up further.

@nichenke

Following along, definitely interested. I'm poking at the cgroups side of things locally; I'm not sure whether we need to vendor in cadvisor like awscontainerinsightreceiver does, or whether the simpler podman/k8s/docker stats approaches will work.

@SquidDev
Author

SquidDev commented Aug 27, 2025

For cgroups, I'm currently just using https://github.com/containerd/cgroups/ directly. This does make supporting cgroups v1 and v2 a bit more awkward, though; I'm currently just focusing on the latter.

I'd actually been drawing more inspiration from the host metrics receiver (as much as I can; sadly cgroups has a different model for a lot of things). So CPU usage mode (kernel vs user) is an attribute, rather than there being separate container.cpu.usage.{kernel,user}mode metrics.
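
Roughly the difference between the two shapes, in the same notation as above (the attribute names here are just for illustration):

    container.cpu.usage.kernelmode{systemd.unit.name="nginx.service"} = 123           (metric per mode)
    container.cpu.usage{systemd.unit.name="nginx.service", cpu.mode="kernel"} = 123   (mode as an attribute)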

I'm not 100% sure this is the right approach, so thoughts welcome!

 - Store possible unit states in a slice, to ensure metrics are emitted
   in a deterministic order.
@SquidDev
Author

Have rebased on top of master (again 😭, y'all are too productive!) to fix go.mod/go.sum merge conflicts. Have also updated the version numbers in go.mod, so that should fix the CI failures.

Is there a better way to handle this? I feel I've been doing a lot of manual go.mod maintenance (e.g. merging require blocks, fixing version numbers), and running go mod tidy after everything, so I feel like I might be missing something!

Have also switched the unit name to be a resource attribute, which I think is a bit more consistent with what the other receivers do.

@nichenke

nichenke commented Sep 3, 2025

The receiver is testing out nicely here. With the recent change to make the unit name a resource attribute, I needed this config in the prometheus exporter to see all the metrics; without it, just a single series is shown.

    resource_to_telemetry_conversion:
      enabled: true
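
For context, a fuller sketch of where this sits in a collector config (the endpoint value and the bare systemd: entry are assumptions for illustration):

    receivers:
      systemd:    # receiver options omitted; see the README added in this PR
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"    # assumed example endpoint
        resource_to_telemetry_conversion:
          enabled: true             # copies resource attributes (e.g. systemd.unit.name) onto each series
    service:
      pipelines:
        metrics:
          receivers: [systemd]
          exporters: [prometheus]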

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Sep 18, 2025
@SquidDev
Author

It's not stale, it's waiting for review.

@SquidDev
Author

SquidDev commented Sep 18, 2025

Is there a better way to handle this? I feel I've been doing a lot of manual go.mod maintenance (e.g. merging require blocks, fixing version numbers), and running go mod tidy after everything, so I feel like I might be missing something!

I'm particularly struggling with making sure the version numbers in the go.mod file are correct after a merge/rebase. If I run .github/workflows/scripts/check-collector-module-version.sh locally (which AFAICT is the job that checks this), then I get tonnes of spurious changes like go.opentelemetry.io/collector/pipeline v1.41.1-0.20250911155607-37a3ace6274c → v1.39.0.

Edit: Ahh, I see. I need to run make genotelcontribcol first.
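
So the working order seems to be, roughly (run from the repository root):

    make genotelcontribcol
    .github/workflows/scripts/check-collector-module-version.sh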

@github-actions github-actions bot removed the Stale label Sep 19, 2025
@josepcorrea

Hi @SquidDev and reviewers,

This is a fantastic and highly anticipated feature for the systemdreceiver. Great work getting this to the testing phase!

I would like to help you and the team by running functional and integration tests on my end, specifically focusing on validating the metric output and the behavior of the resource_to_telemetry_conversion.

To ensure I'm testing against the correct environment and configuration (especially regarding D-Bus stubbing/mocking, as you mentioned), could you please provide a brief guide or an example configuration for how to:

  1. Run the receiver with the current branch/PR implementation in a testing environment (e.g., within a Docker container or against a specific systemd setup).
  2. Generate the expected unit states (e.g., forcing a unit into 'failed' or 'reloading' state) to check the binary metric logic.

I will follow up here with any functional test findings, bug reports, or potential suggestions for improvements as soon as I can replicate the setup.

Thanks for your efforts!

@SquidDev
Author

SquidDev commented Oct 6, 2025

@josepcorrea Thanks for helping with this!

Run the receiver with the current branch/PR implementation

  1. Currently the systemd receiver is not built as part of the collector by default. It can be enabled by adding gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/systemdreceiver v0.136.0 to the receivers: block of cmd/otelcontribcol/builder-config.yaml (see the sketch after this list), and then rebuilding the collector (make genotelcontribcol && make otelcontribcol).
  2. For testing itself, I have put together a Dockerfile + config. This includes two services (otel-test-active, otel-test-failed), which will be active and failed respectively.
  3. The collector can then be started by running otelcontribcol /etc/otel/config.yaml inside the container (via a separate docker exec).
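
A sketch of the builder-config.yaml change from step 1 (the surrounding entries are elided; only the gomod line comes from this comment):

    receivers:
      # ... existing receiver entries ...
      - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/systemdreceiver v0.136.0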

Generate the expected unit states

So active/inactive/failed should be easy (for inactive, just systemctl stop otel-test-active). I think the other states are trickier; you can probably create them with a notify service type and a short Python script1, but I'm not sure it's worth it. At that point you're pretty much just testing systemd rather than the collector!

Footnotes

  1. e.g. reloading can be created by sending RELOADING=1 via sd_notify, and likewise with deactivating and STOPPING=1.
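
For example, a minimal Go sketch of such a notifier (instead of the Python script suggested above), using github.com/coreos/go-systemd/v22; the package choice and the timings here are assumptions, not part of this PR:

    // Run this as the ExecStart of a Type=notify unit. After reporting ready,
    // it flips the unit into the "reloading" active state for a while so the
    // receiver can observe it.
    package main

    import (
        "time"

        "github.com/coreos/go-systemd/v22/daemon"
    )

    func main() {
        daemon.SdNotify(false, "READY=1")     // unit becomes "active"
        time.Sleep(5 * time.Second)
        daemon.SdNotify(false, "RELOADING=1") // unit shows as "reloading"
        time.Sleep(30 * time.Second)          // window in which to scrape
        daemon.SdNotify(false, "READY=1")     // back to "active"
    }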

@atoulme
Contributor

atoulme commented Oct 7, 2025

Please resolve the conflicts when you get a chance, I will review

enabled: true

attributes:
  systemd.unit.active_state:
Contributor

Suggested change:
- systemd.unit.active_state:
+ systemd.unit.state:

Why not state?

Author

Systemd units have three states: load state (whether the unit has been loaded), active state (whether it is running), and sub state (a unit-defined state).

From discussions in #37169 and #33532, we don't want to expose all three due to cardinality concerns, but it felt helpful to make explicit which of the states we were referring to.

Contributor

@atoulme atoulme left a comment

LGTM, but please help me understand: why use active_state instead of state?

@SquidDev
Author

Is there anything else needed from me for this PR, or is it just waiting for a decision on state vs active_state?

Comment on lines +35 to +52
    var conn *dbus.Conn
    switch s.cfg.Scope {
    case "system":
        conn, err = dbus.ConnectSystemBus(dbus.WithContext(ctx))
    case "user":
        conn, err = dbus.ConnectSessionBus(dbus.WithContext(ctx))
    default:
        return errInvalidScope
    }

    if err != nil {
        return err
    }

    s.conn = conn

    return err
}
Contributor

nit: unless dbus.Connect* doesn't follow the typical Go convention (i.e. conn should be nil if err is not nil), the code can follow the usual pattern:

Suggested change:

-     var conn *dbus.Conn
-     switch s.cfg.Scope {
-     case "system":
-         conn, err = dbus.ConnectSystemBus(dbus.WithContext(ctx))
-     case "user":
-         conn, err = dbus.ConnectSessionBus(dbus.WithContext(ctx))
-     default:
-         return errInvalidScope
-     }
-     if err != nil {
-         return err
-     }
-     s.conn = conn
-     return err
- }
+     switch s.cfg.Scope {
+     case "system":
+         s.conn, err = dbus.ConnectSystemBus(dbus.WithContext(ctx))
+     case "user":
+         s.conn, err = dbus.ConnectSessionBus(dbus.WithContext(ctx))
+     default:
+         err = errInvalidScope
+     }
+     return err
+ }

resourceMetrics:
  - resource:
      attributes:
        - key: systemd.unit.name
Contributor

Hmm, my tendency is to have this as a metric attribute. Why have it as a resource attribute?

Author

@SquidDev SquidDev Oct 21, 2025

I originally had these as metric attributes. However, when adding cgroup support to this receiver, it felt like having a resource per unit made a bit more sense, as the metrics are a bit more grouped together. This is what we do for the various container receivers (Docker, Podman, etc.), for example.

Contributor

@pjanotti pjanotti Oct 21, 2025

It is a bit of a gray area; right now I'm not convinced that it is the better trade-off here. That said, the component is still in development, so we can experiment and adjust as appropriate.

@atoulme
Contributor

atoulme commented Oct 21, 2025

Please take a look at the failing changelog check. We can merge as is and address feedback afterwards.

@SquidDev
Author

SquidDev commented Oct 21, 2025

Ahh, sorry! I thought I'd caught all the upstream changes that had gone in (I'd fixed the issues caused by a gofumpt bump), but clearly missed this one. All fixed!

@SquidDev
Author

And of course there's another one! 😱

CONTRIBUTING.md currently tells you to manually run a dozen make commands when creating a PR, which I did! But I then didn't run all of them when resolving merge conflicts. I wonder if it would be worth recommending make checks instead (I've only just discovered it), as it feels a bit easier to remember.
