Conversation

@SquidDev

Description

This is a start at implementing #33532. I'm trying to do this in relatively minor increments to keep it reviewable, so this only adds support for reporting units' active state.

This follows a similar approach to the httpcheck receiver: there is an attribute value for every possible state, with the data point for the current state set to 1 and the rest set to 0.

systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="active"} = 1
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="reloading"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="inactive"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="failed"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="activating"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="deactivating"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="maintenance"} = 0
systemd.unit.state{systemd.unit.name="nginx", systemd.unit.active_state="refreshing"} = 0
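
For illustration, a minimal sketch of the recording loop behind this (the names unit, record, and now are hypothetical stand-ins, not the generated mdatagen code):

    // Sketch only: `unit` and `record` stand in for the receiver's dbus unit
    // data and the generated recording function.
    possibleStates := []string{
        "active", "reloading", "inactive", "failed",
        "activating", "deactivating", "maintenance", "refreshing",
    }
    for _, state := range possibleStates {
        var value int64
        if state == unit.ActiveState {
            value = 1
        }
        // Emits systemd.unit.state{systemd.unit.active_state=state} = value.
        record(now, value, state)
    }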

Link to tracking issue

Part of #33532, but possibly not enough to mark as fixed.

Testing

I have done some manual testing and confirmed this generates the metrics I expected. Aside from the auto-generated tests, I'm afraid I've not written any unit/integration tests; I wanted some advice here first about whether it is better to mock out the dbus interface or to run the tests against a real systemd instance.

Documentation

I've updated the README of the systemdreceiver component to mention the metric exposed, and the configuration options.

@SquidDev SquidDev requested review from a team and atoulme as code owners August 17, 2025 17:48
@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 17, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@SquidDev
Author

Sorry about those bugs! I'm afraid this is very much baby's first Go code — I thought I'd been so careful about running the make tasks locally, but clearly not!

@SquidDev
Author

Have rebased on top of master, and tried to stub out the dbus connection, so we can write some tests for this. Coverage is now at the requested 80% :).

I'm aware this PR is now well over the requested 500 lines. The majority of this is generated code (or golden test data), so hopefully it shouldn't be too bad to review, but let me know if there's a way to split this up further.

@nichenke

Following along, definitely interested. I'm poking at the cgroups side of things locally; I'm not sure whether we need to vendor in cadvisor like awscontainerinsightreceiver does, or whether the simpler podman/k8s/docker stats approaches will work.

@SquidDev
Author

SquidDev commented Aug 27, 2025

For cgroups, I'm currently just using https://github.com/containerd/cgroups/ directly. This does make supporting cgroups v1 and v2 a bit more awkward, though; I'm currently just focusing on the latter.

I'd actually been drawing more inspiration from the host metrics receiver (as much as I can; sadly cgroups has a different model for a lot of things). So CPU usage mode (kernel vs user) is an attribute, rather than there being separate container.cpu.usage.{kernel,user}mode metrics.
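
Roughly the difference between the two shapes, in the same notation as above (the attribute names here are just for illustration):

    container.cpu.usage.kernelmode{systemd.unit.name="nginx.service"} = 123           (metric per mode)
    container.cpu.usage{systemd.unit.name="nginx.service", cpu.mode="kernel"} = 123   (mode as an attribute)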

I'm not 100% sure this is the right approach, so thoughts welcome!

 - Store possible unit states in a slice, to ensure metrics are emitted
   in a deterministic order.
@SquidDev
Author

Have rebased on top of master (again 😭, y'all are too productive!) to fix go.mod/go.sum merge conflicts. Have also updated the version numbers in go.mod, so that should fix the CI failures.

Is there a better way to handle this? I feel I've been doing a lot of manual go.mod maintenance (e.g. merging require blocks, fixing version numbers), and running go mod tidy after everything, so I feel like I might be missing something!

Have also switched the unit name to be a resource attribute, which I think is a bit more consistent with what the other receivers do.

@nichenke

nichenke commented Sep 3, 2025

The receiver is testing out nicely here. With the recent change to make the unit name a resource attribute, I needed this config in the prometheus exporter to see all the metrics; without it, just a single series is shown.

    resource_to_telemetry_conversion:
      enabled: true
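
For context, a fuller sketch of where this sits in a collector config (the endpoint value and the bare systemd: entry are assumptions for illustration):

    receivers:
      systemd:    # receiver options omitted; see the README added in this PR
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"    # assumed example endpoint
        resource_to_telemetry_conversion:
          enabled: true             # copies resource attributes (e.g. systemd.unit.name) onto each series
    service:
      pipelines:
        metrics:
          receivers: [systemd]
          exporters: [prometheus]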

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Sep 18, 2025
@SquidDev
Author

It's not stale, it's waiting for review.

@SquidDev
Author

SquidDev commented Sep 18, 2025

Is there a better way to handle this? I feel I've been doing a lot of manual go.mod maintenance (e.g. merging require blocks, fixing version numbers), and running go mod tidy after everything, so I feel like I might be missing something!

I'm particularly struggling with making sure the version numbers in the go.mod file are correct after a merge/rebase. If I run .github/workflows/scripts/check-collector-module-version.sh locally (which AFAICT is the job that checks this), then I get tonnes of spurious changes like go.opentelemetry.io/collector/pipeline v1.41.1-0.20250911155607-37a3ace6274c → v1.39.0.

Edit: Ahh, I see. I need to run make genotelcontribcol first.
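
So the working order seems to be, roughly (run from the repository root):

    make genotelcontribcol
    .github/workflows/scripts/check-collector-module-version.sh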

@github-actions github-actions bot removed the Stale label Sep 19, 2025
@josepcorrea

Hi @SquidDev and reviewers,

This is a fantastic and highly anticipated feature for the systemdreceiver. Great work getting this to the testing phase!

I would like to help you and the team by running functional and integration tests on my end, specifically focusing on validating the metric output and the behavior of the resource_to_telemetry_conversion.

To ensure I'm testing against the correct environment and configuration (especially regarding D-Bus stubbing/mocking, as you mentioned), could you please provide a brief guide or an example configuration for how to:

  1. Run the receiver with the current branch/PR implementation in a testing environment (e.g., within a Docker container or against a specific systemd setup).
  2. Generate the expected unit states (e.g., forcing a unit into 'failed' or 'reloading' state) to check the binary metric logic.

I will follow up here with any functional test findings, bug reports, or potential suggestions for improvements as soon as I can replicate the setup.

Thanks for your efforts!

@SquidDev
Author

SquidDev commented Oct 6, 2025

@josepcorrea Thanks for helping with this!

Run the receiver with the current branch/PR implementation

  1. Currently the systemd receiver is not built as part of the collector by default. It can be enabled by adding gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/systemdreceiver v0.136.0 to the receivers: block of cmd/otelcontribcol/builder-config.yaml (see the sketch after this list), and then rebuilding the collector (make genotelcontribcol && make otelcontribcol).
  2. For testing itself, I have put together a Dockerfile + config. This includes two services (otel-test-active, otel-test-failed), which will be active and failed respectively.
  3. The collector can then be started by running otelcontribcol /etc/otel/config.yaml inside the container (via a separate docker exec).
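
A sketch of the builder-config.yaml change from step 1 (the surrounding entries are elided; only the gomod line comes from this comment):

    receivers:
      # ... existing receiver entries ...
      - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/systemdreceiver v0.136.0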

Generate the expected unit states

So active/inactive/failed should be easy (for inactive, just systemctl stop otel-test-active). I think the other states are trickier; you can probably create them with a notify service type and a short Python script1, but I'm not sure it's worth it. At that point you're pretty much just testing systemd rather than the collector!

Footnotes

  1. e.g. reloading can be created by sending RELOADING=1 via sd_notify, and likewise with deactivating and STOPPING=1.
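
For example, a minimal Go sketch of such a notifier (instead of the Python script suggested above), using github.com/coreos/go-systemd/v22; the package choice and the timings here are assumptions, not part of this PR:

    // Run this as the ExecStart of a Type=notify unit. After reporting ready,
    // it flips the unit into the "reloading" active state for a while so the
    // receiver can observe it.
    package main

    import (
        "time"

        "github.com/coreos/go-systemd/v22/daemon"
    )

    func main() {
        daemon.SdNotify(false, "READY=1")     // unit becomes "active"
        time.Sleep(5 * time.Second)
        daemon.SdNotify(false, "RELOADING=1") // unit shows as "reloading"
        time.Sleep(30 * time.Second)          // window in which to scrape
        daemon.SdNotify(false, "READY=1")     // back to "active"
    }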

@atoulme
Contributor

atoulme commented Oct 7, 2025

Please resolve the conflicts when you get a chance, I will review

enabled: true

attributes:
  systemd.unit.active_state:
Contributor

Suggested change:
- systemd.unit.active_state:
+ systemd.unit.state:

Why not state?

Author

Systemd units have three states: load state (whether the unit has been loaded), active state (whether it is running), and sub state (a unit-defined state).

From discussions in #37169 and #33532, we don't want to expose all three due to cardinality concerns, but it felt helpful to make explicit which of the states we were referring to.

Contributor

@atoulme atoulme left a comment

LGTM, but please help me understand: why use active_state instead of state?

@SquidDev
Author

Is there anything else needed from me for this PR, or is it just waiting for a decision on state vs active_state?

Comment on lines +35 to +52
    var conn *dbus.Conn
    switch s.cfg.Scope {
    case "system":
        conn, err = dbus.ConnectSystemBus(dbus.WithContext(ctx))
    case "user":
        conn, err = dbus.ConnectSessionBus(dbus.WithContext(ctx))
    default:
        return errInvalidScope
    }

    if err != nil {
        return err
    }

    s.conn = conn

    return err
}
Contributor

nit: unless dbus.Connect* doesn't follow the typical Go convention (i.e. conn should be nil if err is not nil), the code can follow the usual pattern:

Suggested change:

-     var conn *dbus.Conn
-     switch s.cfg.Scope {
-     case "system":
-         conn, err = dbus.ConnectSystemBus(dbus.WithContext(ctx))
-     case "user":
-         conn, err = dbus.ConnectSessionBus(dbus.WithContext(ctx))
-     default:
-         return errInvalidScope
-     }
-     if err != nil {
-         return err
-     }
-     s.conn = conn
-     return err
- }
+     switch s.cfg.Scope {
+     case "system":
+         s.conn, err = dbus.ConnectSystemBus(dbus.WithContext(ctx))
+     case "user":
+         s.conn, err = dbus.ConnectSessionBus(dbus.WithContext(ctx))
+     default:
+         err = errInvalidScope
+     }
+     return err
+ }

resourceMetrics:
  - resource:
      attributes:
        - key: systemd.unit.name
Contributor

Hmm, my tendency is to have this as a metric attribute. Why have it as a resource attribute?

Author

@SquidDev SquidDev Oct 21, 2025

I originally had these as metric attributes. However, when adding cgroup support to this receiver, it felt like having a resource per unit made a bit more sense, as the metrics are a bit more grouped together. This is what we do for the various container receivers (Docker, Podman, etc.), for example.

Contributor

@pjanotti pjanotti Oct 21, 2025

It is a bit of a gray area; right now I'm not convinced that it is the better trade-off here. That said, the component is still in development, so we can experiment and adjust as appropriate.

@atoulme
Contributor

atoulme commented Oct 21, 2025

Please take a look at the failing changelog check. We can merge as is and address feedback afterwards.

@SquidDev
Author

SquidDev commented Oct 21, 2025

Ahh, sorry! I thought I'd caught all the upstream changes that had gone in (I'd fixed the issues caused by a gofumpt bump), but clearly missed this one. All fixed!

@SquidDev
Author

And of course there's another one! 😱

CONTRIBUTING.md currently tells you to manually run a dozen make commands when creating a PR, which I did! But I then didn't run all of them when resolving merge conflicts. I wonder if it would be worth recommending make checks instead (I've only just discovered it), as it feels a bit easier to remember.
