-
Notifications
You must be signed in to change notification settings - Fork 377
cgroup-systemd: Always set traditional device rules as backup when using BPF #1854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's guide (collapsed on small PRs)Reviewer's GuideModified cgroup-systemd append_resources logic to always include traditional device rules as a fallback when BPF is in use, by removing the conditional check on devices_set and relying solely on update status and resource presence. Class diagram for updated append_resources logic in cgroup-systemdclassDiagram
class append_resources {
+sd_bus_message *m
+resources
+err
+is_update
+devices_set
+append_devices(m, resources, err)
- Conditional logic for device rules
}
append_resources --> append_devices: calls
class append_devices {
+sd_bus_message *m
+resources
+err
}
Flow diagram for device rule application in append_resourcesflowchart TD
A[Start append_resources] --> B{Is update or resources.devices present?}
B -- Yes --> C[Call append_devices to set traditional device rules]
B -- No --> D[Skip setting traditional device rules]
C --> E[Continue with resource application]
D --> E
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
…ing BPF The current logic only sets traditional device rules when BPF is not attempted (*devices_set == false). However, systemd can fail internally to apply BPF programs due to permission issues, SELinux policies, or other environment-specific restrictions, while still accepting the BPF program via D-Bus without error. Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
e664a1c
to
bb74129
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ephemeral COPR build failed. @containers/packit-build please check. |
1 similar comment
Ephemeral COPR build failed. @containers/packit-build please check. |
# undef APPEND_UINT64_VALUE | ||
|
||
if (! *devices_set && (! is_update || resources->devices)) | ||
if (! is_update || resources->devices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like it unconditionally does the backup, even if we added bpf program above. should we save whether we did and branch on that here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good catch!
Edit: Let me think out loud here about the tradeoffs:
Scenario 1: BPF not attempted
- No BPF rules sent
- Traditional rules set
- Works
Scenario 2: BPF attempted, works in systemd
- BPF rules + Traditional rules sent
- systemd uses BPF (ignores traditional)
- Works optimally
Scenario 3: BPF attempted, fails in systemd (NVIDIA case)
- BPF rules + Traditional rules sent
- systemd BPF fails internally
- systemd falls back to traditional rules
- Works (the current fix)
The core issue is a silent failure gap: crun successfully sends the BPF program to systemd via D-Bus (so from crun's perspective, BPF "succeeded"), but systemd can then fail internally to apply that BPF program due to permission issues, SELinux policies, etc. This internal failure isn't communicated back to crun.
Any flag we track would indicate "BPF was sent to systemd" rather than "BPF was successfully applied by systemd" - which means we'd still hit the same bug where we skip traditional rules based on incomplete information.
IMHO the fix belongs to systemd, which should return an error (from Now, if that's the case, crun can retry again without bpf. |
are there cases (except the SELinux misconfiguration issue we had and the tiny additional step of installing the program in the bpf file system) where we would fail with The main advantage of BPFProgram was that we could drop the less flexible systemd way of setting the devices cgroup. |
Yeah, that makes sense. However, systemd allows units to start even if BPF programs fail. This fix follows crun's defensive programming style, handling failures gracefully without relying on external components.
There could be a few cases but I need to dig deeper on that side. |
runc also replicates the similar logic: 1. generateDeviceProperties() → Creates DeviceAllow rules
2. setUnitProperties() → Sends DeviceAllow to systemd
3. m.fsMgr.Set() → ALSO sets eBPF programs directly
4. Result: Both DeviceAllow + eBPF (duplication, but works!) |
if the eBPF cannot be installed it must be reported to crun that will propagate the error. Has the issue been reported to systemd? |
I agree the systemd should report that error back to crun which is silently swallowing BPF failures here. A PR can be filed to address that issue, but in the meantime a stop-gap solution is required to handle the NVIDIA GPU Operator problem. After reverting the BPF changes and testing on OCP, the NVIDIA GPU Operator installs successfully and works as expected. |
I am fine if this is a temporary workaround, but having to set the devices controller in two different ways seems wrong to me; so we either use the BPFProgram (as I think we should), or we gave up on that and use the traditional systemd way of setting it up. This has some limitations though as it can't fully express the OCI specs. Could you please file an issue for systemd and link it here? |
Just a note: systemd[1]: crio-*.scope: bpf-foreign: Failed to create foreign BPF program: Permission denied
systemd[1]: crio-*.scope: bpf-foreign: Failed to prepare foreign BPF hashmap: Permission denied SELinux AVC Denials: (SELinux is blocking systemd from running BPF programs created by crun: type=AVC msg=audit(...): avc: denied { prog_run } for pid=1 comm="systemd"
scontext=system_u:system_r:init_t:s0
tcontext=system_u:system_r:container_runtime_t:s0
tclass=bpf permissive=0 |
Should we handle that SELinux policy in https://github.com/containers/container-selinux? |
isn't this already done with containers/container-selinux@eba6a1e ? |
Ah, nice! I think the package hasn't been updated to the latest one (which has that fix). |
Issue Summary and Path Forward: I think @giuseppe's analysis in the bug report makes a complete sense. The NVIDIA Operator's runtime hook directly modifies cgroups, which fundamentally conflicts with how crun is designed to work. crun delegates cgroup management to systemd, so when the NVIDIA hook writes directly to cgroups, it's essentially working against crun's architecture. This wasn't a problem historically because: crun handed off cgroup handling to systemd -> systemd used traditional DeviceAllow properties -> NVIDIA hook could modify cgroups directly -> Both approaches played nicely together Post-eBPF change (crun ≥ 1.23): crun attempts eBPF first (BPFProgram=device) -> If eBPF succeeds, devices_set = true -> Traditional DeviceAllow paths get skipped -> NVIDIA hook still modifies cgroups directly -> This creates conflicts with the eBPF approach Here are the action items from this analysis:
Bottom Line: |
runc does not use BPFProgram= (yet). |
Ah, good catch! Yeah, you are right. With systemd, you can only stick with DeviceAllow properties. |
systemd ignores errors with BPFProgram so make sure the ebpf program is installed when requested. Alternative to containers#1854 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
opened an alternative PR: #1865 |
systemd ignores errors with BPFProgram so make sure the ebpf program is installed when requested. Alternative to containers#1854 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
systemd ignores errors with BPFProgram so make sure the ebpf program is installed when requested. Alternative to containers#1854 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Closing in favor of #1865 |
The current logic only sets traditional device rules when BPF is not attempted (*devices_set == false). However, systemd can fail internally to apply BPF programs due to permission issues, SELinux policies, or other environment-specific restrictions, while still accepting the BPF program via D-Bus without error.
xref: https://issues.redhat.com/browse/OCPBUGS-60663
Summary by Sourcery
Bug Fixes: