Skip to content

Conversation

sohankunkerkar
Copy link
Member

@sohankunkerkar sohankunkerkar commented Aug 21, 2025

The current logic only sets traditional device rules when BPF is not attempted (*devices_set == false). However, systemd can fail internally to apply BPF programs due to permission issues, SELinux policies, or other environment-specific restrictions, while still accepting the BPF program via D-Bus without error.

xref: https://issues.redhat.com/browse/OCPBUGS-60663

Summary by Sourcery

Bug Fixes:

  • Always append fallback device rules when resources->devices is set, even if BPF programs were attempted

Copy link

sourcery-ai bot commented Aug 21, 2025

Reviewer's guide (collapsed on small PRs)

Reviewer's Guide

Modified cgroup-systemd append_resources logic to always include traditional device rules as a fallback when BPF is in use, by removing the conditional check on devices_set and relying solely on update status and resource presence.

Class diagram for updated append_resources logic in cgroup-systemd

classDiagram
    class append_resources {
        +sd_bus_message *m
        +resources
        +err
        +is_update
        +devices_set
        +append_devices(m, resources, err)
        - Conditional logic for device rules
    }
    append_resources --> append_devices: calls
    class append_devices {
        +sd_bus_message *m
        +resources
        +err
    }
Loading

Flow diagram for device rule application in append_resources

flowchart TD
    A[Start append_resources] --> B{Is update or resources.devices present?}
    B -- Yes --> C[Call append_devices to set traditional device rules]
    B -- No --> D[Skip setting traditional device rules]
    C --> E[Continue with resource application]
    D --> E
Loading

File-Level Changes

Change Details Files
Always append traditional device rules as a fallback when using BPF
  • Removed the *devices_set check from the append_resources condition
  • Changed the if statement to trigger on !is_update or resources->devices
  • Ensured append_devices() is called in all relevant scenarios
src/libcrun/cgroup-systemd.c

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

…ing BPF

The current logic only sets traditional device rules when BPF is not
attempted (*devices_set == false). However, systemd can fail internally
to apply BPF programs due to permission issues, SELinux policies, or
other environment-specific restrictions, while still accepting the BPF
program via D-Bus without error.

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!


Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

1 similar comment
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

# undef APPEND_UINT64_VALUE

if (! *devices_set && (! is_update || resources->devices))
if (! is_update || resources->devices)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like it unconditionally does the backup, even if we added bpf program above. should we save whether we did and branch on that here as well?

Copy link
Member Author

@sohankunkerkar sohankunkerkar Aug 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good catch!
Edit: Let me think out loud here about the tradeoffs:
Scenario 1: BPF not attempted

  • No BPF rules sent
  • Traditional rules set
  • Works

Scenario 2: BPF attempted, works in systemd

  • BPF rules + Traditional rules sent
  • systemd uses BPF (ignores traditional)
  • Works optimally

Scenario 3: BPF attempted, fails in systemd (NVIDIA case)

  • BPF rules + Traditional rules sent
  • systemd BPF fails internally
  • systemd falls back to traditional rules
  • Works (the current fix)

The core issue is a silent failure gap: crun successfully sends the BPF program to systemd via D-Bus (so from crun's perspective, BPF "succeeded"), but systemd can then fail internally to apply that BPF program due to permission issues, SELinux policies, etc. This internal failure isn't communicated back to crun.
Any flag we track would indicate "BPF was sent to systemd" rather than "BPF was successfully applied by systemd" - which means we'd still hit the same bug where we skip traditional rules based on incomplete information.

@kolyshkin
Copy link
Collaborator

IMHO the fix belongs to systemd, which should return an error (from StartTransientUnit) when bpf can not be applied. It's a one liner fix and we should have it.

Now, if that's the case, crun can retry again without bpf.

@giuseppe
Copy link
Member

are there cases (except the SELinux misconfiguration issue we had and the tiny additional step of installing the program in the bpf file system) where we would fail with BPFProgram but it succeeds if systemd does it?

The main advantage of BPFProgram was that we could drop the less flexible systemd way of setting the devices cgroup.

@sohankunkerkar
Copy link
Member Author

IMHO the fix belongs to systemd, which should return an error (from StartTransientUnit) when bpf can not be applied. It's a one liner fix and we should have it.

Now, if that's the case, crun can retry again without bpf.

Yeah, that makes sense. However, systemd allows units to start even if BPF programs fail. This fix follows crun's defensive programming style, handling failures gracefully without relying on external components.

are there cases (except the SELinux misconfiguration issue we had and the tiny additional step of installing the program in the bpf file system) where we would fail with BPFProgram but it succeeds if systemd does it?

There could be a few cases but I need to dig deeper on that side.

@sohankunkerkar
Copy link
Member Author

runc also replicates the similar logic:

1. generateDeviceProperties() → Creates DeviceAllow rules
2. setUnitProperties() → Sends DeviceAllow to systemd  
3. m.fsMgr.Set() → ALSO sets eBPF programs directly
4. Result: Both DeviceAllow + eBPF (duplication, but works!)

@giuseppe
Copy link
Member

IMHO the fix belongs to systemd, which should return an error (from StartTransientUnit) when bpf can not be applied. It's a one liner fix and we should have it.

Now, if that's the case, crun can retry again without bpf.

Yeah, that makes sense. However, systemd allows units to start even if BPF programs fail. This fix follows crun's defensive programming style, handling failures gracefully without relying on external components.

if the eBPF cannot be installed it must be reported to crun that will propagate the error.

Has the issue been reported to systemd?

@sohankunkerkar
Copy link
Member Author

IMHO the fix belongs to systemd, which should return an error (from StartTransientUnit) when bpf can not be applied. It's a one liner fix and we should have it.

Now, if that's the case, crun can retry again without bpf.

Yeah, that makes sense. However, systemd allows units to start even if BPF programs fail. This fix follows crun's defensive programming style, handling failures gracefully without relying on external components.

if the eBPF cannot be installed it must be reported to crun that will propagate the error.

Has the issue been reported to systemd?

I agree the systemd should report that error back to crun which is silently swallowing BPF failures here. A PR can be filed to address that issue, but in the meantime a stop-gap solution is required to handle the NVIDIA GPU Operator problem. After reverting the BPF changes and testing on OCP, the NVIDIA GPU Operator installs successfully and works as expected.

@giuseppe
Copy link
Member

I am fine if this is a temporary workaround, but having to set the devices controller in two different ways seems wrong to me; so we either use the BPFProgram (as I think we should), or we gave up on that and use the traditional systemd way of setting it up. This has some limitations though as it can't fully express the OCI specs.

Could you please file an issue for systemd and link it here?

@sohankunkerkar
Copy link
Member Author

Just a note:
Systemd failure:

systemd[1]: crio-*.scope: bpf-foreign: Failed to create foreign BPF program: Permission denied
systemd[1]: crio-*.scope: bpf-foreign: Failed to prepare foreign BPF hashmap: Permission denied

SELinux AVC Denials: (SELinux is blocking systemd from running BPF programs created by crun: container_runtime_t context)

type=AVC msg=audit(...): avc: denied { prog_run } for pid=1 comm="systemd" 
scontext=system_u:system_r:init_t:s0 
tcontext=system_u:system_r:container_runtime_t:s0 
tclass=bpf permissive=0

@sohankunkerkar
Copy link
Member Author

Should we handle that SELinux policy in https://github.com/containers/container-selinux?
allow init_t container_runtime_t:bpf prog_run; seems like a legit case where systemd managing container cgroups. I think policy should support standard container operations.
Thoughts?

@giuseppe
Copy link
Member

Should we handle that SELinux policy in https://github.com/containers/container-selinux?
allow init_t container_runtime_t:bpf prog_run; seems like a legit case where systemd managing container cgroups. I think policy should support standard container operations.
Thoughts?

isn't this already done with containers/container-selinux@eba6a1e ?

@sohankunkerkar
Copy link
Member Author

Should we handle that SELinux policy in https://github.com/containers/container-selinux?
allow init_t container_runtime_t:bpf prog_run; seems like a legit case where systemd managing container cgroups. I think policy should support standard container operations.
Thoughts?

isn't this already done with containers/container-selinux@eba6a1e ?

Ah, nice! I think the package hasn't been updated to the latest one (which has that fix).

@sohankunkerkar
Copy link
Member Author

Issue Summary and Path Forward:
After the NVIDIA GPU Operator failed to start (particularly the validator pods), I dug into the logs and discovered that eBPF permission denials were silently breaking crun's BPF device programs. This patch adds a fallback mechanism to set DeviceAllow rules when BPF fails, but unfortunately, the failure happens too early in the process for crun's fallback logic to even kick in.

I think @giuseppe's analysis in the bug report makes a complete sense. The NVIDIA Operator's runtime hook directly modifies cgroups, which fundamentally conflicts with how crun is designed to work. crun delegates cgroup management to systemd, so when the NVIDIA hook writes directly to cgroups, it's essentially working against crun's architecture.

This wasn't a problem historically because:
Pre-eBPF change(crun < 1.23):

crun handed off cgroup handling to systemd -> systemd used traditional DeviceAllow properties -> NVIDIA hook could modify cgroups directly -> Both approaches played nicely together

Post-eBPF change (crun ≥ 1.23):

crun attempts eBPF first (BPFProgram=device) -> If eBPF succeeds, devices_set = true -> Traditional DeviceAllow paths get skipped -> NVIDIA hook still modifies cgroups directly -> This creates conflicts with the eBPF approach

Here are the action items from this analysis:

  • For the BPF Permission Denials: The SELinux update in container-selinux commit eba6a1e should resolve this, but it needs to be backported to the RHEL packages.
  • For the crun Fallback mechanism: This PR ensures DeviceAllow rules are set when BPF fails. However, systemd currently fails silently on BPF operations, so crun never knows there was a problem. We should file a systemd bug to ensure BPF operation failures are properly reported back to crun and toggle this logic accordingly.
  • For the NVIDIA Operator Runtime Hook: The operator needs to change its approach to work with crun's systemd-delegated device management instead of directly manipulating cgroups.

Bottom Line:
This is really a three-layer problem that needs fixes at the SELinux/BPF permission level(already fixed; need a new release), systemd error reporting, and the NVIDIA operator implementation.

@kolyshkin
Copy link
Collaborator

runc also replicates the similar logic:

1. generateDeviceProperties() → Creates DeviceAllow rules
2. setUnitProperties() → Sends DeviceAllow to systemd  
3. m.fsMgr.Set() → ALSO sets eBPF programs directly
4. Result: Both DeviceAllow + eBPF (duplication, but works!)

runc does not use BPFProgram= (yet).

@sohankunkerkar
Copy link
Member Author

runc does not use BPFProgram= (yet).

Ah, good catch! Yeah, you are right. With systemd, you can only stick with DeviceAllow properties.

giuseppe added a commit to giuseppe/crun that referenced this pull request Sep 1, 2025
systemd ignores errors with BPFProgram so make sure the ebpf program
is installed when requested.

Alternative to containers#1854

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe
Copy link
Member

giuseppe commented Sep 1, 2025

opened an alternative PR: #1865

giuseppe added a commit to giuseppe/crun that referenced this pull request Sep 1, 2025
systemd ignores errors with BPFProgram so make sure the ebpf program
is installed when requested.

Alternative to containers#1854

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
giuseppe added a commit to giuseppe/crun that referenced this pull request Sep 1, 2025
systemd ignores errors with BPFProgram so make sure the ebpf program
is installed when requested.

Alternative to containers#1854

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@kolyshkin
Copy link
Collaborator

Closing in favor of #1865

@kolyshkin kolyshkin closed this Sep 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants