Possible failed kernel installation due to efa-nv-peermem kernel module and DKMS #6861

Open
@nyetsche

Rocky 9.5 (and possibly RHEL 9) instances in ParallelCluster can end up with a failed kernel upgrade when dnf update is run after creation. The package install "succeeds", but the initrd (initramfs) is not created, so the OS will not boot correctly.

I've tracked this down to the efa-nv-peermem kernel module when combined with DKMS >= 3.1.4.

The good news is that this is fixed in efa-nv-peermem 1.2.1: https://github.com/amzn/amzn-drivers/releases/tag/efa_nv_peermem_linux_1.2.1 ( commit amzn/amzn-drivers@182e083 ).

The bad news is that the just-released ParallelCluster 3.13.1 installs efa_version 1.41.0, which still ships the 1.1.1 RPM (e.g. aws-efa-installer/RPMS/ROCKYLINUX9/x86_64/efa-nv-peermem-1.1.1-1.el9.x86_64.rpm).

Also, that RPM/kernel module wouldn't be updated on an already running cluster (it's not in yum; it's a Chef recipe that installs from a tarball). A manual fix is to remove the BUILD_DEPENDS="nvidia" line from the module's dkms.conf.
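A minimal sketch of that manual fix, for anyone who wants to script it. The function name disable_build_depends is made up for illustration. It comments the line out rather than deleting it and leaves a .bak copy next to the file. It also matches the array form BUILD_DEPENDS[0]=, in case the conf is written that way.

```shell
# Comment out BUILD_DEPENDS in a dkms.conf so that a missing dependency
# no longer aborts "dkms autoinstall". The conf path is passed in by the
# caller; nothing here assumes a particular install layout.
disable_build_depends() {
    conf="$1"
    if [ ! -f "$conf" ]; then
        echo "no such file: $conf" >&2
        return 1
    fi
    # Prefix the matching line with '#' and keep a backup as <conf>.bak.
    sed -i.bak -E 's/^([[:space:]]*BUILD_DEPENDS(\[[0-9]+\])?=)/#\1/' "$conf"
}
```

DKMS conventionally keeps module sources under /usr/src/<module>-<version>/, so the call would probably be disable_build_depends /usr/src/efa-nv-peermem-1.1.1/dkms.conf (run as root). That path is an assumption; check where the file actually lives on your nodes first.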

For anyone trying to reproduce this, or already experiencing it, here's what you'll see in the dnf update output:

Autoinstall of module efa/2.10.0 for kernel 5.14.0-503.40.1.el9_5.x86_64 (x86_64)
Running the pre_build script.................................... done.
Building module(s)..... done.
Signing module /var/lib/dkms/efa/2.10.0/build/build/src/efa.ko
Found pre-existing /lib/modules/5.14.0-503.40.1.el9_5.x86_64/kernel/drivers/infiniband/hw/efa/efa.ko.xz, archiving for uninstallation
Installing /lib/modules/5.14.0-503.40.1.el9_5.x86_64/extra/efa.ko.xz
Running depmod.... done.

Autoinstall on 5.14.0-503.40.1.el9_5.x86_64 succeeded for module(s) efa.
efa-nv-peermem/1.1.1 autoinstall failed due to missing dependencies: nvidia.

Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
warning: %posttrans(kernel-core-5.14.0-503.40.1.el9_5.x86_64) scriptlet failed, exit status 11

The failure is in the %posttrans scriptlet. It runs after the files in the RPM are installed, and it is the script that creates the needed resources in /boot and sets the boot kernel.

The dkms.conf for efa-nv-peermem has BUILD_DEPENDS="nvidia". This line wasn't a problem until a change in dkms: the commit at dell/dkms@b3eaae1 (present starting in v3.1.4, Dec 19th 2024). Previously, when BUILD_DEPENDS was set but those dependencies were not found, dkms would issue a warning but continue. Now the autoinstall fails, as in the output above.
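To tell whether a node has a dkms new enough to show this behavior, a version comparison along these lines may help. It is only a sketch: the helper names are made up, sort -V assumes GNU coreutils, and the exact text printed by dkms --version can differ between releases, which is why the version number is pulled out with a generic pattern.

```shell
# Return 0 if version $1 >= version $2 (relies on GNU "sort -V").
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Print the first dotted version number found in the given text.
extract_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1
}

# Return 0 if the given "dkms --version" text names a release >= 3.1.4,
# i.e. one where an unmet BUILD_DEPENDS fails the autoinstall.
dkms_is_strict() {
    v=$(extract_version "$1")
    [ -n "$v" ] && version_ge "$v" "3.1.4"
}
```

Usage would be something like: dkms_is_strict "$(dkms --version)" && echo "unmet BUILD_DEPENDS will fail autoinstall here"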

I noticed this with efa-nv-peermem, but the DKMS change would be a problem for any other autoinstalled kernel module that fails a dependency. Our environment now has a tiny check for the case where the kernel exists but its initrd does not:

test -f $(grubby --info=$(grubby --default-kernel) | awk -F= '/^initrd=/ { gsub(/"/, "", $2) ; print $2 }')

If it isn't found, regenerate it with:

dracut --regenerate-all --force
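The check and the regeneration can be combined so that dracut only runs when needed. This is a sketch, not a tested drop-in: the function names are made up, it assumes grubby and dracut are on PATH and that it runs as root. It differs from the one-liner above in two ways: it keeps only the first whitespace-separated token of the initrd= value (in case the entry carries extra items after the image path) and it treats an empty file as missing.

```shell
# Print the initrd path from "grubby --info" output read on stdin.
parse_initrd() {
    awk -F= '/^initrd=/ {
        gsub(/"/, "", $2)      # drop the surrounding quotes
        split($2, parts, " ")  # keep only the first token
        print parts[1]
        exit
    }'
}

# Succeed only if the default kernel's initrd exists and is non-empty.
default_initrd_present() {
    initrd=$(grubby --info="$(grubby --default-kernel)" | parse_initrd)
    [ -n "$initrd" ] && [ -s "$initrd" ]
}

# Regenerate all initramfs images only when the check fails.
ensure_initrd() {
    default_initrd_present || dracut --regenerate-all --force
}
```

Calling ensure_initrd after dnf update (or from a boot-time sanity job) would cover this failure and any other module that trips the same DKMS dependency check.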
