Description
Rocky9.5 (+RHEL9?) instances in ParallelCluster can experience a failed kernel upgrade after creation (dnf update). The package install "succeeds" but the initrd (initramfs) is not created, so the OS will not boot correctly.
I've tracked this down to the efa-nv-peermem kernel module when combined with DKMS >= 3.1.4.
The good news is that this is fixed in efa-nv-peermem 1.2.1: https://github.com/amzn/amzn-drivers/releases/tag/efa_nv_peermem_linux_1.2.1 ( commit amzn/amzn-drivers@182e083 ).
The bad news is that the just-released 3.13.1 installs efa_version 1.41.0, which still ships the 1.1.1 RPM (e.g. aws-efa-installer/RPMS/ROCKYLINUX9/x86_64/efa-nv-peermem-1.1.1-1.el9.x86_64.rpm).
Also, that RPM/kernel module wouldn't be updated on an already running cluster (it's not in yum; it's a Chef recipe that installs from a tarball). A manual fix is to remove the BUILD_DEPENDS=nvidia line from the module's dkms.conf.
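The manual fix could be scripted roughly as follows. This is a minimal sketch, not a tested remediation: the function name strip_build_depends is made up for illustration, and the default path assumes the usual DKMS layout of /usr/src/<module>-<version>/dkms.conf, which may differ on your image.

```shell
#!/usr/bin/env bash
# Sketch: drop the BUILD_DEPENDS line from a dkms.conf, keeping a backup.
# The default path is an assumption based on the usual DKMS source layout.

strip_build_depends() {
    local conf="${1:-/usr/src/efa-nv-peermem-1.1.1/dkms.conf}"
    [ -f "$conf" ] || { echo "not found: $conf" >&2; return 1; }
    # -i.bak edits in place and leaves the original at "$conf.bak".
    sed -i.bak '/^[[:space:]]*BUILD_DEPENDS/d' "$conf"
}
```

Run it as root, passing the real dkms.conf path if it differs from the assumed default. The original file is kept next to it with a .bak suffix in case the change needs to be reverted.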
For anyone reproducing or experiencing this, here's what you'll see in the dnf update output:
Autoinstall of module efa/2.10.0 for kernel 5.14.0-503.40.1.el9_5.x86_64 (x86_64)
Running the pre_build script.................................... done.
Building module(s)..... done.
Signing module /var/lib/dkms/efa/2.10.0/build/build/src/efa.ko
Found pre-existing /lib/modules/5.14.0-503.40.1.el9_5.x86_64/kernel/drivers/infiniband/hw/efa/efa.ko.xz, archiving for uninstallation
Installing /lib/modules/5.14.0-503.40.1.el9_5.x86_64/extra/efa.ko.xz
Running depmod.... done.
Autoinstall on 5.14.0-503.40.1.el9_5.x86_64 succeeded for module(s) efa.
efa-nv-peermem/1.1.1 autoinstall failed due to missing dependencies: nvidia.
Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
warning: %posttrans(kernel-core-5.14.0-503.40.1.el9_5.x86_64) scriptlet failed, exit status 11
The failure is in the %posttrans section. That scriptlet runs after the files in the RPM are installed; it creates the needed resources in /boot and sets the boot kernel.
The dkms.conf for efa-nv-peermem has BUILD_DEPENDS="nvidia"; this BUILD_DEPENDS line wasn't a problem until a change in dkms: the commit at dell/dkms@b3eaae1 (present starting in v3.1.4 on Dec 19th 2024). Previously, when BUILD_DEPENDS was set but those dependencies were not found, dkms would issue a warning but continue. Now the missing dependency makes the autoinstall fail, which is what causes the %posttrans scriptlet failure shown in the output above.
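To see which installed DKMS modules declare BUILD_DEPENDS and could hit the same failure, something like the following may help. It is only a sketch: list_build_depends is a hypothetical helper, and /usr/src as the default search root is an assumption about where the DKMS sources live.

```shell
#!/usr/bin/env bash
# Sketch: list dkms.conf files under a source root that declare BUILD_DEPENDS.
# /usr/src as the default root is an assumption about the DKMS source layout.

list_build_depends() {
    local root="${1:-/usr/src}"
    # -H prints the file name, so each match reads "path:BUILD_DEPENDS=...".
    find "$root" -maxdepth 2 -name dkms.conf \
        -exec grep -H '^[[:space:]]*BUILD_DEPENDS' {} +
}
```

Comparing that list with the output of dkms --version (3.1.4 or later) gives a quick idea of whether a node is exposed.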
We added a tiny check to our environment for the case where the initrd does not exist but the kernel does. While I noticed this with efa-nv-peermem, this change to DKMS would be a problem for any other autoinstall kernel module that fails a dependency.
test -f $(grubby --info=$(grubby --default-kernel) | awk -F= '/^initrd=/ { gsub(/"/, "", $2) ; print $2 }')
If it isn't found, generate it with:
dracut --regenerate-all --force
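The check and the regeneration can be combined into one script, sketched below. The function names are hypothetical. The awk expression is the one from the check above, extended to keep only the first space-separated token in case the initrd entry carries extra tokens after the path; that extension is an assumption, not something observed in this issue.

```shell
#!/usr/bin/env bash
# Sketch: regenerate initramfs images when the default kernel's initrd is missing.

# Read `grubby --info` output on stdin and print the initrd path.
initrd_from_grubby_info() {
    awk -F= '/^initrd=/ {
        gsub(/"/, "", $2)        # drop the surrounding quotes
        split($2, parts, " ")    # keep only the first token (the path)
        print parts[1]
        exit
    }'
}

# Look up the default kernel's initrd and rebuild if the file is absent.
ensure_default_initrd() {
    local initrd
    initrd=$(grubby --info="$(grubby --default-kernel)" | initrd_from_grubby_info)
    if [ -n "$initrd" ] && [ ! -f "$initrd" ]; then
        echo "missing $initrd, regenerating" >&2
        dracut --regenerate-all --force
    fi
}
```

Calling ensure_default_initrd as root after a dnf update (for example from a post-update hook) would catch the case where the kernel was installed but the %posttrans scriptlet bailed out.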