Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
When deploying the GPU operator v25.3.2 on OpenShift Container Platform 4.18.22, the default ClusterPolicy fails to become ready (notReady). The driver daemonset container finishes building the drivers, but the nvidia-operator-validator pod is stuck in Init:CreateContainerError. oc describe pod on the validator shows:
kubelet Started container toolkit-validation
Warning Failed 4m49s (x4 over 5m59s) kubelet Error: container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
Normal Pulled 3m31s (x7 over 6m13s) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:e183dc07e5889bd9e269c320ffad7f61df655f57ecc3aa158c4929e74528420a" already present on machine
Warning BackOff 2m12s (x14 over 6m11s) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-kkpjf_nvidia-gpu-operator(3492f33c-e20f-47ea-9693-dc3304fefd84)
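For additional context, the hook failure can also be traced in the CRI-O journal on the GPU node; a possible way to collect it (the node name placeholder and the validator pod label are assumptions):
oc adm node-logs <gpu-node-name> -u crio | grep nvidia-container-runtime-hook
oc describe pod -n nvidia-gpu-operator -l app=nvidia-operator-validator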
To Reproduce
- Deploy Single Node OCP 4.18.22 on AWS.
- Create a g4dn.xlarge (T4 GPU) MachineSet to add a GPU worker node to the single-node cluster.
- Deploy the NFD operator and a NodeFeatureDiscovery CR instance to label the worker nodes.
- Deploy the latest GPU Operator, v25.3.2.
- Deploy the ClusterPolicy from the CSV alm-examples (see the sketch after this list).
- After about 10 minutes, check whether the GPU stack deployed successfully: oc get pods -n nvidia-gpu-operator
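A sketch of how the default ClusterPolicy can be pulled out of the CSV's alm-examples annotation and applied; the CSV name filter and the jq selection are assumptions, not the exact commands used:
CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator)
oc get "$CSV" -n nvidia-gpu-operator -o jsonpath='{.metadata.annotations.alm-examples}' \
  | jq '.[] | select(.kind == "ClusterPolicy")' \
  | oc apply -f -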
Expected behavior
The GPU operator stack is deployed successfully in the nvidia-gpu-operator namespace, for example:
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-ssx5r 1/1 Running 0 29h
gpu-operator-5fcc456c94-wdq8x 1/1 Running 0 38h
nvidia-container-toolkit-daemonset-tc69k 1/1 Running 0 38h
nvidia-cuda-validator-kzfjn 0/1 Completed 0 54m
nvidia-dcgm-exporter-d2qrj 1/1 Running 0 38h
nvidia-dcgm-x6b6v 1/1 Running 0 38h
nvidia-device-plugin-daemonset-92hr4 1/1 Running 0 38h
nvidia-driver-daemonset-418.94.202508060022-0-wnrz2 2/2 Running 0 38h
nvidia-node-status-exporter-pr2zj 1/1 Running 0 38h
nvidia-operator-validator-v4j75 1/1 Running 0 29h
Environment (please provide the following information):
- GPU Operator Version: v25.3.2
- OS: RHCOS 418.94.202508060022-0 - based on RHEL 9.4
- Kernel Version: 5.14.0-427.81.1.el9_4.x86_64
- Container Runtime Version: crio version 1.31.11-2.rhaos4.18.git65ec77a.el9
- Kubernetes Distro and Version: OpenShift running kubernetes v1.31.11
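The details above can be collected with commands along these lines (node name is a placeholder):
oc version                                 # OpenShift and Kubernetes versions
oc get nodes -o wide                       # OS image, kernel, and container runtime per node
oc debug node/<gpu-node-name> -- chroot /host uname -r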
Information to attach (optional if deemed irrelevant)
nvidia-smi output from the driver daemonset container:
sh-5.1# nvidia-smi
Tue Aug 12 03:27:29 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08 Driver Version: 570.148.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 32C P8 13W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
sh-5.1#
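For reference, the same output can be obtained by exec'ing into the driver daemonset pod; the container name below is an assumption based on the default GPU operator naming:
oc exec -n nvidia-gpu-operator nvidia-driver-daemonset-418.94.202508060022-0-wnrz2 \
  -c nvidia-driver-ctr -- nvidia-smi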
Additional notes:
Workaround that resolved this issue: adding no-cgroups = true to /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml on the GPU worker node. With this change the GPU stack deploys successfully on OCP 4.18.22. The resulting [nvidia-container-cli] section:
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
no-cgroups = true
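A sketch of applying the same workaround from a debug pod on the GPU node; the sed insertion point and the pod label used for the restart are assumptions, and the config path is the one from the report above:
oc debug node/<gpu-node-name> -- chroot /host \
  sed -i '/^\[nvidia-container-cli\]/a no-cgroups = true' \
  /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
# restart the failing validator pod so it picks up the new config
oc delete pod -n nvidia-gpu-operator -l app=nvidia-operator-validator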