OCP 4.18.22: nvidia-operator-validator pod in Init:CreateContainerError - error executing hook /usr/local/nvidia/toolkit/nvidia-container-runtime-hook (exit code: 1) #1598

@wabouhamad

Description

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
When deploying the GPU operator v25.3.2 on OpenShift Container Plartform version 4.18.22, deploying the default clusterpolicy fails (notReady). The driver daemoset container completes buidling the drivers but the nvidia-operator-validator pod is in Init:CreateContainerError. oc describe pod on on the validator shows:

kubelet            Started container toolkit-validation
  Warning  Failed          4m49s (x4 over 5m59s)   kubelet            Error: container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
  Normal   Pulled          3m31s (x7 over 6m13s)   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:e183dc07e5889bd9e269c320ffad7f61df655f57ecc3aa158c4929e74528420a" already present on machine
  Warning  BackOff         2m12s (x14 over 6m11s)  kubelet            Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-kkpjf_nvidia-gpu-operator(3492f33c-e20f-47ea-9693-dc3304fefd84)
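
The events above come from describing the failing validator pod; the pod name is specific to this cluster and will differ elsewhere:

$ oc describe pod nvidia-operator-validator-kkpjf -n nvidia-gpu-operator
$ oc get events -n nvidia-gpu-operator --sort-by=.lastTimestamp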

To Reproduce

  • Deploy Single Node OCP 4.18.22 on AWS.
  • Create a g4dn.xlarge (T4 GPU) machineset to add a GPU worker node to this single-node cluster.
  • Deploy the NFD operator and a nodefeaturediscovery CR instance to label the worker nodes.
  • Deploy the latest GPU Operator, v25.3.2.
  • Deploy the clusterpolicy from the CSV alm-examples annotation (see the sketch after this list).
  • After 10 minutes, check whether the GPU stack was deployed successfully: oc get pods -n nvidia-gpu-operator
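
A minimal sketch of the last two steps, assuming the operator CSV is named gpu-operator-certified.v25.3.2 (verify with the first command) and that the ClusterPolicy example is the first entry in the CSV's alm-examples annotation:

$ oc get csv -n nvidia-gpu-operator
$ oc get csv gpu-operator-certified.v25.3.2 -n nvidia-gpu-operator \
    -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
$ oc apply -f clusterpolicy.json
$ oc get pods -n nvidia-gpu-operator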

Expected behavior
The GPU operator stack is deployed successfully in the nvidia-gpu-operator namespace, for example:

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-ssx5r                           1/1     Running     0          29h
gpu-operator-5fcc456c94-wdq8x                         1/1     Running     0          38h
nvidia-container-toolkit-daemonset-tc69k              1/1     Running     0          38h
nvidia-cuda-validator-kzfjn                           0/1     Completed   0          54m
nvidia-dcgm-exporter-d2qrj                            1/1     Running     0          38h
nvidia-dcgm-x6b6v                                     1/1     Running     0          38h
nvidia-device-plugin-daemonset-92hr4                  1/1     Running     0          38h
nvidia-driver-daemonset-418.94.202508060022-0-wnrz2   2/2     Running     0          38h
nvidia-node-status-exporter-pr2zj                     1/1     Running     0          38h
nvidia-operator-validator-v4j75                       1/1     Running     0          29h

Environment (please provide the following information):

  • GPU Operator Version: v25.3.2
  • OS: RHCOS 418.94.202508060022-0 - based on RHEL 9.4
  • Kernel Version: 5.14.0-427.81.1.el9_4.x86_64
  • Container Runtime Version: crio version 1.31.11-2.rhaos4.18.git65ec77a.el9
  • Kubernetes Distro and Version: OpenShift running kubernetes v1.31.11

Information to attach (optional if deemed irrelevant)

must-gather logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/rh-ecosystem-edge_nvidia-ci/268/pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.18-stable-nvidia-gpu-operator-e2e-25-3-x/1954619565199593472/artifacts/nvidia-gpu-operator-e2e-25-3-x/gpu-operator-e2e/artifacts/gpu-operator-tests-must-gather/gpu-must-gather/

nvidia-smi output from the driver container daemonset:

sh-5.1# nvidia-smi
Tue Aug 12 03:27:29 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   32C    P8             13W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
sh-5.1# 
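
The output above can be collected from the driver daemonset pod, for example as below (the container name nvidia-driver-ctr is assumed from the default driver daemonset; adjust if it differs):

$ oc exec -n nvidia-gpu-operator ds/nvidia-driver-daemonset-418.94.202508060022-0 \
    -c nvidia-driver-ctr -- nvidia-smi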

Additional notes:

Workaround that resolved this issue:

containers/podman#16101

Adding no-cgroups = true to the /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml file on the GPU worker node resolved the issue; the GPU stack now deploys successfully on OCP 4.18.22. The resulting [nvidia-container-cli] section:

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"
  no-cgroups = true
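
A sketch of one way to apply this change from the cluster, using oc debug (the node name is a placeholder, and the sed expression simply appends the key after the [nvidia-container-cli] section header, assuming it is not already set there). Afterwards, delete the failed validator pod so it is recreated:

$ oc debug node/<gpu-worker-node> -- chroot /host \
    sed -i '/^\[nvidia-container-cli\]/a no-cgroups = true' \
    /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
$ oc delete pod nvidia-operator-validator-kkpjf -n nvidia-gpu-operator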
