OCP 4.18.22: nvidia-operator-validator pod in Init:CreateContainerError - error executing hook /usr/local/nvidia/toolkit/nvidia-container-runtime-hook (exit code: 1) #1598

@wabouhamad

Description

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
When deploying the GPU operator v25.3.2 on OpenShift Container Plartform version 4.18.22, deploying the default clusterpolicy fails (notReady). The driver daemoset container completes buidling the drivers but the nvidia-operator-validator pod is in Init:CreateContainerError. oc describe pod on on the validator shows:

kubelet            Started container toolkit-validation
  Warning  Failed          4m49s (x4 over 5m59s)   kubelet            Error: container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
  Normal   Pulled          3m31s (x7 over 6m13s)   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:e183dc07e5889bd9e269c320ffad7f61df655f57ecc3aa158c4929e74528420a" already present on machine
  Warning  BackOff         2m12s (x14 over 6m11s)  kubelet            Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-kkpjf_nvidia-gpu-operator(3492f33c-e20f-47ea-9693-dc3304fefd84)
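
The events above come from describing the failing validator pod; the pod name is specific to this cluster and will differ elsewhere:

$ oc describe pod nvidia-operator-validator-kkpjf -n nvidia-gpu-operator
$ oc get events -n nvidia-gpu-operator --sort-by=.lastTimestamp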

To Reproduce

  • Deploy Single Node OCP 4.18.22 on AWS.
  • Create a g4dn.xlarge (T4 GPU) machineset to add a GPU worker node to this single-node cluster.
  • Deploy the NFD operator and a nodefeaturediscovery CR instance to label the worker nodes.
  • Deploy the latest GPU Operator, v25.3.2.
  • Deploy the clusterpolicy from the CSV alm-examples annotation (see the sketch after this list).
  • After 10 minutes, check whether the GPU stack was deployed successfully: oc get pods -n nvidia-gpu-operator
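
A minimal sketch of the last two steps, assuming the operator CSV is named gpu-operator-certified.v25.3.2 (verify with the first command) and that the ClusterPolicy example is the first entry in the CSV's alm-examples annotation:

$ oc get csv -n nvidia-gpu-operator
$ oc get csv gpu-operator-certified.v25.3.2 -n nvidia-gpu-operator \
    -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
$ oc apply -f clusterpolicy.json
$ oc get pods -n nvidia-gpu-operator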

Expected behavior
The GPU operator stack is deployed successfully in the nvidia-gpu-operator namespace, for example:

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-ssx5r                           1/1     Running     0          29h
gpu-operator-5fcc456c94-wdq8x                         1/1     Running     0          38h
nvidia-container-toolkit-daemonset-tc69k              1/1     Running     0          38h
nvidia-cuda-validator-kzfjn                           0/1     Completed   0          54m
nvidia-dcgm-exporter-d2qrj                            1/1     Running     0          38h
nvidia-dcgm-x6b6v                                     1/1     Running     0          38h
nvidia-device-plugin-daemonset-92hr4                  1/1     Running     0          38h
nvidia-driver-daemonset-418.94.202508060022-0-wnrz2   2/2     Running     0          38h
nvidia-node-status-exporter-pr2zj                     1/1     Running     0          38h
nvidia-operator-validator-v4j75                       1/1     Running     0          29h

Environment (please provide the following information):

  • GPU Operator Version: v25.3.2
  • OS: RHCOS 418.94.202508060022-0 - based on RHEL 9.4
  • Kernel Version: 5.14.0-427.81.1.el9_4.x86_64
  • Container Runtime Version: crio version 1.31.11-2.rhaos4.18.git65ec77a.el9
  • Kubernetes Distro and Version: OpenShift running kubernetes v1.31.11

Information to attach (optional if deemed irrelevant)

must-gather logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/rh-ecosystem-edge_nvidia-ci/268/pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.18-stable-nvidia-gpu-operator-e2e-25-3-x/1954619565199593472/artifacts/nvidia-gpu-operator-e2e-25-3-x/gpu-operator-e2e/artifacts/gpu-operator-tests-must-gather/gpu-must-gather/

nvidia-smi output from the driver container daemonset:

sh-5.1# nvidia-smi
Tue Aug 12 03:27:29 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   32C    P8             13W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
sh-5.1# 
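
The output above can be collected from the driver daemonset pod, for example as below (the container name nvidia-driver-ctr is assumed from the default driver daemonset; adjust if it differs):

$ oc exec -n nvidia-gpu-operator ds/nvidia-driver-daemonset-418.94.202508060022-0 \
    -c nvidia-driver-ctr -- nvidia-smi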

Additional notes:

Workaround that resolved this issue:

containers/podman#16101

Adding no-cgroups = true to the /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml file on the GPU worker node resolved the issue; the GPU stack now deploys successfully on OCP 4.18.22. The resulting [nvidia-container-cli] section:

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"
  no-cgroups = true
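
A sketch of one way to apply this change from the cluster, using oc debug (the node name is a placeholder, and the sed expression simply appends the key after the [nvidia-container-cli] section header, assuming it is not already set there). Afterwards, delete the failed validator pod so it is recreated:

$ oc debug node/<gpu-worker-node> -- chroot /host \
    sed -i '/^\[nvidia-container-cli\]/a no-cgroups = true' \
    /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
$ oc delete pod nvidia-operator-validator-kkpjf -n nvidia-gpu-operator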
