
Commit cd1da1f

Merge pull request #64 from chaitanya1731/doc2
kmmo: Updated Readme
2 parents c108682 + 1b90102 commit cd1da1f

File tree

1 file changed

+88
-52
lines changed


kmmo/README.md

Lines changed: 88 additions & 52 deletions
@@ -1,52 +1,88 @@
## KMMO

[Kernel Module Management (KMM) Operator](https://github.com/rh-ecosystem-edge/kernel-module-management) manages the deployment and lifecycle of out-of-tree kernel modules on OCP.

In this project, the KMM Operator is used to manage the day 2 deployment of Intel dGPU driver container images.

Intel dGPU driver container images are released from the [Intel Data Center GPU Driver for OpenShift Project](https://github.com/intel/intel-data-center-gpu-driver-for-openshift/tree/main/release#intel-data-center-gpu-driver-container-images-for-openshift-release).

### KMM Operator Working Mode

* Pre-build Mode: This is the default and recommended mode. KMMO uses pre-built, certified, and released driver container images from the Red Hat Ecosystem Catalog to deploy Intel dGPU drivers.

* On-premise Build Mode: In this mode, users build their own driver container image on-premise and then deploy it on the cluster.

### Managing Intel dGPU driver with KMM Operator

The operations below are verified on an OCP-4.12 bare metal cluster.

* Follow the [KMMO operator installation guide](https://docs.openshift.com/container-platform/4.12/hardware_enablement/kmm-kernel-module-management.html#kmm-install-using-web-console_kernel-module-management-operator) to install the operator on OCP.

* Intel dGPU driver Canary Deployment on OpenShift

Canary deployment is used by default to deploy the driver only on specific node(s). Before deploying the driver cluster wide, users get a chance to verify it on these canary nodes, which prevents a driver with potential issues from damaging the cluster.

Label the nodes you want to run the canary deployment on:

```$ oc label node dGPU_node_name intel.feature.node.kubernetes.io/dgpu-canary=true```

Note: `intel.feature.node.kubernetes.io/gpu=true` is labeled by NFD to show that an Intel dGPU card is detected on the node. See the [NFD README](/nfd/README.md).

* Using Pre-build Mode to deploy the driver

```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/kmmo/intel-dgpu.yaml```

* Deploy the driver on all the nodes cluster wide

If the driver is running properly on the canary nodes, you can deploy it cluster wide. Comment out the line `intel.feature.node.kubernetes.io/dgpu-canary: 'true'` in the intel-dgpu.yaml file and run:

```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/kmmo/intel-dgpu.yaml```

### Using On-premise Build Mode

Prior to using this mode, run the following commands to create a `ConfigMap` that includes the Dockerfile used to build the driver container image:

```$ git clone https://github.com/intel/intel-data-center-gpu-driver-for-openshift.git && cd intel-data-center-gpu-driver-for-openshift/docker```

```$ oc create -n openshift-kmm configmap intel-dgpu-dockerfile-configmap --from-file=dockerfile=intel-dgpu-driver.Dockerfile```

To use this mode, run the following command:

```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/kmmo/intel-dgpu-on-premise-build-mode.yaml```
# Setting up Out of Tree Drivers

# Introduction
[Kernel module management (KMM) operator](https://github.com/rh-ecosystem-edge/kernel-module-management) manages the deployment and lifecycle of out-of-tree kernel modules on RHOCP.

In this release, the KMM operator is used to manage and deploy the Intel® Data Center GPU driver container image on the RHOCP cluster.

Intel Data Center GPU driver container images are released from the [Intel Data Center GPU Driver for OpenShift Project](https://github.com/intel/intel-data-center-gpu-driver-for-openshift/tree/main/release#intel-data-center-gpu-driver-container-images-for-openshift-release).
# KMM operator working mode
- **Pre-build mode** - This is the default and recommended mode. The KMM operator uses the pre-built, certified Intel Data Center GPU driver container image, published on the Red Hat Container Catalog, to provision Intel Data Center GPUs on a RHOCP cluster.
- **On-premises build mode** - Users can optionally build and deploy their own driver container images on-premises through the KMM operator.
# Prerequisites
- Provisioned RHOCP 4.12 cluster. Follow the steps [here](/README.md#provisioning-rhocp-cluster).
- Set up node feature discovery. Follow the steps [here](/nfd/README.md).
- Set up machine configuration. Follow the steps [here](/machine_configuration/README.md).
# Install KMM operator
Follow the installation guide below to install the KMM operator via CLI or web console.
- [Install from CLI](https://docs.openshift.com/container-platform/4.12/hardware_enablement/kmm-kernel-module-management.html#kmm-install-using-cli_kernel-module-management-operator)
- [Install from web console](https://docs.openshift.com/container-platform/4.12/hardware_enablement/kmm-kernel-module-management.html#kmm-install-using-web-console_kernel-module-management-operator)
# Canary deployment with KMM
Canary deployment is enabled by default to deploy the driver container image only on specific node(s), ensuring the initial deployment succeeds prior to rollout to all the eligible nodes in the cluster. This safety mechanism can reduce risk and prevent a deployment from adversely affecting the entire cluster.
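Concretely, the canary gate is a node-selector label on the KMM `Module` resource. The excerpt below is a hypothetical sketch: the field layout follows the `kmm.sigs.x-k8s.io/v1beta1` Module CRD, and the exact contents of `intel-dgpu.yaml` may differ.

```yaml
# Hypothetical excerpt of a KMM Module spec with the canary selector.
# Field names follow the kmm.sigs.x-k8s.io/v1beta1 Module CRD; the
# actual intel-dgpu.yaml shipped by the project may differ.
apiVersion: kmm.sigs.x-k8s.io/v1beta1
kind: Module
metadata:
  name: intel-dgpu
  namespace: openshift-kmm
spec:
  selector:
    intel.feature.node.kubernetes.io/gpu: 'true'
    intel.feature.node.kubernetes.io/dgpu-canary: 'true'  # canary gate
```

With both labels present in `selector`, the driver is scheduled only onto GPU nodes that also carry the canary label.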
# Deploy Intel Data Center GPU Driver with pre-build mode
Follow the steps below to deploy the driver container image with pre-build mode.
1. Find all nodes with an Intel Data Center GPU card using the following command:
   ```
   $ oc get nodes -l intel.feature.node.kubernetes.io/gpu=true
   ```
   Example output:
   ```
   NAME         STATUS   ROLES    AGE   VERSION
   icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca
   ```
2. Label the node(s) in the cluster using the command shown below for the initial canary deployment.
   ```
   $ oc label node <node_name> intel.feature.node.kubernetes.io/dgpu-canary=true
   ```
3. Use pre-build mode to deploy the driver container.
   ```
   $ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/1.0.0/kmmo/intel-dgpu.yaml
   ```
4. After the driver is verified on the cluster through the canary deployment, simply remove the line shown below from the [`intel-dgpu.yaml`](/kmmo/intel-dgpu.yaml) file and reapply the YAML file to deploy the driver to the entire cluster. As a cluster administrator, you can also select another deployment policy.
   ```
   intel.feature.node.kubernetes.io/dgpu-canary: 'true'
   ```
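The remove-and-reapply step can also be done without keeping a local copy of the file. The pipeline below is a sketch rather than a documented workflow; it assumes the canary selector sits on its own line in the published manifest.

```shell
# Sketch: fetch the Module manifest, drop the canary selector line,
# and reapply so the driver rolls out to all eligible GPU nodes.
# (Assumes the selector occupies its own line in the manifest.)
curl -sL https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/1.0.0/kmmo/intel-dgpu.yaml \
  | grep -v 'intel.feature.node.kubernetes.io/dgpu-canary' \
  | oc apply -f -
```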
# Verification
To verify that the drivers have been loaded, follow the steps below:
1. List the nodes labeled with `kmm.node.kubernetes.io/intel-dgpu.ready` using the command shown below:
   ```
   $ oc get nodes -l kmm.node.kubernetes.io/intel-dgpu.ready
   ```
   Example output:
   ```
   NAME         STATUS   ROLES    AGE   VERSION
   icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca
   ```
   The label shown above indicates that the KMM operator has successfully deployed the drivers and firmware on the node.
2. If you want to further debug the driver on the node, follow these steps:

   a. Navigate to the web console (Compute -> Nodes -> select a node that has the GPU card -> Terminal).

   b. Run the commands shown below in the web console terminal:
   ```
   $ chroot /host
   $ lsmod | grep i915
   ```
   Ensure `i915` and `intel_vsec` are loaded in the kernel, as shown in the output below:
   ```
   i915              3633152  0
   i915_compat         16384  1 i915
   intel_vsec          16384  1 i915
   intel_gtt           20480  1 i915
   video               49152  1 i915
   i2c_algo_bit        16384  1 i915
   drm_kms_helper     290816  1 i915
   drm                589824  3 drm_kms_helper,i915
   dmabuf              77824  4 drm_kms_helper,i915,i915_compat,dr
   ```
   c. Run `dmesg` to ensure there are no errors in the kernel message log.
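The checks in steps b and c can be combined into a small script. This is a minimal sketch, assuming it runs on the node after `chroot /host`; the grep pattern for kernel messages is an assumption, not part of the verified procedure.

```shell
# Sketch: verify the Intel GPU driver modules and scan the kernel log.
# Assumes execution on the node (e.g. inside `chroot /host`).

# Succeeds only if both i915 and intel_vsec appear in lsmod-style input.
check_modules() {
    awk '$1 == "i915" { a = 1 } $1 == "intel_vsec" { b = 1 } END { exit !(a && b) }'
}

if lsmod | check_modules; then
    echo "i915 and intel_vsec are loaded"
else
    echo "missing i915 or intel_vsec" >&2
fi

# Surface error/warning-level kernel messages mentioning the driver
# (pattern is an assumption; adjust as needed).
dmesg --level=err,warn 2>/dev/null | grep -iE 'i915|drm' || echo "no i915/drm errors logged"
```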
# See Also
