# Setting up Out of Tree Drivers

# Introduction
[Kernel module management (KMM) operator](https://github.com/rh-ecosystem-edge/kernel-module-management) manages the deployment and lifecycle of out-of-tree kernel modules on Red Hat OpenShift Container Platform (RHOCP).

In this release, the KMM operator is used to manage and deploy the Intel® Data Center GPU driver container image on the RHOCP cluster.

Intel Data Center GPU driver container images are released from the [Intel Data Center GPU Driver for OpenShift project](https://github.com/intel/intel-data-center-gpu-driver-for-openshift/tree/main/release#intel-data-center-gpu-driver-container-images-for-openshift-release).

# KMM operator working mode
- **Pre-build mode** - This is the default and recommended mode. The KMM operator uses the pre-built and certified Intel Data Center GPU driver container image, published in the Red Hat Container Catalog, to provision Intel Data Center GPUs on an RHOCP cluster.
- **On-premises build mode** - Users can optionally build and deploy their own driver container images on-premises through the KMM operator, as sketched below.

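The on-premises flow mirrors the instructions published by the driver project: clone the repository, expose the dockerfile to KMM as a `ConfigMap`, and apply the on-premises build mode resource. A sketch, assuming the `openshift-kmm` namespace and the `1.0.0` tag used elsewhere in this guide:
```
$ git clone https://github.com/intel/intel-data-center-gpu-driver-for-openshift.git \
    && cd intel-data-center-gpu-driver-for-openshift/docker

$ oc create -n openshift-kmm configmap intel-dgpu-dockerfile-configmap \
    --from-file=dockerfile=intel-dgpu-driver.Dockerfile

$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/1.0.0/kmmo/intel-dgpu-on-premise-build-mode.yaml
```
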
# Prerequisites
- Provisioned RHOCP 4.12 cluster. Follow the steps [here](/README.md#provisioning-rhocp-cluster).
- Set up node feature discovery. Follow the steps [here](/nfd/README.md).
- Set up machine configuration. Follow the steps [here](/machine_configuration/README.md).

# Install KMM operator
Follow the installation guide below to install the KMM operator via the CLI or the web console; a minimal sketch of the CLI route follows the list.
- [Install from CLI](https://docs.openshift.com/container-platform/4.12/hardware_enablement/kmm-kernel-module-management.html#kmm-install-using-cli_kernel-module-management-operator)
- [Install from web console](https://docs.openshift.com/container-platform/4.12/hardware_enablement/kmm-kernel-module-management.html#kmm-install-using-web-console_kernel-module-management-operator)

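The CLI route follows the usual OLM pattern: a `Namespace`, an `OperatorGroup`, and a `Subscription`. The sketch below is indicative only; the channel name is an assumption, so confirm it against the documentation for your RHOCP version:
```
$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-kmm
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: kernel-module-management
  namespace: openshift-kmm
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: kernel-module-management
  namespace: openshift-kmm
spec:
  channel: release-1.0   # assumed channel; check the catalog for your cluster version
  name: kernel-module-management
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```
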
# Canary deployment with KMM
Canary deployment is enabled by default to deploy the driver container image only on specific node(s), to ensure the initial deployment succeeds before rollout to all the eligible nodes in the cluster. This safety mechanism reduces risk and prevents a faulty driver deployment from adversely affecting the entire cluster.

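Concretely, the canary gate is a node selector on the KMM `Module` resource: [`intel-dgpu.yaml`](/kmmo/intel-dgpu.yaml) selects nodes by the NFD GPU label plus the canary label. An illustrative excerpt (field values are indicative, not a verbatim copy of the shipped file):
```
apiVersion: kmm.sigs.x-k8s.io/v1beta1
kind: Module
metadata:
  name: intel-dgpu
  namespace: openshift-kmm
spec:
  selector:
    # NFD label: the node has an Intel Data Center GPU
    intel.feature.node.kubernetes.io/gpu: 'true'
    # Canary gate: remove this line to roll out cluster-wide
    intel.feature.node.kubernetes.io/dgpu-canary: 'true'
```
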
# Deploy Intel Data Center GPU Driver with pre-build mode
Follow the steps below to deploy the driver container image with pre-build mode.
1. Find all nodes with an Intel Data Center GPU card using the following command:
```
$ oc get nodes -l intel.feature.node.kubernetes.io/gpu=true
```
Example output:
```
NAME         STATUS   ROLES    AGE   VERSION
icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca
```

2. Label the node(s) in the cluster using the command shown below for the initial canary deployment:
```
$ oc label node <node_name> intel.feature.node.kubernetes.io/dgpu-canary=true
```

3. Use pre-build mode to deploy the driver container:
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/1.0.0/kmmo/intel-dgpu.yaml
```

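Deployment progress can be tracked through the `Module` resource and its loader pods; a sketch, assuming the `Module` name `intel-dgpu` and the `openshift-kmm` namespace used above:
```
$ oc get module intel-dgpu -n openshift-kmm
$ oc get pods -n openshift-kmm
```
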
4. After the driver is verified on the cluster through the canary deployment, remove the line shown below from the [`intel-dgpu.yaml`](/kmmo/intel-dgpu.yaml) file (it is the canary selector shown in the excerpt above) and reapply the yaml file to deploy the driver to the entire cluster. As a cluster administrator, you can also select another deployment policy.
```
intel.feature.node.kubernetes.io/dgpu-canary: 'true'
```

# Verification
To verify that the drivers have been loaded, follow the steps below:
1. List the nodes labeled with `kmm.node.kubernetes.io/intel-dgpu.ready` using the command shown below:
```
$ oc get nodes -l kmm.node.kubernetes.io/intel-dgpu.ready
```
Example output:
```
NAME         STATUS   ROLES    AGE   VERSION
icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca
```
The label shown above indicates that the KMM operator has successfully deployed the drivers and firmware on the node.

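If the label has not appeared, the KMM module loader pods are a useful first place to look. A sketch, assuming the `Module` lives in the `openshift-kmm` namespace; `<module-loader-pod>` is a placeholder for a pod name reported by the first command:
```
$ oc get pods -n openshift-kmm
$ oc logs -n openshift-kmm <module-loader-pod>
```
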
2. If you want to further debug the driver on the node, follow these steps:
   a. Navigate to the web console (Compute -> Nodes -> Select a node that has the GPU card -> Terminal).
   b. Run the commands shown below in the web console terminal:
   ```
   $ chroot /host
   $ lsmod | grep i915
   ```
   Ensure `i915` and `intel_vsec` are loaded in the kernel, as shown in the output below:
   ```
   i915              3633152  0
   i915_compat         16384  1 i915
   intel_vsec          16384  1 i915
   intel_gtt           20480  1 i915
   video               49152  1 i915
   i2c_algo_bit        16384  1 i915
   drm_kms_helper     290816  1 i915
   drm                589824  3 drm_kms_helper,i915
   dmabuf              77824  4 drm_kms_helper,i915,i915_compat,dr
   ```
   c. Run `dmesg` to ensure there are no errors in the kernel message log.
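   For example, one way to narrow the log to the GPU driver messages:
   ```
   $ dmesg | grep -i i915
   ```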

# See Also