# Setting up Machine Configuration

# Introduction
Machine configuration is used to configure [Red Hat Enterprise Linux CoreOS (RHCOS)](https://docs.openshift.com/container-platform/4.12/architecture/architecture-rhcos.html) on each node in an RHOCP cluster.

The [Machine Config Operator](https://github.com/openshift/machine-config-operator) (MCO) is provided by Red Hat to manage the operating system and machine configuration. In this project, cluster administrators use the MCO to configure and update the kernel so that Intel hardware features can be provisioned on the worker nodes.

MCO is one of the technologies used in this project to manage the machine configuration. In the current OCP 4.12, MCO might reboot a node to apply the machine configuration. Since rebooting the node is undesirable, alternative machine configuration technologies are under investigation. For more details, see this [issue](https://github.com/intel/intel-technology-enabling-for-openshift/issues/34).
| 9 | + |
| 10 | +The best approach is to work with the RHCOS team to push the RHCOS configuration as the default configuration for a RHOCP cluster on [Day 0](https://www.ibm.com/cloud/architecture/content/course/red-hat-openshift-container-platform-day-2-ops/). |
| 11 | + |
| 12 | +For some general configuration, we recommend you set it up while provisioning the cluster on [Day 1](https://www.ibm.com/cloud/architecture/content/course/red-hat-openshift-container-platform-day-2-ops/). |
| 13 | + |
| 14 | +If the configuration cannot be set as the default setting, we recommend using some operator to set the configuration on the fly without rebooting the node on [Day 2](https://www.ibm.com/cloud/architecture/content/course/red-hat-openshift-container-platform-day-2-ops/). |
| 15 | + |
| 16 | +Any contribution in this area is welcome. |
| 17 | + |

# Prerequisites
- Provisioned RHOCP 4.12 cluster. Follow the steps [here](/README.md#provisioning-rhocp-cluster).
- Set up Node Feature Discovery (NFD). Follow the steps [here](/nfd/README.md).

# General configuration

## Set up an alternative firmware path for the cluster
The command below sets `/var/lib/firmware` as the alternative firmware path, since the default firmware path `/lib/firmware` is read-only on RHCOS.
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/release-1.0.0/machine_configuration/100-alternative-fw-path-for-worker-nodes.yaml
```
**Note**: This command reboots all the worker nodes sequentially. To avoid a reboot on Day 2, the cluster administrator can perform this step on Day 1.
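
The authoritative manifest is maintained in the repository; conceptually, it is a `MachineConfig` that appends a kernel argument to every node in the worker pool. The sketch below is illustrative only (the metadata name and label are assumptions, not the repository's exact contents):
```
# Illustrative sketch of a MachineConfig setting the firmware search path.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker   # targets the worker pool
  name: 100-alternative-fw-path-for-worker-nodes
spec:
  kernelArguments:
    - firmware_class.path=/var/lib/firmware   # kernel loads OOT firmware from here
```
The MCO renders this into the worker pool's configuration, which is why the argument later shows up in `/proc/cmdline`.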

## Verification
### From the web console
Navigate to the Compute -> MachineConfigPools section to check the status and make sure the MachineConfigPool update is complete. The status will change from `Updating` to `Up to date`.

### From the CLI
```
$ oc get mcp
```
Output:
```
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT
worker   rendered-worker-20c8785dee44f52d159fa1c04eeb8552   True      False      False      1
```

## Verify the alternative firmware path
Navigate to the node terminal on the web console (Compute -> Nodes -> Select a node -> Terminal). Run the following command in the terminal:
```
$ cat /proc/cmdline
```
Ensure `firmware_class.path=/var/lib/firmware` is present.

# Machine configuration for Intel® Data Center GPU
## Create the `intel-dgpu` MachineConfigPool
The command below creates a custom `intel-dgpu` MachineConfigPool for worker nodes with an Intel Data Center GPU card, which are labeled with `intel.feature.node.kubernetes.io/gpu: 'true'` by [NFD](/nfd/README.md).
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/release-1.0.0/machine_configuration/intel-dgpu-machine-config-pool.yaml
```
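
A custom MachineConfigPool of this kind ties a node selector to a set of machine configs. The sketch below shows the general shape such a pool might take; the selector values are assumptions for illustration, and the repository manifest is authoritative:
```
# Illustrative sketch of a custom MachineConfigPool keyed off the NFD label.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: intel-dgpu
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, intel-dgpu]   # inherit worker configs plus pool-specific ones
  nodeSelector:
    matchLabels:
      intel.feature.node.kubernetes.io/gpu: 'true'   # label applied by NFD
```
Nodes carrying the NFD label join this pool, so GPU-specific machine configs can be rolled out without touching the rest of the worker fleet.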

## Verification
### From the web console
Navigate to the Compute -> MachineConfigPools section and ensure the `intel-dgpu` MachineConfigPool is present.
### From the CLI
```
$ oc get mcp
```
Output:
```
NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT
intel-dgpu   rendered-intel-dgpu-58fb5f4d72fe6041abb066880e112acd   True      False      False      1
```
Ensure the `intel-dgpu` MachineConfigPool is listed.

## Disable the conflicting driver
Run the command below to disable the loading of a potentially conflicting driver, such as the `ast` driver.

**Note**: The `i915` driver depends on a ported `drm` module, while other drivers, such as `ast`, depend on the in-tree `drm` module and might therefore have compatibility issues. This known issue will be resolved with the i915 driver for RHEL `9.x`, which will be used for RHOCP `4.13`.
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/release-1.0.0/machine_configuration/100-intel-dgpu-machine-config-disable-ast.yaml
```
**Note**: This command reboots the worker nodes in the `intel-dgpu` MachineConfigPool sequentially.
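
Blacklisting a kernel module through the MCO typically means writing a `modprobe.d` drop-in file via an Ignition snippet. The sketch below shows one plausible shape for such a manifest; the file path, name, and labels are illustrative assumptions, and the repository manifest is authoritative:
```
# Illustrative sketch of a MachineConfig that blacklists the ast module.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: intel-dgpu   # targets the dGPU pool
  name: 100-intel-dgpu-machine-config-disable-ast
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/modprobe.d/ast-blacklist.conf
          mode: 420        # decimal for 0644
          overwrite: true
          contents:
            source: data:,blacklist%20ast   # URL-encoded "blacklist ast"
```
Because the config carries the `intel-dgpu` role label, only nodes in that pool pick it up and reboot.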

## Verification
Navigate to the node terminal on the web console (Compute -> Nodes -> Select a node -> Terminal) and run the following commands in the terminal:
```
$ chroot /host
$ lsmod | grep ast
```
Ensure that the `ast` driver is not loaded; the `lsmod` command should return no output.

# See Also
- [Firmware Search Path](https://docs.kernel.org/driver-api/firmware/fw_search_path.html)
- [Red Hat OpenShift Container Platform Day-2 operations](https://www.ibm.com/cloud/architecture/content/course/red-hat-openshift-container-platform-day-2-ops/)