Commit ca1be89

Merge pull request #63 from chaitanya1731/doc1
machine_configuration: Updated readme
2 parents 0389fc1 + c6e4813 commit ca1be89

machine_configuration/README.md

Lines changed: 92 additions & 51 deletions
@@ -1,51 +1,92 @@
1-
## Machine Configuration
2-
3-
Machine Configuration is used to configure the RHCOS on each node in a OCP cluster.
4-
5-
[Machine Configuration Operator](https://github.com/openshift/machine-config-operator) (MCO) is provided by Red Hat to handle machine configuration.
6-
7-
### Prerequisites:
8-
9-
Make sure NFD operator has been installed and configured properly
10-
11-
Please refer to [instructions](/nfd/README.md#steps-to-install-and-configure-nfd-operator-on-ocp-cluster) to install and configure NFD operator on OCP Cluster.
12-
13-
### General Configuration for Provisioning Intel Hardware Features
14-
* Set up an alternative firmware path for the Cluster
15-
16-
The following command sets `/var/lib/firmware` as an alternative firmware path for the RHCOS kernel to load OOT firmware, because the default firmware directory `/lib/firmware` is mounted read-only on an OCP cluster.
17-
18-
```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/machine_configuration/100-alternative-fw-path-for-worker-nodes.yaml```
19-
20-
Note: This command might trigger nodes rebooting in the worker machine configuration pool. It is not preferred by this project. A better solution is in the plan.
21-
22-
### Machine Configuration for Provisioning Intel dGPU
23-
* Create dGPU machine config pool:
24-
25-
Below command creates a custom machine config pool for worker nodes with Intel dGPU card. These nodes are labeled `intel.feature.node.kubernetes.io/gpu: 'true'` by NFD.
26-
27-
```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/machine_configuration/intel-dgpu-machine-config-pool.yaml```
28-
29-
* Disable in-tree ast driver
30-
31-
```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/machine_configuration/100-intel-dgpu-machine-config-disable-ast.yaml```
32-
33-
Because of a known incompatibility with the Intel i915 driver, this command blacklists (disables loading of) the ast driver on worker nodes in the Intel dGPU machine config pool. This is needed prior to loading Intel dGPU drivers and firmware.
34-
35-
Note: This known incompatibility will be fixed when upgrading to Intel data center GPU driver 1.0.0, which uses the RHEL-9.x i915 driver.
36-
37-
### Machine Configuration for Provisioning Intel QAT
38-
39-
* Create QAT machine config pool:
40-
41-
Following command creates a custom machine config pool for worker nodes with Intel QAT feature. These nodes are labeled `intel.feature.node.kubernetes.io/qat: 'true'` by NFD.
42-
43-
```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/machine_configuration/intel-qat-machine-config-pool.yaml```
44-
45-
* Turn on intel_iommu kernel parameter for QAT
46-
47-
For QAT, intel_iommu is a kernel parameter that needs to be turned on using the following command.
48-
49-
```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/machine_configuration/100-intel-qat-intel-iommu-on.yaml```
50-
51-
Note: This will reboot the worker nodes in the Intel QAT machine config pool one by one to turn on the intel_iommu kernel parameter. Rebooting the node is not preferred. A better solution is planned.
1+
# Setting up Machine Configuration
2+
3+
# Introduction
4+
Machine configuration is used to configure [Red Hat Enterprise Linux CoreOS (RHCOS)](https://docs.openshift.com/container-platform/4.12/architecture/architecture-rhcos.html) on each node in an RHOCP cluster.
5+
6+
[Machine config operator](https://github.com/openshift/machine-config-operator) (MCO) is provided by Red Hat to manage the operating system and machine configuration. In this project, cluster administrators use the MCO to configure and update the kernel to provision Intel hardware features on the worker nodes.
7+
8+
MCO is one of the technologies used in this project to manage the machine configuration. In the current RHOCP 4.12 release, the MCO might reboot the node to apply a machine configuration. Since rebooting the node is undesirable, alternative machine configuration technologies are under investigation. For more details, see this [issue](https://github.com/intel/intel-technology-enabling-for-openshift/issues/34).
9+
10+
The best approach is to work with the RHCOS team to push the RHCOS configuration as the default configuration for a RHOCP cluster on [Day 0](https://www.ibm.com/cloud/architecture/content/course/red-hat-openshift-container-platform-day-2-ops/).
11+
12+
For some general configuration, we recommend you set it up while provisioning the cluster on [Day 1](https://www.ibm.com/cloud/architecture/content/course/red-hat-openshift-container-platform-day-2-ops/).
13+
14+
If the configuration cannot be set as the default setting, we recommend using an operator to set the configuration on the fly, without rebooting the node, on [Day 2](https://www.ibm.com/cloud/architecture/content/course/red-hat-openshift-container-platform-day-2-ops/).
15+
16+
Any contribution in this area is welcome.
17+
18+
# Prerequisites
19+
- Provisioned RHOCP 4.12 cluster. Follow steps [here](/README.md#provisioning-rhocp-cluster).
20+
- Set up node feature discovery (NFD). Follow steps [here](/nfd/README.md).
21+
22+
# General configuration
23+
24+
## Set up an alternative firmware path for the cluster
25+
The command below sets `/var/lib/firmware` as the alternative firmware path since the default firmware path `/lib/firmware` is read-only on RHCOS.
26+
```
27+
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/release-1.0.0/machine_configuration/100-alternative-fw-path-for-worker-nodes.yaml
28+
```
29+
**Note**: This command reboots all the worker nodes sequentially. To avoid a reboot on Day 2, the cluster administrator can perform this step on Day 1.
30+
31+
## Verification
32+
### From the web console
33+
Navigate to the Compute -> MachineConfigPools section to check the status and make sure the MachineConfigPool update is complete. The status will update from `Updating` to `Up to date`.
34+
35+
### From CLI
36+
```
37+
$ oc get mcp
38+
```
39+
Output:
40+
```
41+
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT
42+
worker rendered-worker-20c8785dee44f52d159fa1c04eeb8552 True False False 1
43+
```
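The check on the CLI output above can also be scripted. The following is a minimal sketch, not part of this project: `mcp_is_updated` is a hypothetical helper name, and it assumes the first five columns of `oc get mcp` are laid out as in the output shown above.

```shell
# Hypothetical helper: reads `oc get mcp` output on stdin and succeeds
# only if the named pool reports UPDATED=True, UPDATING=False and
# DEGRADED=False (columns 3, 4 and 5 in the output shown above).
mcp_is_updated() {
    awk -v pool="$1" '
        $1 == pool && $3 == "True" && $4 == "False" && $5 == "False" { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example use on a cluster (not executed here):
#   oc get mcp | mcp_is_updated worker && echo "worker pool is up to date"
```

The helper only inspects text, so it can be tried against saved output before running it on a live cluster.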
44+
45+
## Verify the alternative firmware path
46+
Navigate to the node terminal on the web console (Compute -> Nodes -> Select a node -> Terminal). Run the following command in the terminal:
47+
```
48+
$ cat /proc/cmdline
49+
```
50+
Ensure `firmware_class.path=/var/lib/firmware` is present.
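To avoid reading the kernel command line by eye, the check can be wrapped in a small function. This is a minimal sketch; `has_fw_path` is a hypothetical name, and the function matches the parameter as a whole space-separated word.

```shell
# Hypothetical helper: succeeds if the kernel command line passed as the
# first argument contains firmware_class.path=/var/lib/firmware as a
# complete, space-separated parameter.
has_fw_path() {
    expected="firmware_class.path=/var/lib/firmware"
    case " $1 " in
        *" $expected "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use in the node terminal (not executed here):
#   has_fw_path "$(cat /proc/cmdline)" && echo "alternative firmware path is set"
```

Matching the whole word means a different path that merely starts with `/var/lib/firmware` is not accepted.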
51+
52+
# Machine configuration for Intel® Data Center GPU
53+
## Create `intel-dgpu` MachineConfigPool
54+
The command shown below creates a custom `intel-dgpu` MachineConfigPool for worker nodes with an Intel Data Center GPU card, which is labeled with `intel.feature.node.kubernetes.io/gpu: 'true'` by [NFD](/nfd/README.md).
55+
```
56+
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/release-1.0.0/machine_configuration/intel-dgpu-machine-config-pool.yaml
57+
```
58+
59+
## Verification
60+
### From the web console
61+
Navigate to the Compute -> MachineConfigPools section and ensure `intel-dgpu` MachineConfigPool is present.
62+
### From the CLI
63+
```
64+
$ oc get mcp
65+
```
66+
Output:
67+
```
68+
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT
69+
intel-dgpu rendered-intel-dgpu-58fb5f4d72fe6041abb066880e112acd True False False 1
70+
```
71+
Ensure `intel-dgpu` MachineConfigPool is present.
72+
73+
# Disable conflicting driver
74+
Run the command shown below to disable the loading of a potentially conflicting driver, such as the `ast` driver.
75+
76+
**Note**: The `i915` driver depends on a ported `drm` module. Other drivers that depend on the in-tree `drm` module, such as `ast`, might have compatibility issues. This known issue will be resolved in the `i915` driver for RHEL `9.x`, which will be used for RHOCP `4.13`.
77+
```
78+
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/release-1.0.0/machine_configuration/100-intel-dgpu-machine-config-disable-ast.yaml
79+
```
80+
**Note**: This command will reboot the worker nodes in the `intel-dgpu` MachineConfigPool sequentially.
81+
82+
## Verification
83+
Navigate to the node terminal on the web console (Compute -> Nodes -> Select a node -> Terminal). Run the following commands in the terminal.
84+
```
85+
$ chroot /host
86+
$ lsmod | grep ast
87+
```
88+
Ensure that the `ast` driver is not loaded.
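Because `grep ast` also matches any module whose name merely contains `ast`, an exact match on the module-name column is stricter. The following is a minimal sketch; `module_loaded` is a hypothetical helper name, and it assumes the usual `lsmod` layout with a header line and the module name in the first column.

```shell
# Hypothetical helper: reads `lsmod` output on stdin and succeeds only if
# a module with exactly the given name appears in the first column.
# The header line (NR == 1) is skipped.
module_loaded() {
    awk -v mod="$1" '
        NR > 1 && $1 == mod { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example use in the node terminal after `chroot /host` (not executed here):
#   lsmod | module_loaded ast || echo "ast driver is not loaded"
```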
89+
90+
# See Also
91+
- [Firmware Search Path](https://docs.kernel.org/driver-api/firmware/fw_search_path.html)
92+
- [Red Hat OpenShift Container Platform Day-2 operations](https://www.ibm.com/cloud/architecture/content/course/red-hat-openshift-container-platform-day-2-ops/)
