Commit 1e3bb58

Merge pull request #354 from chaitanya1731/gaudi-updates
gaudi: Updated README and provisioning steps for v1.19
2 parents b39aaf5 + bf7c9ab commit 1e3bb58

5 files changed

+146
-115
lines changed

gaudi/README.md

Lines changed: 26 additions & 47 deletions
@@ -1,41 +1,18 @@
-# Setting up Intel Gaudi Base Operator
+# Setting up Intel Gaudi AI Accelerator Operator
 
 ## Overview
-[Intel Gaudi Base Operator](https://catalog.redhat.com/software/container-stacks/detail/6683b2cce45daa25e36bddcb) is used to provision Intel Gaudi Accelerator with OpenShift. The steps and yaml files mentioned in this document to provision the Gaudi accelerator are based on [Intel Gaudi Base Operator for OpenShift](https://docs.habana.ai/en/latest/Orchestration/Intel_Gaudi_Base_Operator/index.html).
+[Intel Gaudi AI Accelerator Operator](https://catalog.redhat.com/software/container-stacks/detail/6683b2cce45daa25e36bddcb) is used to provision Intel Gaudi Accelerator with OpenShift. The steps and yaml files mentioned in this document to provision the Gaudi accelerator are based on [Intel Gaudi AI Accelerator Operator for OpenShift](https://docs.habana.ai/en/latest/Orchestration/Intel_Gaudi_Base_Operator/index.html).
 
 If you are familiar with the steps here to manually provision the accelerator, the Red Hat certified Operator and Ansible based [One-Click](/one_click/README.md#reference-playbook-–-habana-gaudi-provisioning) solution can be used as a reference to provision the accelerator automatically.
 
 ## Prerequisites
 - To Provision RHOCP cluster, follow steps [here](/README.md#provisioning-rhocp-cluster).
-- To Install NFD Operator, follow steps [here](/nfd/README.md#install-nfd-operator).
-- To Install KMM Operator, follow steps [here](/kmmo/README.md#install-kmm-operator).
 
-## Update Kernel Firmware Search Path with MCO
-**Note:** This step will reboot the nodes, it is recommended to do this in the first step.
-
-The default kernel firmware search path `/lib/firmware` in RHCOS is not writable. Command below can be used to add path `/var/lib/firmware` into the firmware search path list.
-```
-oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_firmware_path.yaml
-```
-
-## Label Gaudi Accelerator Nodes With NFD
-NFD operator can be used to configure NFD to automatically detect the Gaudi accelerators and label the nodes for the following provisioning steps.
-```
-oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_nfd_instance_openshift.yaml
-```
-Verify NFD has labelled the node correctly:
-```
-oc get no -o json | jq '.items[].metadata.labels' | grep pci-1da3
-
-"feature.node.kubernetes.io/pci-1da3.present": "true",
-```
-NFD detects underlying Gaudi Accelerator using its PCI device class and the vendor ID.
-
-## Install Intel Gaudi Base Operator on Red Hat OpenShift
+## Install Intel Gaudi AI Accelerator Operator on Red Hat OpenShift
 ### Installation via web console
-Follow the steps below to install Intel Gaudi Base Operator using OpenShift web console:
+Follow the steps below to install Intel Gaudi AI Accelerator Operator using OpenShift web console:
 1. In the OpenShift web console, navigate to **Operator** -> **OperatorHub**.
-2. Search for **Intel Gaudi Base Operator** in all items field -> Click **Install**.
+2. Search for **Intel Gaudi AI Accelerator Operator** in all items field -> Click **Install**.
 ### Verify Installation via web console
 1. Go to **Operator** -> **Installed Operators**.
 2. Verify that the status of the operator is **Succeeded**.
@@ -54,53 +31,55 @@ NAME READY STATUS RESTARTS AGE
 controller-manager-6c8459d9cb-fqs8h 2/2 Running 0 25m
 ```
 
-## Creating Intel Gaudi Base Operator DeviceConfig Instance
+## Creating Intel Gaudi AI Accelerator Operator ClusterPolicy Instance
 To create a Habana Gaudi device plugin CR, follow the steps below.
 
 ### Create CR via web console
 1. Go to **Operator** -> **Installed Operators**.
-2. Open **Intel Gaudi Base Operator**.
-3. Navigate to tab **Device Config**.
-4. Click **Create DeviceConfig** -> set correct parameters -> Click **Create**. To set correct parameters please refer [Using RedHat OpenShift Console](https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Intel_Gaudi_Base_Operator/Deploying_Intel_Gaudi_Base_Operator.html?highlight=openshift#id2).
+2. Open **Intel Gaudi AI Accelerator Operator**.
+3. Navigate to tab **Cluster Policy**.
+4. Click **Create ClusterPolicy** -> set correct parameters -> Click **Create**. To set correct parameters please refer [Using RedHat OpenShift Container Platform Console](https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Kubernetes_Installation/Kubernetes_Operator.html#id1).
 
 ### Verify via web console
-1. Verify CR by checking the status of **Workloads** -> **DaemonSet** -> **habana-ai-module-device-plugin-xxxxx**.
-2. Now `DeviceConfig` is created.
+1. Verify CR by checking the status of **Workloads** -> **DaemonSet** -> **habana-ai-device-plugin-ds**, **habana-ai-driver-rhel-9-4-xxxxx**, **habana-ai-feature-discovery-ds**, **habana-ai-metric-exporter-ds**, **habana-ai-runtime-ds**.
+2. Now `ClusterPolicy` is created.
 
 ### Create CR via CLI
 Apply the CR yaml file:
 ```
-oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_device_config.yaml
+oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_cluster_policy.yaml
 ```
 
-### Verify the DeviceConfig CR is created
-You can use command below to verify that the `DeviceConfig` CR has been created:
+### Verify the ClusterPolicy CR is created
+You can use command below to verify that the `ClusterPolicy` CR has been created:
 ```
 oc get pod -n habana-ai-operator
 
-NAME                                         READY   STATUS    RESTARTS   AGE
-controller-manager-6586758d54-qw644          2/2     Running   0          5d5h
-habana-ai-habana-runtime-bqpvp               1/1     Running   0          5d6h
-habana-ai-module-device-plugin-pljkf-kxgdj   1/1     Running   0          5d6h
-habana-ai-node-metrics-rghlr                 1/1     Running   0          5d6h
+NAME                                                       READY   STATUS    RESTARTS   AGE
+habana-ai-device-plugin-ds-thj7b                           1/1     Running   0          10d
+habana-ai-driver-rhel-9-4-416-94-202412170927-0-ds-vqhzb   1/1     Running   2          10d
+habana-ai-feature-discovery-ds-ztl2j                       1/1     Running   5          10d
+habana-ai-metric-exporter-ds-g5qqh                         1/1     Running   0          10d
+habana-ai-operator-controller-manager-6c995b5646-wl7cp     2/2     Running   0          10d
+habana-ai-runtime-ds-x49lf                                 1/1     Running   0          10d
 ```
-Alternatively, you can also check the status of the `DeviceConfig` CR like below:
+Alternatively, you can also check the status of the `ClusterPolicy` CR like below:
 ```
-oc describe deviceconfig habana-ai -n habana-ai-operator
+oc describe ClusterPolicy habana-ai -n habana-ai-operator
 
 Name: habana-ai
 Namespace: habana-ai-operator
 .
 .
 Status:
 Conditions:
-Last Transition Time: 2024-07-24T14:05:11Z
+Last Transition Time: 2025-01-21T18:50:46Z
 Message: All resources have been successfully reconciled
 Reason: Reconciled
 Status: True
 ```
 ## Verify Gaudi Provisioning
-After the `DeviceConfig` instance CR is created, it will take some time for the operator to download the Gaudi OOT driver source code and build it on-premise with the help of the KMM operator. The OOT driver module binaries will be loaded into the RHCOS kernel on each node with Gaudi cards labelled by NFD. Then, the Gaudi device plugin can advertise the Gaudi resources listed in the table for the pods on OpenShift to use. Run the command below to check the availability of Gaudi resources:
+After the `ClusterPolicy` instance CR is created, it will take some time for the operator to download the Gaudi OOT driver source code and build it on-premise with the help of the KMM operator. The OOT driver module binaries will be loaded into the RHCOS kernel on each node with Gaudi cards labelled by feature discovery. Then, the Gaudi device plugin can advertise the Gaudi resources listed in the table for the pods on OpenShift to use. Run the command below to check the availability of Gaudi resources:
 ```
 oc describe node | grep habana.ai/gaudi
 
@@ -119,4 +98,4 @@ The resources provided are the user interface for customers to claim and consume
 | Habana Gaudi | `habana.ai/gaudi` | Number of Habana Gaudi Card resources ready to claim |
 
 ## Upgrade Intel Gaudi SPI Firmware
-Refer [Upgrade Intel Gaudi SPI Firmware](/gaudi/Gaudi-SPI-Firmware-Upgrade.md) to upgrade the SPI Firmware on Intel Gaudi.
+Refer [Upgrade Intel Gaudi SPI Firmware](/gaudi/Gaudi-SPI-Firmware-Upgrade.md) to upgrade the SPI Firmware on Intel Gaudi.
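
The pod listing in the updated "Verify the ClusterPolicy CR is created" step can also be checked mechanically rather than by eye. The following is a minimal sketch, not part of this commit: `not_ready_pods` is a hypothetical helper name, and it assumes the default `oc get pod` column order (NAME, READY, STATUS, RESTARTS, AGE) shown in the README.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not part of the commit):
# read `oc get pod` output on stdin and print the name of every pod
# that is not in the Running state or whose READY count is incomplete.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk '
    $1 == "NAME" { next }            # skip the header row
    NF >= 3 {
      split($2, ready, "/")          # READY looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }
  '
}

# Example usage against a live cluster (requires `oc` and cluster access):
#   oc get pod -n habana-ai-operator | not_ready_pods
```

An empty result suggests that every pod in the namespace is running with all containers ready. Pods in other terminal states, such as `Completed`, would also be listed, which may or may not be what you want.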

gaudi/gaudi_cluster_policy.yaml

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
+# Copyright (c) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+# Adapted from https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Kubernetes_Installation/Kubernetes_Operator.html#id2
+#
+apiVersion: habanalabs.habana.ai/v1
+kind: ClusterPolicy
+metadata:
+  name: habana-ai
+spec:
+  image_registry: vault.habana.ai
+  driver:
+    driver_loader:
+      images:
+        ubuntu_22.04:
+          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
+          tag: 1.19.1-26
+        rhel_8.6:
+          repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
+          tag: 1.19.1-26
+        rhel_9.2:
+          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
+          tag: 1.19.1-26
+        rhel_9.4:
+          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.4/driver-installer
+          tag: 1.19.1-26
+        tencentos_3.1:
+          repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
+          tag: 1.19.1-26
+      resources:
+        limits:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+        requests:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+      repo_server: vault.habana.ai
+      repo_path: artifactory/gaudi-installer/repos
+      mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
+      mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
+      hugepages: hugepages_number_int_optional
+      external_ports: turn_on_external_port_bool_optional
+      firmware_flush: flush_firmware_on_the_gaudi_cards_bool_optional
+    driver_runner:
+      image:
+        repository: vault.habana.ai/habana-ai-operator/driver/rhel9.4/driver-installer
+        tag: 1.19.1-26
+      resources:
+        limits:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+        requests:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+  device_plugin:
+    image:
+      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
+      tag: 1.19.1
+    resources:
+      limits:
+        cpu: cpu_str_or_int_optional
+        memory: memory_str_optional
+      requests:
+        cpu: cpu_str_or_int_optional
+        memory: memory_str_optional
+  runtime:
+    runner:
+      image:
+        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
+        tag: 1.19.1-26
+      resources:
+        limits:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+        requests:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+    configuration:
+      container_engine: one_of_containerd_docker_crio
+      engine_container_runtime_configuration: container_engine_configuration_optional
+      habana_container_runtime_configuration: container_runtime_configuration_optional
+  metric_exporter:
+    runner:
+      image:
+        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
+        tag: 1.19.1-26
+      resources:
+        limits:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+        requests:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+    port: 41611
+    interval: 20
+  feature_discovery:
+    runner:
+      image:
+        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
+        tag: 1.19.1-26
+      resources:
+        limits:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+        requests:
+          cpu: cpu_str_or_int_optional
+          memory: memory_str_optional
+    nfd_plugin: boolean_nfd_installed
+  bmc_monitoring:
+    image:
+      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
+      tag: 1.19.1-26
+    resources:
+      limits:
+        cpu: cpu_str_or_int_optional
+        memory: memory_str_optional
+      requests:
+        cpu: cpu_str_or_int_optional
+        memory: memory_str_optional
+  node_selector:
+    key_optional: value_optional
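
Several values in this manifest (for example `cpu_str_or_int_optional`, `boolean_nfd_installed` and `one_of_containerd_docker_crio`) appear to be template placeholders carried over from the upstream example rather than real settings. A reviewer may want to list them before applying the file. The sketch below is an assumption-laden illustration, not part of the commit: `find_placeholders` is a hypothetical helper, and the pattern only covers the placeholder spellings visible in this file.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not part of the commit):
# print, with line numbers, every "key: value" line in a manifest whose
# value still looks like one of the template placeholders seen in
# gaudi_cluster_policy.yaml. The pattern is an assumption based only on
# the placeholder names visible in that file.
find_placeholders() {
  grep -nE ':[[:space:]]*([a-z_]+_optional|boolean_nfd_installed|one_of_containerd_docker_crio)[[:space:]]*$' "$1"
}

# Example usage on a local copy of the manifest:
#   find_placeholders gaudi_cluster_policy.yaml
```

The function exits non-zero when nothing matches, following `grep` conventions, so it can serve as a simple guard in a script. Whether a given placeholder should be replaced or the key removed depends on the operator's documentation.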

gaudi/gaudi_device_config.yaml

Lines changed: 0 additions & 22 deletions
This file was deleted.

gaudi/gaudi_firmware_path.yaml

Lines changed: 0 additions & 16 deletions
This file was deleted.

gaudi/gaudi_nfd_instance_openshift.yaml

Lines changed: 0 additions & 30 deletions
This file was deleted.
