Commit 9100491

Merge pull request #282 from chaitanya1731/playbook
gaudi: Added Gaudi Provisioning on OpenShift details
2 parents e7fc676 + ebe1241 commit 9100491

5 files changed: 217 additions, 0 deletions

gaudi/README.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Setting up HabanaAI Operator

## Overview
The [Habana AI Operator](https://catalog.redhat.com/software/container-stacks/detail/64342b3bcbfbb9a6588ce8dd) provisions Intel Gaudi accelerators on OpenShift. The steps and YAML files in this document are based on [HabanaAI Operator for OpenShift](https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/index.html).

Once you are familiar with the manual provisioning steps described here, you can use the Red Hat certified Operator and the Ansible-based [One-Click](/one_click/README.md) solution as a reference for provisioning the accelerator automatically.

## Prerequisites
- To provision an RHOCP cluster, follow the steps [here](/README.md#provisioning-rhocp-cluster).
- To install the NFD Operator, follow the steps [here](/nfd/README.md#install-nfd-operator).
- To install the KMM Operator, follow the steps [here](/kmmo/README.md#install-kmm-operator).

## Update Kernel Firmware Search Path with MCO
**Note:** This step reboots the nodes, so it is recommended to perform it first.

The default kernel firmware search path `/lib/firmware` in RHCOS is not writable. Use the command below to add `/var/lib/firmware` to the firmware search path list:
```
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_firmware_path.yaml
```

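Because the MachineConfig triggers a rolling reboot, it can help to wait for the rollout before continuing. The following is a minimal sketch, not part of the original instructions. It assumes the pool is named `worker` (matching the role label in `gaudi_firmware_path.yaml`); the function names are hypothetical. Note that the pool may still report `Updated` for a short time right after the `oc apply`, so re-check if the command returns immediately.

```shell
# Sketch: block until the worker MachineConfigPool reports the Updated condition.
# Assumption: the pool is called "worker"; the default timeout of 30m is arbitrary.
wait_for_worker_pool() {
  timeout="${1:-30m}"
  oc wait machineconfigpool/worker --for=condition=Updated --timeout="$timeout"
}

# Sketch: print the kernel command line of one node so that the
# firmware_class.path argument can be checked by eye.
show_node_cmdline() {
  node="$1"
  oc debug "node/$node" -- chroot /host cat /proc/cmdline
}
```

For example, run `wait_for_worker_pool 45m` and then `show_node_cmdline <node-name>`, and look for `firmware_class.path=/var/lib/firmware` in the output.
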
## Label Gaudi Accelerator Nodes With NFD
The NFD operator can be used to configure NFD to automatically detect the Gaudi accelerators and label the nodes for the following provisioning steps.
```
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_nfd_instance_openshift.yaml
```
Verify that NFD has labelled the node correctly:
```
oc get no -o json | jq '.items[].metadata.labels' | grep pci-1da3

"feature.node.kubernetes.io/pci-1da3.present": "true",
```
NFD detects the underlying Gaudi accelerator using its PCI device class and vendor ID (`1da3` is the Habana Labs PCI vendor ID).

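If `jq` is not available, the same check can be done with a label selector. This is a small sketch under the assumption that the label key and value are exactly as shown in the output above; the function name is hypothetical.

```shell
# Sketch: list the nodes that carry the Gaudi PCI vendor label set by NFD.
# Assumption: the label is feature.node.kubernetes.io/pci-1da3.present=true.
gaudi_nodes() {
  oc get nodes -l feature.node.kubernetes.io/pci-1da3.present=true -o name
}
```

Calling `gaudi_nodes` prints one `node/<name>` line per labelled node; empty output suggests that NFD has not labelled any node yet.
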
## Install HabanaAI Operator on Red Hat OpenShift
### Installation via web console
Follow the steps below to install the HabanaAI Operator using the OpenShift web console:
1. In the OpenShift web console, navigate to **Operators** -> **OperatorHub**.
2. Search for **HabanaAI Operator** in the all items field -> Click **Install**.
### Verify Installation via web console
1. Go to **Operators** -> **Installed Operators**.
2. Verify that the status of the operator is **Succeeded**.

### Installation via Command Line Interface (CLI)
```
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_install_operator.yaml
```

### Verify Installation via CLI
Verify that the operator controller manager pod is up and running:
```
oc get pods -n habana-ai-operator

NAME                                  READY   STATUS    RESTARTS   AGE
controller-manager-6c8459d9cb-fqs8h   2/2     Running   0          25m
```

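The CLI counterpart of the **Succeeded** status shown in the web console is the phase of the operator's ClusterServiceVersion. The sketch below is an optional extra check, assuming the operator is installed in the `habana-ai-operator` namespace as in `gaudi_install_operator.yaml`; the function name is hypothetical.

```shell
# Sketch: list the ClusterServiceVersions in the operator namespace.
# The PHASE column of the output is expected to read "Succeeded" once
# the installation has finished.
# Assumption: the namespace defaults to habana-ai-operator.
operator_csv_status() {
  ns="${1:-habana-ai-operator}"
  oc get csv -n "$ns"
}
```

Run `operator_csv_status` and check the `PHASE` column of the HabanaAI Operator entry.
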
## Creating Habana AI Operator DeviceConfig Instance
To create a Habana Gaudi device plugin CR, follow the steps below.

### Create CR via web console
1. Go to **Operators** -> **Installed Operators**.
2. Open **HabanaAI Operator**.
3. Navigate to the **Device Config** tab.
4. Click **Create DeviceConfig** -> set the correct parameters -> Click **Create**. For the parameter values, refer to [Using RedHat OpenShift Console](https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Deploying_HabanaAI_Operator.html#id2).

### Verify via web console
1. Verify the CR by checking the status of **Workloads** -> **DaemonSet** -> **habana-ai-module-device-plugin-xxxxx**.
2. The `DeviceConfig` is now created.

### Create CR via CLI
Apply the CR yaml file:
```
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_device_config.yaml
```

### Verify the DeviceConfig CR is created
Use the command below to verify that the `DeviceConfig` CR has been created:
```
oc get pod -n habana-ai-operator

NAME                                         READY   STATUS    RESTARTS   AGE
controller-manager-6586758d54-qw644          2/2     Running   0          5d5h
habana-ai-habana-runtime-bqpvp               1/1     Running   0          5d6h
habana-ai-module-device-plugin-pljkf-kxgdj   1/1     Running   0          5d6h
habana-ai-node-metrics-rghlr                 1/1     Running   0          5d6h
```
Alternatively, check the status of the `DeviceConfig` CR as shown below:
```
oc describe deviceconfig habana-ai -n habana-ai-operator

Name:         habana-ai
Namespace:    habana-ai-operator
.
.
Status:
  Conditions:
    Last Transition Time:  2024-07-24T14:05:11Z
    Message:               All resources have been successfully reconciled
    Reason:                Reconciled
    Status:                True
```
## Verify Gaudi Provisioning
After the `DeviceConfig` CR is created, it takes some time for the operator to download the Gaudi out-of-tree (OOT) driver source code and build it on-premises with the help of the KMM operator. The OOT driver module binaries are then loaded into the RHCOS kernel on each node with Gaudi cards labelled by NFD. After that, the Gaudi device plugin can advertise the Gaudi resources listed in the table below for pods on OpenShift to use. Run the command below to check the availability of Gaudi resources:
```
oc describe node | grep habana.ai/gaudi

habana.ai/gaudi: 8 -> number of Gaudi cards on the cluster (Capacity)
habana.ai/gaudi: 8 -> number of allocatable Gaudi cards on the cluster (Allocatable)
habana.ai/gaudi 4 4 -> number of Gaudi cards currently allocated to pods (Requests and Limits columns)
```

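On clusters with several nodes, a per-node view can be easier to read than the `grep` output. The following sketch uses the `custom-columns` output format of `oc get`; the function name and the column headers `NODE` and `GAUDI` are hypothetical choices.

```shell
# Sketch: print one row per node with its allocatable Gaudi card count.
# The dot in "habana.ai" is escaped so that it is read as part of the
# resource name rather than as a path separator.
gaudi_allocatable_per_node() {
  oc get nodes -o custom-columns='NODE:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi'
}
```

Nodes without Gaudi cards are expected to show `<none>` in the `GAUDI` column.
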
To view the metrics on a node with a Gaudi card, refer to [Collecting Metrics](https://docs.habana.ai/en/latest/Orchestration/Prometheus_Metric_Exporter.html?highlight=metrics#collecting-metrics).

## Resources Provided by Habana Gaudi Device Plugin
The resources provided are the user interface for customers to claim and consume the hardware features from user pods. See the table below for details:

| Feature | Resources | Description |
| ------- | --------- | ----------- |
| Habana Gaudi | `habana.ai/gaudi` | Number of Habana Gaudi Card resources ready to claim |
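
To illustrate how a pod claims the `habana.ai/gaudi` resource, the sketch below prints a minimal Pod manifest that requests one card. It is an illustration only, not a manifest from this repository: the function name and the pod name `gaudi-smoke-test` are hypothetical, the container image must be supplied by the caller, and the `hl-smi` command assumes that the chosen image ships that tool.

```shell
# Sketch: print a Pod manifest that claims one Gaudi card.
# Assumptions: the caller passes a Gaudi-enabled image that contains hl-smi;
# the pod and container names are placeholders.
gaudi_test_pod_manifest() {
  image="${1:-}"
  if [ -z "$image" ]; then
    echo "usage: gaudi_test_pod_manifest IMAGE" >&2
    return 1
  fi
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: gaudi-smoke-test
      image: $image
      command: ["hl-smi"]
      resources:
        limits:
          habana.ai/gaudi: 1
EOF
}
```

The manifest can be piped into the cluster with `gaudi_test_pod_manifest <image> | oc apply -f -`. Extended resources such as `habana.ai/gaudi` are set under `limits`; Kubernetes then uses the same value for the request.
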

gaudi/gaudi_device_config.yaml

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Copyright (c) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Adapted from https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Deploying_HabanaAI_Operator.html#id3
#
apiVersion: habana.ai/v1
kind: DeviceConfig
metadata:
  name: habana-ai
  namespace: habana-ai-operator
spec:
  devicePlugin:
    image: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
    version: 1.15.1
  driver:
    image: image-registry.openshift-image-registry.svc:5000/habana-ai-operator/habana-ai-driver
    version: 1.15.1-15
  habanaRuntime:
    image: vault.habana.ai/habana-ocp-operator/1.15.1/habana-runtime
    version: 1.15.1-15
  nodeMetrics:
    image: vault.habana.ai/gaudi-metric-exporter/metric-exporter
    version: 1.15.1-15

gaudi/gaudi_firmware_path.yaml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Copyright (c) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Adapted from https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Environment_Setup.html#installing-intel-gaudi-firmware
#
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: firmware-path
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - 'firmware_class.path=/var/lib/firmware'

gaudi/gaudi_install_operator.yaml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# Copyright (c) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Adapted from https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Deploying_HabanaAI_Operator.html#using-cli
#
---
apiVersion: v1
kind: Namespace
metadata:
  name: habana-ai-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: habana-ai-operator
  namespace: habana-ai-operator
spec:
  targetNamespaces:
    - habana-ai-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: habana-ai-operator
  namespace: habana-ai-operator
spec:
  channel: stable
  installPlanApproval: Automatic
  name: habana-ai-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace

gaudi/gaudi_nfd_instance_openshift.yaml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# Copyright (c) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Adapted from https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Environment_Setup.html#id2
#
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  extraLabelNs:
    - habana.ai
  instance: ''
  operand:
    image: >-
      registry.redhat.io/openshift4/ose-node-feature-discovery@sha256:edd2adfdf423d6a1eb7e8c1e388d9cf5fbc829e7e66c7bc955e9b2a6f50d1a47
    servicePort: 12000
  topologyupdater: false
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "0200"
            - "03"
            - "12"
          deviceLabelFields:
            - "vendor"
