Commit 8e3a7b8

Merge pull request #284 from chaitanya1731/one-click
one-click: Add Gaudi one-click Provisioning
2 parents 9100491 + 67f0de6 commit 8e3a7b8

3 files changed: +192 -3 lines changed

gaudi/README.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 ## Overview
 [Habana AI Operator](https://catalog.redhat.com/software/container-stacks/detail/64342b3bcbfbb9a6588ce8dd) is used to provision Intel Gaudi Accelerator with OpenShift. The steps and yaml files mentioned in this document to provision the Gaudi accelerator are based on [HabanaAI Operator for OpenShift](https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/index.html).
 
-If you are familiar with the steps here to manually provision the accelerator, the Red Hat certified Operator and Ansible based [One-Click](/one_click/README.md) solution can be used as a reference to provision the accelerator automatically.
+If you are familiar with the steps here to manually provision the accelerator, the Red Hat certified Operator and Ansible based [One-Click](/one_click/README.md#reference-playbook-–-habana-gaudi-provisioning) solution can be used as a reference to provision the accelerator automatically.
 
 ## Prerequisities
 - To Provision RHOCP cluster, follow steps [here](/README.md#provisioning-rhocp-cluster).

one_click/README.md

Lines changed: 20 additions & 2 deletions
@@ -12,7 +12,6 @@ The referenced Ansible playbooks here can be used by the cluster administrators
 This playbook demonstrates the one-click provisioning of Intel Data Center GPU on an RHOCP cluster. The steps involved are installation and configuration of general Operators including Node Feature Discovery (NFD) operator, Kernel Module Management (KMM) operator, and the Intel Device Plugins Operator.
 
 ### Prerequisite
-
 Before running the playbook, ensure the following prerequisites are met:
 - Provisioned RHOCP Cluster
 - Red Hat Enterprise Linux (RHEL) system with [Ansible](https://docs.ansible.com/ansible/2.9/installation_guide/intro_installation.html#installing-ansible-on-rhel-centos-or-fedora) installed and configured with a `kubeconfig` to connect to your RHOCP cluster.
@@ -23,8 +22,27 @@ To run the ansible playbook, clone this repository to your RHEL system. Navigate
 $ git clone https://github.com/intel/intel-technology-enabling-for-openshift.git
 $ cd intel-technology-enabling-for-openshift/one_click
 ```
-
 Execute below single command to provision Intel Data Center GPU:
 ```
 $ ansible-playbook gpu_provisioning_playbook.yaml
+```
+
+## Reference Playbook – Habana Gaudi Provisioning
+This playbook demonstrates the one-click provisioning of Habana Gaudi Accelerator on an RHOCP cluster. The steps involved are installation and configuration of general Operators including Node Feature Discovery (NFD) operator, Kernel Module Management (KMM) operator, and the HabanaAI Operator. The playbook also creates the Gaudi `DeviceConfig` CR which deploys the Gaudi Out-of-Tree drivers, Gaudi device plugins, Habana container runtime and Habana node metrics.
+
+### Prerequisite
+Before running the playbook, ensure the following prerequisites are met:
+- Provisioned RHOCP Cluster
+- Red Hat Enterprise Linux (RHEL) system with [Ansible](https://docs.ansible.com/ansible/2.9/installation_guide/intro_installation.html#installing-ansible-on-rhel-centos-or-fedora) installed and configured with a `kubeconfig` to connect to your RHOCP cluster.
+- Set Firmware search path using MCO, follow [Update Kernel Firmware Search Path with MCO](/gaudi/README.md#update-kernel-firmware-search-path-with-mco).
+
+### Run the Playbook
+To run the ansible playbook, clone this repository to your RHEL system. Navigate to the directory containing the playbook.
+```
+$ git clone https://github.com/intel/intel-technology-enabling-for-openshift.git
+$ cd intel-technology-enabling-for-openshift/one_click
+```
+Execute below single command to provision Habana Gaudi Accelerator:
+```
+$ ansible-playbook gaudi_provisioning_playbook.yaml
 ```
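
The playbook added by this commit groups its tasks under tags (`install_dependencies`, `nfd`, `kmm`, `habana-ai`, `gaudi_resource_test`) and reads the kubeconfig location from the `kubeconfig_path` variable. As a minimal sketch, assuming the standard `ansible-playbook` options `--tags`, `--skip-tags` and `-e`, the Python helper below assembles command lines for running a single stage or pointing at a different kubeconfig. The function name `playbook_command` is hypothetical and not part of this commit; the helper only builds and prints the command and does not execute anything.

```python
import shlex

PLAYBOOK = "gaudi_provisioning_playbook.yaml"  # playbook name used in the README above


def playbook_command(tags=(), skip_tags=(), kubeconfig=None):
    """Build an ansible-playbook argument list (hypothetical helper, illustration only)."""
    cmd = ["ansible-playbook", PLAYBOOK]
    if tags:
        # --tags restricts the run to tasks carrying one of these tags
        cmd += ["--tags", ",".join(tags)]
    if skip_tags:
        # --skip-tags runs everything except tasks carrying these tags
        cmd += ["--skip-tags", ",".join(skip_tags)]
    if kubeconfig:
        # Extra vars (-e) take precedence over the play-level `vars` default
        cmd += ["-e", f"kubeconfig_path={kubeconfig}"]
    return cmd


# Install only the NFD and KMM operators
print(shlex.join(playbook_command(tags=["nfd", "kmm"])))

# Run everything except the final resource check, with an explicit kubeconfig
print(shlex.join(playbook_command(skip_tags=["gaudi_resource_test"],
                                  kubeconfig="/path/to/kubeconfig")))
```

Note that with `--tags`, untagged tasks (such as the NFD CR and `DeviceConfig` blocks in this playbook) are skipped, so a tag-limited run is best treated as a partial step rather than a full provisioning.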

one_click/gaudi_provisioning_playbook.yaml

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
+# Copyright (c) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+- hosts: localhost
+  vars:
+    kubeconfig_path: "~/.kube/config"
+  environment:
+    KUBECONFIG: "{{ kubeconfig_path }}"
+  tasks:
+    - name: Install Dependencies
+      tags:
+        - install_dependencies
+      block:
+        - name: NFD - Install Node Feature Discovery Operator
+          tags:
+            - nfd
+          block:
+            - name: NFD - Create openshift-nfd namespace
+              k8s:
+                name: openshift-nfd
+                api_version: v1
+                kind: Namespace
+                state: present
+                wait: yes
+            - name: NFD - Create an nfd-operator group v1
+              k8s:
+                definition:
+                  apiVersion: operators.coreos.com/v1
+                  kind: OperatorGroup
+                  metadata:
+                    generateName: openshift-nfd-
+                    name: openshift-nfd
+                    namespace: openshift-nfd
+                  spec:
+                    targetNamespaces:
+                      - openshift-nfd
+                wait: yes
+            - name: NFD - Create subscription for RH NFD operator
+              k8s:
+                definition:
+                  apiVersion: operators.coreos.com/v1alpha1
+                  kind: Subscription
+                  metadata:
+                    name: nfd
+                    namespace: openshift-nfd
+                  spec:
+                    channel: "stable"
+                    installPlanApproval: Automatic
+                    name: nfd
+                    source: redhat-operators
+                    sourceNamespace: openshift-marketplace
+                wait: yes
+                wait_condition:
+                  reason: AllCatalogSourcesHealthy
+                  type: CatalogSourcesUnhealthy
+                  status: 'False'
+            - name: NFD - Wait until the nfd-operator-controller Deployment is available
+              k8s_info:
+                kind: Deployment
+                wait: yes
+                name: nfd-controller-manager
+                namespace: openshift-nfd
+                wait_condition:
+                  type: Available
+                  status: 'True'
+                  reason: MinimumReplicasAvailable
+        - name: KMM - Install Kernel Module Management Operator
+          tags:
+            - kmm
+          block:
+            - name: KMM - Create openshift-kmm namespace
+              k8s:
+                name: openshift-kmm
+                api_version: v1
+                kind: Namespace
+                state: present
+                wait: yes
+            - name: KMM - Create OperatorGroup v1 in openshift-kmm namespace
+              k8s:
+                definition:
+                  apiVersion: operators.coreos.com/v1
+                  kind: OperatorGroup
+                  metadata:
+                    name: kernel-module-management
+                    namespace: openshift-kmm
+                wait: yes
+            - name: KMM - Create Subscription for KMM Operator
+              k8s:
+                definition:
+                  apiVersion: operators.coreos.com/v1alpha1
+                  kind: Subscription
+                  metadata:
+                    name: kernel-module-management
+                    namespace: openshift-kmm
+                  spec:
+                    channel: stable
+                    installPlanApproval: Automatic
+                    name: kernel-module-management
+                    source: redhat-operators
+                    sourceNamespace: openshift-marketplace
+                wait: yes
+                wait_condition:
+                  reason: AllCatalogSourcesHealthy
+                  type: CatalogSourcesUnhealthy
+                  status: 'False'
+            - name: KMM - Wait until the kmm-operator-controller Deployment is available
+              k8s_info:
+                kind: Deployment
+                wait: yes
+                name: kmm-operator-controller
+                namespace: openshift-kmm
+                wait_condition:
+                  type: Available
+                  status: 'True'
+                  reason: MinimumReplicasAvailable
+    - name: Install HabanaAI Operator
+      tags:
+        - habana-ai
+      block:
+        - name: Install HabanaAI Operator
+          k8s:
+            state: present
+            src: "../gaudi/gaudi_install_operator.yaml"
+            wait: yes
+        - name: Wait until the Habana controller-manager Deployment is available
+          k8s_info:
+            kind: Deployment
+            wait: yes
+            name: controller-manager
+            namespace: habana-ai-operator
+            wait_condition:
+              type: Available
+              status: 'True'
+              reason: MinimumReplicasAvailable
+    - name: NFD - Install NFD CRs
+      block:
+        - name: NFD - Create NFD discovery instance for Habana Gaudi
+          k8s:
+            state: present
+            src: "../gaudi/gaudi_nfd_instance_openshift.yaml"
+            wait: yes
+    - name: Install Habana Gaudi DeviceConfig CR
+      block:
+        - name: Create Habana Gaudi DeviceConfig
+          k8s:
+            state: present
+            src: "../gaudi/gaudi_device_config.yaml"
+            wait: yes
+    - name: Verify Habana Gaudi Resources
+      tags:
+        - gaudi_resource_test
+      block:
+        - name: Get Gaudi Node Resource Information
+          kubernetes.core.k8s_info:
+            api: v1
+            kind: Node
+            label_selectors:
+              - "kmm.node.kubernetes.io/habana-ai-operator.habana-ai-module.device-plugin-ready="
+              - "kmm.node.kubernetes.io/habana-ai-operator.habana-ai-module.ready="
+            wait: yes
+            wait_timeout: 120
+          register: cluster_nodes_info
+          until:
+            - cluster_nodes_info.resources is defined
+        - name: Print cluster resources
+          debug:
+            msg:
+              - "Please verify Capacity and Allocatable Habana Gaudi Resources on the node - "
+              - "Capacity: {
+                'habana.ai/gaudi': {{ cluster_nodes_info.resources[0].status.capacity['habana.ai/gaudi'] }},"
+              - "Allocatable Resources: {
+                'habana.ai/gaudi': {{ cluster_nodes_info.resources[0].status.allocatable['habana.ai/gaudi'] }},"
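
The final debug task reports the `habana.ai/gaudi` capacity and allocatable values of the first matching node only (`resources[0]`). As a hedged sketch of the same check across every node, the Python snippet below summarizes a node list in the shape returned by `oc get nodes -o json`. The function name `gaudi_resources` and the sample node data are made up for illustration and are not part of this commit.

```python
GAUDI_RESOURCE = "habana.ai/gaudi"


def gaudi_resources(node_list):
    """Map node name -> (capacity, allocatable) for nodes advertising Gaudi devices."""
    summary = {}
    for node in node_list.get("items", []):
        status = node.get("status", {})
        capacity = status.get("capacity", {}).get(GAUDI_RESOURCE)
        allocatable = status.get("allocatable", {}).get(GAUDI_RESOURCE)
        if capacity is None:
            continue  # node does not expose the Gaudi extended resource
        # Kubernetes serializes resource quantities as strings, e.g. "8"
        summary[node["metadata"]["name"]] = (int(capacity), int(allocatable or 0))
    return summary


# Made-up sample data in the shape of a Kubernetes NodeList (assumption)
sample = {
    "items": [
        {"metadata": {"name": "worker-0"},
         "status": {"capacity": {"habana.ai/gaudi": "8", "cpu": "160"},
                    "allocatable": {"habana.ai/gaudi": "8", "cpu": "159"}}},
        {"metadata": {"name": "worker-1"},
         "status": {"capacity": {"cpu": "64"},
                    "allocatable": {"cpu": "63"}}},
    ]
}

for name, (cap, alloc) in gaudi_resources(sample).items():
    print(f"{name}: capacity={cap} allocatable={alloc}")
```

On a live cluster, the input could instead come from `json.loads` applied to the output of `oc get nodes -o json`; nodes without the Gaudi resource are simply left out of the summary.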
