Commit 1f579c3

Merge pull request #311 from chaitanya1731/gaudi_networking
tests: Added Gaudi HCCL Demo L2 Test Case
2 parents b1ee673 + 2fd580c commit 1f579c3

3 files changed: 107 additions, 0 deletions

tests/gaudi/l2/README.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# Verify Intel® Gaudi® AI Accelerator Provisioning

## HCCL
The HCCL (Habana Collective Communication Library) demo is a program that demonstrates HCCL usage and supports communication via Gaudi-based scale-out or Host NIC scale-out. Refer to [HCCL Demo](https://github.com/HabanaAI/hccl_demo/tree/main?tab=readme-ov-file#hccl-demo) for more details.

Build the workload container image:
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/hccl_build.yaml
```
Deploy and execute the workload:
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/hccl_job.yaml
```

Verify the output:
```
$ oc get pods
NAME                         READY   STATUS      RESTARTS   AGE
hccl-demo-workload-1-build   0/1     Completed   0          23m
hccl-demo-workload-wq8mx     0/1     Completed   0          10m
```
```
$ oc logs hccl-demo-workload-wq8mx
Affinity: Numa mapping directory: /tmp/affinity_topology_output
Affinity: Script has not been executed before, going to execute...
Affinity: Script has finished successfully
Welcome to HCCL demo
.
.
.
####################################################################################################
[BENCHMARK] hcclAllReduce(src!=dst, data_size=33554432, count=8388608, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 258.209121 GB/s
[BENCHMARK]     Algo Bandwidth : 147.548069 GB/s
####################################################################################################
```
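As a sanity check on the sample log, the following minimal sketch parses the `[BENCHMARK]` lines and verifies that the numbers are internally consistent. The function name `parse_benchmark` is hypothetical and not part of hccl_demo. The factor 2(n-1)/n relating the two bandwidth figures is the usual ring all-reduce convention; that hccl_demo defines "NW Bandwidth" this way is an assumption, although it matches the figures shown here for 8 ranks.

```python
import re

# [BENCHMARK] lines copied from the sample log output.
LOG = """\
[BENCHMARK] hcclAllReduce(src!=dst, data_size=33554432, count=8388608, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 258.209121 GB/s
[BENCHMARK]     Algo Bandwidth : 147.548069 GB/s
"""


def parse_benchmark(log: str) -> dict:
    """Extract data_size, count, and both bandwidth figures from log text."""
    header = re.search(r"data_size=(\d+), count=(\d+)", log)
    nw = re.search(r"NW Bandwidth\s*:\s*([\d.]+) GB/s", log)
    algo = re.search(r"Algo Bandwidth\s*:\s*([\d.]+) GB/s", log)
    if not (header and nw and algo):
        raise ValueError("benchmark lines not found in log")
    return {
        "data_size": int(header.group(1)),
        "count": int(header.group(2)),
        "nw_gbps": float(nw.group(1)),
        "algo_gbps": float(algo.group(1)),
    }


result = parse_benchmark(LOG)

# "--size 32m" in the Job arguments corresponds to 32 * 1024 * 1024 bytes.
size_matches = result["data_size"] == 32 * 1024 * 1024

# dtype=float is 4 bytes per element, so count * 4 should equal data_size.
count_matches = result["count"] * 4 == result["data_size"]

# Assumption: NW bandwidth = algo bandwidth * 2(n-1)/n for n ranks.
n_ranks = 8
expected_nw = result["algo_gbps"] * 2 * (n_ranks - 1) / n_ranks
bandwidth_matches = abs(expected_nw - result["nw_gbps"]) < 1e-5

print(size_matches, count_matches, bandwidth_matches)
```

A check like this only confirms that the log is self-consistent. It says nothing about whether the absolute bandwidth is what a given cluster should achieve.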

tests/gaudi/l2/hccl_build.yaml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: hccl-demo-workload
  namespace: hccl-demo
---
kind: BuildConfig
apiVersion: build.openshift.io/v1
metadata:
  name: hccl-demo-workload
  namespace: hccl-demo
spec:
  output:
    to:
      kind: ImageStreamTag
      name: 'hccl-demo-workload:latest'
  strategy:
    type: Docker
  source:
    type: Dockerfile
    dockerfile: |
      ARG BUILDER=vault.habana.ai/gaudi-docker/1.17.1/rhel9.4/habanalabs/pytorch-installer-2.3.1:1.17.1-40
      FROM ${BUILDER} AS builder

      WORKDIR /
      RUN git clone https://github.com/HabanaAI/hccl_demo.git \
          && cd hccl_demo \
          && make

      WORKDIR /
      RUN git clone https://github.com/HabanaAI/hccl_ofi_wrapper.git \
          && export LIBFABRIC_ROOT=/opt/habanalabs/libfabric-1.20.0 \
          && cd hccl_ofi_wrapper \
          && make \
          && cp libhccl_ofi_wrapper.so /usr/lib/habanalabs/libhccl_ofi_wrapper.so \
          && ldconfig \
          && export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/habanalabs/

      WORKDIR /hccl_demo
  triggers:
    - type: ConfigChange
  runPolicy: Serial
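One detail of the embedded Dockerfile is worth illustrating: each `RUN` instruction executes in its own shell, so a variable set with `export` is visible only to commands chained within that same `RUN`. The sketch below imitates this with separate `sh` processes. It is an analogy rather than a container build, and `DEMO_LIB_PATH` and `run_step` are hypothetical names chosen for the illustration.

```python
import subprocess


def run_step(script: str) -> str:
    """Run a script in a fresh `sh` process, much as each RUN gets its own shell."""
    done = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return done.stdout.strip()


# Step 1: export a variable. The value lives only in this shell process.
run_step("export DEMO_LIB_PATH=/usr/lib/habanalabs/")

# Step 2: a new shell does not see the variable from step 1.
later = run_step('echo "${DEMO_LIB_PATH:-unset}"')

# Within a single step, commands chained with && do see the export,
# which is why LIBFABRIC_ROOT is visible to `make` in the Dockerfile.
same = run_step('export DEMO_LIB_PATH=/usr/lib/habanalabs/ && echo "$DEMO_LIB_PATH"')

print(later)
print(same)
```

By the same reasoning, the trailing `export LD_LIBRARY_PATH=...` in the second `RUN` does not carry over into the running container. A Dockerfile `ENV` instruction, or an entry in the Job's `env` list, would persist it. Whether this workload needs the variable at run time, given the preceding `ldconfig` call, is not established by the source.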

tests/gaudi/l2/hccl_job.yaml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: hccl-demo-workload
  namespace: hccl-demo
spec:
  template:
    metadata:
    spec:
      restartPolicy: Never
      serviceAccountName: hccl-demo-anyuid-sa
      containers:
        - name: hccl-demo-workload
          image: image-registry.openshift-image-registry.svc:5000/hccl-demo/hccl-demo-workload:latest
          workingDir: "/hccl_demo"
          command: ["/bin/bash", "-c", "--"]
          ## sleep for 20 seconds to avoid race condition
          args:
            - |
              sleep 20
              python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32m --test all_reduce --loop 1000 --ranks_per_node 8
              sleep 20
          env:
            - name: HCCL_COMM_ID
              value: '127.0.0.1:5555'
          resources:
            limits:
              habana.ai/gaudi: 8
          imagePullPolicy: IfNotPresent
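For automated checks, the Job's result can be read from its status conditions instead of the pod list. The sketch below classifies a `batch/v1` Job object such as the one returned by `oc get job hccl-demo-workload -n hccl-demo -o json`. The helper name `job_outcome` and the sample JSON are hypothetical and trimmed to the fields the function reads.

```python
import json


def job_outcome(job: dict) -> str:
    """Classify a batch/v1 Job object as 'succeeded', 'failed', or 'running'."""
    status = job.get("status", {})
    for cond in status.get("conditions") or []:
        # A condition only counts when its status is the string "True".
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "succeeded"
        if cond.get("type") == "Failed":
            return "failed"
    return "running"


# Hypothetical, trimmed sample of the JSON for a finished Job.
SAMPLE = """
{
  "kind": "Job",
  "metadata": {"name": "hccl-demo-workload", "namespace": "hccl-demo"},
  "status": {
    "succeeded": 1,
    "conditions": [{"type": "Complete", "status": "True"}]
  }
}
"""

outcome = job_outcome(json.loads(SAMPLE))
pending = job_outcome({"status": {"active": 1}})
print(outcome, pending)
```

Reading conditions avoids depending on the generated pod name suffix, which differs on every run.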
