
Commit 8d5c358

Merge pull request #329 from vbedida79/patch-301024-1
tests_gaudi: Added L2 vllm workload
2 parents de5b511 + dd2a16c commit 8d5c358

File tree

4 files changed

+313
-0
lines changed


tests/gaudi/l2/README.md

Lines changed: 101 additions & 0 deletions
@@ -75,3 +75,104 @@ Welcome to HCCL demo
[BENCHMARK] Algo Bandwidth : 147.548069 GB/s
####################################################################################################
```
## vLLM

vLLM is a serving engine for LLMs. The following workload deploys a vLLM server with an LLM on Intel Gaudi. Refer to the [Intel Gaudi vLLM fork](https://github.com/HabanaAI/vllm-fork.git) for more details.

Build the workload container image:
```
$ git clone https://github.com/HabanaAI/vllm-fork.git --branch v1.18.0

$ cd vllm-fork/

$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/vllm_buildconfig.yaml

$ oc start-build vllm-workload --from-dir=./ --follow
```
Check that the build has completed:
```
$ oc get builds
NAMESPACE            NAME              TYPE      FROM         STATUS     STARTED          DURATION
gaudi-validation     vllm-workload-1   Docker    Dockerfile   Complete   7 minutes ago    4m58s
```
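If the build is still running or has failed, its logs can be inspected for troubleshooting (a minimal sketch; the build name follows the ```<buildconfig>-<number>``` pattern shown above):
```
$ oc logs -f build/vllm-workload-1 -n gaudi-validation
```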
Deploy the workload:
* Update the Hugging Face token in the ```vllm_hf_secret.yaml``` file; refer to the [Hugging Face token documentation](https://huggingface.co/docs/hub/en/security-tokens) for more details.
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/vllm_hf_secret.yaml
```
The meta-llama/Llama-3.1-8B model is used in this deployment; the Hugging Face token is required to access such gated models.
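Alternatively, the secret can be created directly from the CLI, which base64-encodes the value automatically (a sketch; the token placeholder must be replaced with your own token):
```
$ oc create secret generic hf-token --from-literal=hf-token=<YOUR_HF_TOKEN> -n gaudi-validation
```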
* For the PV setup with NFS, refer to the [OpenShift persistent storage documentation](https://docs.openshift.com/container-platform/4.17/storage/persistent_storage/persistent-storage-nfs.html); a minimal example is sketched below.
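A minimal sketch of a statically provisioned NFS PersistentVolume that the ```vllm-workload-pvc``` claim can bind to; the server address and export path are placeholders and must be replaced with your own NFS setup:
```
$ oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vllm-workload-pv
spec:
  capacity:
    storage: 60Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.10   # placeholder: your NFS server
    path: /exports/vllm    # placeholder: your exported path
EOF
```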
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/vllm_deployment.yaml
```
Create the vLLM service:
```
$ oc expose deploy/vllm-workload
```
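The service is of type ClusterIP, so it is only reachable from inside the cluster. For a quick check from a workstation, the port can be forwarded locally (a sketch; run the curl from a second terminal, and the local port choice is arbitrary):
```
$ oc port-forward svc/vllm-workload 8000:8000 -n gaudi-validation
$ curl http://localhost:8000/v1/models
```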
Verify the output:
```
$ oc get pods
NAME                             READY   STATUS      RESTARTS   AGE
vllm-workload-1-build            0/1     Completed   0          19m
vllm-workload-55f7c6cb7b-cwj2b   1/1     Running     0          8m36s

$ oc get svc
NAME            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
vllm-workload   ClusterIP   1.2.3.4      <none>        8000/TCP   114s
```
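Model download and HPU warmup can take several minutes; one way to wait for the deployment to report ready (a sketch, with an arbitrary timeout) is:
```
$ oc wait --for=condition=Available deployment/vllm-workload -n gaudi-validation --timeout=600s
```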
```
$ oc logs vllm-workload-55f7c6cb7b-cwj2b

INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_MIN=32 (default:min)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_STEP=32 (default:step)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_MAX=256 (default:max)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:min)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:step)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:max)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:min)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:step)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:max)
INFO 10-30 19:35:53 habana_model_runner.py:691] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 10-30 19:35:53 habana_model_runner.py:696] Decode bucket config (min, step, max_warmup) bs:[32, 32, 256], block:[128, 128, 4096]
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056371848 KB
------------------------------------------------------------------------------
INFO 10-30 19:35:56 selector.py:85] Using HabanaAttention backend.
INFO 10-30 19:35:56 loader.py:284] Loading weights on hpu ...
INFO 10-30 19:35:56 weight_utils.py:224] Using model weights format ['*.safetensors', '*.bin', '*.pt']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:03<00:11,  3.87s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:07<00:07,  3.71s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:10<00:03,  3.59s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00,  2.49s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00,  2.93s/it]
```
Run inference requests using the service URL.
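The requests below are issued from a shell inside the cluster. One way to get such a shell is a temporary pod (a sketch; the pod name and image are illustrative):
```
$ oc run curl-test -n gaudi-validation -it --rm --image=registry.access.redhat.com/ubi9/ubi -- bash
```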
```
sh-5.1# curl "http://vllm-workload.gaudi-validation.svc.cluster.local:8000/v1/models"
{"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B","object":"model","created":1730317412,"owned_by":"vllm","root":"meta-llama/Llama-3.1-8B","parent":null,"max_model_len":131072,"permission":[{"id":"modelperm-452b2bd990834aa5a9416d083fcc4c9e","object":"model_permission","created":1730317412,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
```
```
sh-5.1# curl http://vllm-workload.gaudi-validation.svc.cluster.local:8000/v1/completions -H "Content-Type: application/json" -d '{
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "A constellation is a",
        "max_tokens": 10
    }'
{"id":"cmpl-9a0442d0da67411081837a3a32a354f2","object":"text_completion","created":1730321284,"model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":" group of individual stars that forms a pattern or figure","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":15,"completion_tokens":10}}
```
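The completions endpoint also supports streaming via the OpenAI-compatible ```stream``` parameter; a sketch of a streaming request (the response arrives as a series of data chunks rather than a single JSON body):
```
sh-5.1# curl http://vllm-workload.gaudi-validation.svc.cluster.local:8000/v1/completions -H "Content-Type: application/json" -d '{
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "A constellation is a",
        "max_tokens": 10,
        "stream": true
    }'
```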

tests/gaudi/l2/vllm_buildconfig.yaml

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# Copyright (c) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: vllm-workload
  namespace: gaudi-validation
spec: {}
---
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: vllm-workload
  namespace: gaudi-validation
spec:
  triggers:
    - type: "ConfigChange"
    - type: "ImageChange"
  runPolicy: "Serial"
  source:
    type: Dockerfile
    dockerfile: |
      ARG BASE_IMAGE=vault.habana.ai/gaudi-docker/1.18.0/rhel9.4/habanalabs/pytorch-installer-2.4.0:1.18.0-524
      FROM ${BASE_IMAGE} as habana-base

      USER root

      ENV VLLM_TARGET_DEVICE="hpu"
      ENV HABANA_SOFTWARE_VERSION="1.18.0-524"

      RUN dnf -y update --best --allowerasing --skip-broken && dnf clean all

      WORKDIR /workspace

      ## Python Installer #################################################################
      FROM habana-base as python-install

      ARG PYTHON_VERSION=3.11

      ENV VIRTUAL_ENV=/opt/vllm
      ENV PATH="$VIRTUAL_ENV/bin:$PATH"
      RUN dnf install -y --setopt=install_weak_deps=0 --nodocs \
          python${PYTHON_VERSION}-wheel && \
          python${PYTHON_VERSION} -m venv $VIRTUAL_ENV --system-site-packages && pip install --no-cache -U pip wheel && dnf clean all

      ## Python Habana base #################################################################
      FROM python-install as python-habana-base

      ENV VIRTUAL_ENV=/opt/vllm
      ENV PATH="$VIRTUAL_ENV/bin:$PATH"

      # install Habana Software and common dependencies
      RUN --mount=type=cache,target=/root/.cache/pip \
          --mount=type=bind,source=requirements-common.txt,target=requirements-common.txt \
          --mount=type=bind,source=requirements-hpu.txt,target=requirements-hpu.txt \
          pip install \
          -r requirements-hpu.txt

      ## Builder #####################################################################
      FROM python-habana-base AS build

      # install build dependencies

      # copy input files
      COPY csrc csrc
      COPY setup.py setup.py
      COPY cmake cmake
      COPY CMakeLists.txt CMakeLists.txt
      COPY requirements-common.txt requirements-common.txt
      COPY requirements-hpu.txt requirements-hpu.txt
      COPY pyproject.toml pyproject.toml

      # max jobs used by Ninja to build extensions
      ARG max_jobs=2
      ENV MAX_JOBS=${max_jobs}
      # # make sure punica kernels are built (for LoRA)
      # HPU currently doesn't support LoRA
      # ENV VLLM_INSTALL_PUNICA_KERNELS=1

      # Copy the entire directory before building wheel
      COPY vllm vllm

      ENV CCACHE_DIR=/root/.cache/ccache
      RUN --mount=type=cache,target=/root/.cache/ccache \
          --mount=type=cache,target=/root/.cache/pip \
          --mount=type=bind,src=.git,target=/workspace/.git \
          env CFLAGS="-march=haswell" \
          CXXFLAGS="$CFLAGS $CXXFLAGS" \
          CMAKE_BUILD_TYPE=Release \
          python3 setup.py bdist_wheel --dist-dir=dist

      ## Release #####################################################################
      FROM python-install AS vllm-openai

      WORKDIR /workspace

      ENV VIRTUAL_ENV=/opt/vllm
      ENV PATH=$VIRTUAL_ENV/bin/:$PATH

      # Triton needs a CC compiler
      RUN dnf install -y --setopt=install_weak_deps=0 --nodocs gcc \
          && dnf clean all

      # install vllm wheel first, so that torch etc will be installed
      RUN --mount=type=bind,from=build,src=/workspace/dist,target=/workspace/dist \
          --mount=type=cache,target=/root/.cache/pip \
          pip install $(echo dist/*.whl)'[tensorizer]' --verbose

      ENV HF_HUB_OFFLINE=1 \
          PORT=8000 \
          HOME=/home/vllm \
          VLLM_USAGE_SOURCE=production-docker-image

      # setup non-root user for OpenShift
      # In OpenShift the user ID is randomly assigned, for compatibility we also
      # set up a non-root user here.
      RUN umask 002 \
          && useradd --uid 2000 --gid 0 vllm \
          && chmod g+rwx $HOME /usr/src /workspace

      COPY LICENSE /licenses/vllm.md

      USER 2000
      ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
  strategy:
    type: Docker
    noCache: true
    dockerStrategy:
      buildArgs:
        - name: "BASE_IMAGE"
          value: "vault.habana.ai/gaudi-docker/1.18.0/rhel9.4/habanalabs/pytorch-installer-2.4.0:1.18.0-524"
  output:
    to:
      kind: ImageStreamTag
      name: vllm-workload:latest

tests/gaudi/l2/vllm_deployment.yaml

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: vllm-workload-pvc
  namespace: gaudi-validation
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 60Gi
  storageClassName: "" # Add your storage class
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-workload
  namespace: gaudi-validation
  labels:
    app: vllm-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-workload
  template:
    metadata:
      labels:
        app: vllm-workload
    spec:
      containers:
        - name: vllm-container
          image: image-registry.openshift-image-registry.svc:5000/gaudi-validation/vllm-workload:latest
          command: [ "/bin/bash", "-c", "--" ]
          args: ["vllm serve meta-llama/Llama-3.1-8B"] # Add the model
          ports:
            - containerPort: 8000
          resources:
            limits:
              habana.ai/gaudi: 1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: hf-token
            - name: HF_HOME
              value: /home/vllm/.cache/huggingface
            - name: HF_HUB_OFFLINE
              value: "0"
          imagePullPolicy: Always
          volumeMounts:
            - name: hf-cache
              mountPath: /home/vllm/.cache
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: hf-cache
          persistentVolumeClaim:
            claimName: vllm-workload-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"

tests/gaudi/l2/vllm_hf_secret.yaml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Copyright (c) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: gaudi-validation
type: Opaque
data:
  hf-token: # Add your base64-encoded token
