@@ -75,3 +75,104 @@ Welcome to HCCL demo
[BENCHMARK] Algo Bandwidth : 147.548069 GB/s
####################################################################################################
```
+
+ ## vLLM
+ vLLM is a serving engine for LLMs. The following workload deploys a vLLM server with an LLM on Intel Gaudi. Refer to the [Intel Gaudi vLLM fork](https://github.com/HabanaAI/vllm-fork.git) for more details.
+
+ Build the workload container image:
+ ```
+ $ git clone https://github.com/HabanaAI/vllm-fork.git --branch v1.18.0
+
+ $ cd vllm-fork/
+
+ $ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/vllm_buildconfig.yaml
+
+ $ oc start-build vllm-workload --from-dir=./ --follow
+ ```
+ Check that the build has completed:
+ ```
+ $ oc get builds
+ NAMESPACE          NAME              TYPE     FROM         STATUS     STARTED         DURATION
+ gaudi-validation   vllm-workload-1   Docker   Dockerfile   Complete   7 minutes ago   4m58s
+ ```
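+ If the build is still in progress, you can follow its logs instead, using the build name from the output above:
+ ```
+ $ oc logs -f build/vllm-workload-1
+ ```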
+
+ Deploy the workload:
+ * Update the Hugging Face token in the `vllm_hf_secret.yaml` file; refer to the [Hugging Face documentation](https://huggingface.co/docs/hub/en/security-tokens) for more details on access tokens.
+ ```
+ $ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/vllm_hf_secret.yaml
+ ```
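+ To confirm the token was stored, check that the secret exists (the secret name below is an assumption; use the name defined in `vllm_hf_secret.yaml`):
+ ```
+ $ oc get secret hf-token -n gaudi-validation
+ ```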
+ The `meta-llama/Llama-3.1-8B` model is used in this deployment; the Hugging Face token is required to access such gated models.
+ * For the PV setup with NFS, refer to the [OpenShift documentation](https://docs.openshift.com/container-platform/4.17/storage/persistent_storage/persistent-storage-nfs.html).
+ ```
+ $ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/vllm_deployment.yaml
+ ```
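+ You can wait for the deployment to become available before proceeding:
+ ```
+ $ oc rollout status deploy/vllm-workload -n gaudi-validation
+ ```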
+ Create the vLLM service:
+ ```
+ $ oc expose deploy/vllm-workload
+ ```
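+ The service is only reachable inside the cluster; optionally, a route can be created on top of it for external access:
+ ```
+ $ oc expose svc/vllm-workload
+ $ oc get route vllm-workload
+ ```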
+ Verify that the pod and service are running:
+ ```
+ $ oc get pods
+ NAME                             READY   STATUS      RESTARTS   AGE
+ vllm-workload-1-build            0/1     Completed   0          19m
+ vllm-workload-55f7c6cb7b-cwj2b   1/1     Running     0          8m36s
+
+ $ oc get svc
+ NAME            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
+ vllm-workload   ClusterIP   1.2.3.4      <none>        8000/TCP   114s
+ ```
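+ Once the pod is running, the server's health endpoint can be queried from inside the cluster; vLLM's OpenAI-compatible server returns HTTP 200 with an empty body when it is ready:
+ ```
+ sh-5.1# curl -i http://vllm-workload.gaudi-validation.svc.cluster.local:8000/health
+ ```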
+ Check the workload logs:
+ ```
+ $ oc logs vllm-workload-55f7c6cb7b-cwj2b
+
+ INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_MIN=32 (default:min)
+ INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_STEP=32 (default:step)
+ INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_MAX=256 (default:max)
+ INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:min)
+ INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:step)
+ INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:max)
+ INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:min)
+ INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:step)
+ INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:max)
+ INFO 10-30 19:35:53 habana_model_runner.py:691] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
+ INFO 10-30 19:35:53 habana_model_runner.py:696] Decode bucket config (min, step, max_warmup) bs:[32, 32, 256], block:[128, 128, 4096]
+ ============================= HABANA PT BRIDGE CONFIGURATION ===========================
+ PT_HPU_LAZY_MODE = 1
+ PT_RECIPE_CACHE_PATH =
+ PT_CACHE_FOLDER_DELETE = 0
+ PT_HPU_RECIPE_CACHE_CONFIG =
+ PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
+ PT_HPU_LAZY_ACC_PAR_MODE = 1
+ PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
+ PT_HPU_EAGER_PIPELINE_ENABLE = 1
+ PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
+ ---------------------------: System Configuration :---------------------------
+ Num CPU Cores : 160
+ CPU RAM : 1056371848 KB
+ ------------------------------------------------------------------------------
+ INFO 10-30 19:35:56 selector.py:85] Using HabanaAttention backend.
+ INFO 10-30 19:35:56 loader.py:284] Loading weights on hpu ...
+ INFO 10-30 19:35:56 weight_utils.py:224] Using model weights format ['*.safetensors', '*.bin', '*.pt']
+ Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
+ Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:03<00:11, 3.87s/it]
+ Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:07<00:07, 3.71s/it]
+ Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:10<00:03, 3.59s/it]
+ Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00, 2.49s/it]
+ Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00, 2.93s/it]
+ ```
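+ For quick testing from a local machine, you can also port-forward the service instead of creating a route (the command runs until interrupted):
+ ```
+ $ oc port-forward svc/vllm-workload 8000:8000 -n gaudi-validation
+ ```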
+ Run inference requests using the service URL:
+ ```
+ sh-5.1# curl "http://vllm-workload.gaudi-validation.svc.cluster.local:8000/v1/models"
+ {"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B","object":"model","created":1730317412,"owned_by":"vllm","root":"meta-llama/Llama-3.1-8B","parent":null,"max_model_len":131072,"permission":[{"id":"modelperm-452b2bd990834aa5a9416d083fcc4c9e","object":"model_permission","created":1730317412,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
+ ```
+
+ ```
+ sh-5.1# curl http://vllm-workload.gaudi-validation.svc.cluster.local:8000/v1/completions -H "Content-Type: application/json" -d '{
+ "model": "meta-llama/Llama-3.1-8B",
+ "prompt": "A constellation is a",
+ "max_tokens": 10
+ }'
+ {"id":"cmpl-9a0442d0da67411081837a3a32a354f2","object":"text_completion","created":1730321284,"model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":" group of individual stars that forms a pattern or figure","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":15,"completion_tokens":10}}
+ ```
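+ Since the endpoint follows the OpenAI completions API, it also supports streaming; adding `"stream": true` to the request body returns the completion incrementally as server-sent events:
+ ```
+ sh-5.1# curl http://vllm-workload.gaudi-validation.svc.cluster.local:8000/v1/completions -H "Content-Type: application/json" -d '{
+ "model": "meta-llama/Llama-3.1-8B",
+ "prompt": "A constellation is a",
+ "max_tokens": 10,
+ "stream": true
+ }'
+ ```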