
Use the unified script to start vLLM #1446

Merged · 4 commits · Jun 19, 2025

38 changes: 26 additions & 12 deletions scripts/quickstart/README.md
@@ -15,7 +15,7 @@ This guide provides the step-by-step instructions on deploying and running DeepS
- [Single-Node Setup and Serving Deployment](#single-node-setup-and-serving-deployment)
- [Download and Install vLLM](#download-and-install-vllm)
- [HCCL Demo Test](#hccl-demo-test)
- [Adjust Workload Parameters in Start Script if Required](#adjust-workload-parameters-in-start-script-if-required)
- [Parameters of the vLLM Server Start Script](#parameters-of-the-vllm-server-start-script)
- [Launch vLLM Serving with TP=8](#launch-vllm-serving-with-tp8)
- [Send Requests to Ensure the Service Functionality](#send-requests-to-ensure-the-service-functionality)
- [Multi-Node Setup and Serving Deployment](#multi-node-setup-and-serving-deployment)
@@ -131,7 +131,7 @@ make
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32m --test all_reduce --loop 1000 --ranks_per_node 8
```

The HCCL test passes if the message below is shown. For a Gaudi PCIe system without host NIC scale-out, the communication needs to go through the CPU UPI, and the NW bandwidth is less than 20GB/s.
The HCCL test passes if the message below is shown. For a Gaudi PCIe system without host NIC scale-out, the communication needs to go through the CPU UPI, and the NW bandwidth should be about 18GB/s.
```bash
#########################################################################################
[BENCHMARK] hcclAllReduce(dataSize=33554432, count=8388608, dtype=float, iterations=1000)
@@ -141,18 +141,33 @@
```
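
To read the achieved bandwidth without scanning the full log, you can filter the benchmark summary (a minimal sketch; it assumes the hccl_demo summary prints bandwidth lines such as `NW Bandwidth`, which may vary between hccl_demo versions):

```bash
# Re-run the all_reduce benchmark and keep only the bandwidth figures from the summary.
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 \
    --size 32m --test all_reduce --loop 1000 --ranks_per_node 8 | grep -i "bandwidth"
```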


### Adjust Workload Parameters in Start Script if Required
Modify `single_16k_len.sh` (for 16k context) or `single_32k_len.sh` (for 32k context) to set the following parameters according to your actual workload requirements, for example the correct model path, vLLM port, and vLLM warmup cache folder.
### Parameters of the vLLM Server Start Script
Several system environment variables need to be set to get the best vLLM performance. The sample script sets the recommended values automatically.
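
For reference, the excerpt below shows a few of the variables that the script exports (values taken from the script in this PR; some of them depend on the chosen `max_model_len`, so treat this as an illustration rather than something to set by hand):

```bash
# Illustration only -- start_vllm.sh exports these (and more) automatically.
export VLLM_GPU_MEMORY_UTILIZATION=0.85   # switched to 0.75 when max_model_len > 16384
export VLLM_GRAPH_RESERVED_MEM=0.2
export VLLM_GRAPH_PROMPT_RATIO=0
export VLLM_DELAYED_SAMPLING="true"
```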

The script file `start_vllm.sh` is used to start the vLLM service. Execute the command below to check its supported parameters.
```bash
#To be changed
model_path=/data/hf_models/DeepSeek-R1-Gaudi
vllm_port=8688
export PT_HPU_RECIPE_CACHE_CONFIG=/data/cache/16k_cache,false,16384
bash start_vllm.sh -h
```

The command output looks like the following.
```bash
Start vllm server for a huggingface model on Gaudi.

Syntax: bash start_vllm.sh <-w> [-u:p:l:b:c:s] [-h]
options:
w Weights of the model, could be model id in huggingface or local path
u URL of the server, str, default=0.0.0.0
p Port number for the server, int, default=8688
l max_model_len for vllm, int, default=16384, maximal value for single node: 32768
b max_num_seqs for vllm, int, default=128
c Cache HPU recipe to the specified path, str, default=None
s Skip warmup or not, bool, default=false
h Help info
```
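
For example, for a quick functional check you could point the script at a local model, reduce the batch size, and skip the warmup phase (a sketch based on the options above; skipping warmup shortens startup at the cost of steady-state performance):

```bash
# Quick functional check: local model path, default URL/port, small batch, no warmup.
bash start_vllm.sh -w /data/hf_models/DeepSeek-R1-Gaudi -b 32 -s
```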

### Launch vLLM Serving with TP=8
```bash
bash single_16k_len.sh
bash start_vllm.sh -w /data/hf_models/DeepSeek-R1-Gaudi -u 0.0.0.0 -p 8688 -b 128 -l 16384 -c /data/warmup_cache
```

It takes more than one hour to load and warm up the model the first time; the warmup is much faster when the warmup cache is reused. A typical output after completion is shown below, and the vLLM server is ready to serve once this log appears.
@@ -163,7 +178,6 @@ INFO 04-09 00:49:01 api_server.py:800] None
INFO 04-09 00:49:01 api_server.py:937] Starting vLLM API server on http://0.0.0.0:8688
INFO 04-09 00:49:01 launcher.py:23] Available routes are:
INFO 04-09 00:49:01 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET

```
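
If the deployment is scripted, you can poll the server instead of watching the log (a minimal sketch; it assumes the OpenAI-compatible server exposes the standard `/health` endpoint and that the default port 8688 is used):

```bash
# Wait until the vLLM API server answers; adjust host/port if you changed -u or -p.
until curl -sf http://127.0.0.1:8688/health > /dev/null; do
    echo "waiting for vLLM server ..."
    sleep 30
done
echo "vLLM server is up"
```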

### Send Requests to Ensure the Service Functionality
@@ -173,11 +187,11 @@ On bare metal, execute the following command to send a request to the Chat Completions endpoint:
```bash
curl http://127.0.0.1:8688/v1/chat/completions \
-X POST \
-d '{"model": "/models/DeepSeek-R1-G2-gaudi", "messages": [{"role": "user", "content": "List 3 countries and their capitals."}], "max_tokens":128}' \
-d '{"model": "/data/hf_models/DeepSeek-R1-Gaudi", "messages": [{"role": "user", "content": "List 3 countries and their capitals."}], "max_tokens":128}' \
-H 'Content-Type: application/json'
```
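
If `jq` is installed, you can extract just the generated text from the JSON response (a convenience sketch; the response follows the OpenAI-compatible Chat Completions schema):

```bash
# Same request as above, printing only the assistant message content.
curl -s http://127.0.0.1:8688/v1/chat/completions \
  -X POST \
  -d '{"model": "/data/hf_models/DeepSeek-R1-Gaudi", "messages": [{"role": "user", "content": "List 3 countries and their capitals."}], "max_tokens":128}' \
  -H 'Content-Type: application/json' | jq -r '.choices[0].message.content'
```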

If it responds normally, refer to [Run The Benchmark](#Check-the-vLLM-performanc) and [Check the Model Accuracy](#check-the-model-accuracy) to measure the performance and accuracy.
If it responds normally, refer to [Check the vLLM Performance](#check-the-vllm-performance) and [Check the Model Accuracy](#check-the-model-accuracy) to measure the performance and accuracy.


## Multi-Node Setup and Serving Deployment
120 changes: 0 additions & 120 deletions scripts/quickstart/single_32k_len.sh

This file was deleted.

@@ -5,12 +5,74 @@
BASH_DIR=$(dirname "${BASH_SOURCE[0]}")
source "$BASH_DIR"/utils.sh

ray stop --force

#To be changed
Help() {
# Display Help
echo "Start vllm server for a huggingface model on Gaudi."
echo
echo "Syntax: bash start_vllm.sh <-w> [-u:p:l:b:c:s] [-h]"
echo "options:"
echo "w Weights of the model, could be model id in huggingface or local path"
echo "u URL of the server, str, default=0.0.0.0"
echo "p Port number for the server, int, default=8688"
echo "l max_model_len for vllm, int, default=16384, maximal value for single node: 32768"
echo "b max_num_seqs for vllm, int, default=128"
echo "c Cache HPU recipe to the specified path, str, default=None"
echo "s Skip warmup or not, bool, default=false"
echo "h Help info"
echo
}

#Default values for parameters
model_path=/data/hf_models/DeepSeek-R1-Gaudi
vllm_port=8688
export PT_HPU_RECIPE_CACHE_CONFIG=/data/16k_cache,false,16384
warmup_cache_path=/data/warmup_cache
max_num_seqs=128
host=0.0.0.0
max_model_len=16384


while getopts hw:u:p:l:b:c:s flag; do
case $flag in
h) # display Help
Help
exit
;;
w) # get model path
model_path=$OPTARG ;;
u) # get the URL of the server
host=$OPTARG ;;
p) # get the port of the server
vllm_port=$OPTARG ;;
l) # max-model-len
max_model_len=$OPTARG ;;
b) # batch size
max_num_seqs=$OPTARG ;;
c) # use_recipe_cache
warmup_cache_path=$OPTARG ;;
s) # skip_warmup
skip_warmup=true ;;
\?) # Invalid option
echo "Error: Invalid option"
Help
exit
;;
esac
done


if [ "$warmup_cache_path" != "" ]; then
echo "HPU recipe cache will be saved to $warmup_cache_path"
export PT_HPU_RECIPE_CACHE_CONFIG=${warmup_cache_path},false,16384
mkdir -p "${warmup_cache_path}"
fi

if [ "$skip_warmup" = "true" ]; then
echo "VLLM_SKIP_WARMUP is set to True"
export VLLM_SKIP_WARMUP=True
fi


ray stop --force

# check platform
if hl-smi 2>/dev/null | grep -q HL-225; then
@@ -41,21 +103,23 @@ block_size=128
# DO NOT change ends...

# memory footprint tuning params
export VLLM_GPU_MEMORY_UTILIZATION=0.85
export VLLM_GRAPH_RESERVED_MEM=0.1
if (( max_model_len <= 16384 )); then
export VLLM_GPU_MEMORY_UTILIZATION=0.85
else
export VLLM_GPU_MEMORY_UTILIZATION=0.75
fi
export VLLM_GRAPH_RESERVED_MEM=0.2
export VLLM_GRAPH_PROMPT_RATIO=0
export VLLM_MLA_DISABLE_REQUANTIZATION=0
export VLLM_DELAYED_SAMPLING="true"
export VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0
#export VLLM_MOE_SLICE_LENGTH=20480

# params
max_model_len=16384
max_num_batched_tokens=16384
max_num_seqs=128
max_num_batched_tokens=$max_model_len
input_min=1
input_max=16384
output_max=16384
input_max=$max_model_len
output_max=$max_model_len


unset VLLM_PROMPT_BS_BUCKET_MIN VLLM_PROMPT_BS_BUCKET_STEP VLLM_PROMPT_BS_BUCKET_MAX
@@ -103,7 +167,7 @@ echo " environments are reseted "
env | grep VLLM


python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port $vllm_port \
python3 -m vllm.entrypoints.openai.api_server --host $host --port $vllm_port \
--block-size 128 \
--model $model_path \
--device hpu \