
Commit 1c39143

add readme file for auto_tune.sh

Signed-off-by: Chenyaaang <chenyangli@google.com>

1 parent b140416 commit 1c39143

File tree

2 files changed: +138 -30 lines changed

benchmarks/auto_tune/README.md

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
# Automated vLLM Server Parameter Tuning

This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Configuration](#configuration)
- [How to Run](#how-to-run)
- [Example Use Cases](#example-use-cases)
- [Output](#output)
- [How It Works](#how-it-works)

## Prerequisites

Before running the script, please ensure the following steps are completed:

1. **Clone vLLM & Set Up Branch**: Clone the vLLM repository and check out your desired branch.

    ```bash
    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    # git checkout <your-branch>
    ```

2. **Install Environment**: Install or update the correct running environment. For TPU usage, activate your `conda` environment and install the corresponding `torch` and `torch_xla` versions.

3. **Model Configuration**: If you are using a customized model, ensure its configuration files are correctly placed and accessible.
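
As an illustration, a from-source install on a GPU machine might look like the sketch below; it assumes a fresh Python virtual environment (for TPU, install the matching `torch`/`torch_xla` wheels instead, as noted in step 2):

```bash
# Hypothetical GPU environment setup; adapt to your hardware.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .  # build and install vLLM from the checked-out source
```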

## Configuration

You must set the following variables at the top of the script before execution.

| Variable | Description | Example Value |
| --- | --- | --- |
| `BASE` | **Required.** The absolute path to the parent directory of your vLLM repository directory. | `"$HOME"` |
| `MODEL` | **Required.** The Hugging Face model identifier to be served by vLLM. | `"meta-llama/Llama-3.1-8B-Instruct"` |
| `SYSTEM` | **Required.** The hardware you are running on. Choices: `TPU` or `GPU`. (Other systems may not support saving profiler traces.) | `"TPU"` |
| `TP` | **Required.** The tensor-parallelism size. | `1` |
| `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) |
| `INPUT_LEN` | **Required.** Request input length. | `4000` |
| `OUTPUT_LEN` | **Required.** Request output length. | `16` |
| `MIN_CACHE_HIT_PCT` | Prefix cache hit rate in percentage (0-100). Set to `0` to disable. | `60` |
| `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. | `500` |
| `NUM_SEQS_LIST` | A space-separated string of `max-num-seqs` values to test. | `"128 256"` |
| `NUM_BATCHED_TOKENS_LIST` | A space-separated string of `max-num-batched-tokens` values to test. | `"1024 2048 4096"` |

**Note**: The default `NUM_SEQS_LIST` and `NUM_BATCHED_TOKENS_LIST` are set for medium-sized inputs/outputs. For very short contexts (e.g., 20 input, 20 output tokens), you may need to test larger values for `max-num-seqs`.
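
Putting it together, the top of the script might be edited like this sketch (values taken from the example column above; adjust for your model and hardware):

```bash
BASE="$HOME"
MODEL="meta-llama/Llama-3.1-8B-Instruct"
SYSTEM="TPU"
TP=1
DOWNLOAD_DIR=""                      # empty = default download path
INPUT_LEN=4000
OUTPUT_LEN=16
MIN_CACHE_HIT_PCT=60                 # 0 disables the prefix-cache constraint
MAX_LATENCY_ALLOWED_MS=500           # very large number = no latency constraint
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="1024 2048 4096"
```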

## How to Run

1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section.
2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost.

    ```bash
    cd <FOLDER_OF_THIS_SCRIPT>
    bash auto_tune.sh
    ```

Please note that the command line used to launch the script must not contain the keyword `vllm` in any full or partial path (hence the `cd` into the script's folder followed by plain `bash auto_tune.sh`); otherwise, the script's internal `pkill -f vllm` cleanup command will also kill the script itself.
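
For example, one way to keep the run alive across a dropped connection with `tmux` (the session name `autotune` is arbitrary):

```bash
tmux new-session -s autotune   # start a named session
cd <FOLDER_OF_THIS_SCRIPT>     # run via a relative path so the command line contains no "vllm"
bash auto_tune.sh
# Detach with Ctrl-b d; reattach later with: tmux attach -t autotune
```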

## Example Use Cases

Here are a few examples of how to configure the script for different goals:

### 1. Maximize Throughput (No Latency Constraint)

- **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens.
- **Configuration**:

    ```bash
    INPUT_LEN=1800
    OUTPUT_LEN=20
    MIN_CACHE_HIT_PCT=0
    MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
    ```

### 2. Maximize Throughput with a Latency Requirement

- **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms.
- **Configuration**:

    ```bash
    INPUT_LEN=1800
    OUTPUT_LEN=20
    MIN_CACHE_HIT_PCT=0
    MAX_LATENCY_ALLOWED_MS=500
    ```

### 3. Maximize Throughput with Prefix Caching and Latency Requirements

- **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms.
- **Configuration**:

    ```bash
    INPUT_LEN=1800
    OUTPUT_LEN=20
    MIN_CACHE_HIT_PCT=60
    MAX_LATENCY_ALLOWED_MS=500
    ```

## Output

After the script finishes, you will find the results in a new, timestamped directory created inside `$BASE/auto-benchmark/`.

- **Log Files**: The directory (`$BASE/auto-benchmark/YYYY_MM_DD_HH_MM/`) contains detailed logs for each run:
    - `vllm_log_...txt`: The log output from the vLLM server for each parameter combination.
    - `bm_log_...txt`: The log output from the `benchmark_serving.py` script for each benchmark run.

- **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found.

    ```
    # Example result.txt content
    hash:a1b2c3d4...
    max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8
    max_num_seqs: 128, max_num_batched_tokens: 4096 does not meet latency requirement 500
    ...
    best_max_num_seqs: 256, best_num_batched_tokens: 2048, best_throughput: 12.5, profile saved in: /home/user/vllm/auto-benchmark/2024_08_01_10_30/profile
    ```

    If the script cannot find a combination that meets the requirements, the final row will be `best_max_num_seqs: 0, best_num_batched_tokens: 0, best_throughput: 0`. This can happen if the server fails to start properly or if the latency requirement is too strict.

- **Profiler Trace**: A directory named `profile` is created inside the log directory. It contains the profiler trace file (e.g., `.xplane.pb` for TPU or a `.json` trace for GPU) from the single best-performing run.
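
After a run completes, the summary and trace can be inspected directly; a sketch (the timestamped directory name is an example, and viewing TPU `.xplane.pb` traces additionally assumes the TensorBoard profile plugin is installed):

```bash
RUN_DIR="$BASE/auto-benchmark/2024_08_01_10_30"   # substitute your run's timestamp
tail -n 1 "$RUN_DIR/result.txt"                   # the final best_... summary line
tensorboard --logdir "$RUN_DIR/profile"           # open the saved profiler trace
```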

## How It Works

The script follows a systematic process to find the optimal parameters:

1. **Find Max GPU Memory Utilization**: The script first determines the highest safe `gpu-memory-utilization` (starting from 0.98 and decreasing) that does not cause an Out-Of-Memory (OOM) error when launching the server. This ensures the benchmark runs use the maximum available memory without crashing. (See the sketch after this list.)

2. **Iterate and Benchmark**: It then enters a nested loop, iterating through every combination of `max-num-seqs` and `max-num-batched-tokens` provided in the configuration lists.

3. **Latency-Aware Throughput Search**: For each parameter combination:
    - The vLLM server is started.
    - A benchmark is first run with an infinite request rate (`--request-rate inf`).
    - If the resulting P99 E2E latency is within the `MAX_LATENCY_ALLOWED_MS` limit, this throughput is considered the maximum for this configuration.
    - If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement. (See the sketch after this list.)

4. **Track Best Result**: Throughout the process, the script tracks the parameter combination that has yielded the highest valid throughput so far.

5. **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard.
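
The search strategy in steps 1 and 3 can be summarized as follows. This is an illustrative sketch only: `start_server` and `run_benchmark` are hypothetical stand-ins for the script's actual server-launch and benchmarking logic, and the stub bodies just return fixed values so the sketch runs.

```bash
#!/bin/bash
# Illustrative sketch of the search strategy; not the script's actual code.

start_server() { echo "gpu-memory-utilization=$1" >&2; }  # stub: launch server, return 0 unless OOM
run_benchmark() { echo "450.5"; }                         # stub: run at request rate $1, print P99 E2E latency (ms)

MAX_LATENCY_ALLOWED_MS=500

# Step 1: walk gpu-memory-utilization down from 0.98 until the server starts without OOM.
util=0.98
while ! start_server "$util"; do
    util=$(echo "$util - 0.01" | bc)
done

# Step 3: start from an effectively infinite request rate, then step it down until latency fits.
rate=10000
while true; do
    p99=$(run_benchmark "$rate")
    if (( $(echo "$p99 <= $MAX_LATENCY_ALLOWED_MS" | bc -l) )); then
        break   # constraint met: record this throughput for the current parameter combination
    fi
    rate=$(echo "$rate - 1000" | bc)   # hypothetical step size
done
```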

benchmarks/auto_tune.sh renamed to benchmarks/auto_tune/auto_tune.sh

Lines changed: 1 addition & 30 deletions

@@ -1,36 +1,7 @@
 #!/bin/bash

 # This script aims to tune the best server parameter combinations to maximize throughput for given requirement.
-# The current server parameter combination is max_num_seqs and max_num_batched_tokens
-# It also supports additional requirement: e2e latency and prefix cache.
-
-# Pre-requisite:
-# 1. Checkout to your branch, install/ update the correct running env. For TPU, activate conda env and install the corresponding torch, xla version.
-# 2. If the model is customized, replace the MODEL's config with the customized config.
-# 3. Set variables (ALL REQUIRED)
-# BASE: your directory for vllm repo
-# MODEL: the model served by vllm
-# SYSTEM: the hardware, choice TPU or GPU, for other systems, "get best profile" might not support.
-# TP: ways of tensor parallelism
-# DOWNLOAD_DIR: directory to download and load model weights.
-# INPUT_LEN: request input len
-# OUTPUT_LEN: request output len
-# MIN_CACHE_HIT_PCT: prefix cache rate
-# MAX_LATENCY_ALLOWED_MS: (e2e) latency requirement. If there's no latency requirement, set it to a large number like 1000000000
-# NUM_SEQS_LIST: a list of `max-num-seqs` you want to loop with.
-# NUM_BATCHED_TOKENS_LIST: a list of `max-num-batched-tokens` you want to loop with.
-# Note that the default NUM_SEQS_LIST and NUM_BATCHED_TOKENS_LIST are set for medium size input/output len, for extra short context (such as 20:20), you might need to include larger numbers in NUM_SEQS_LIST.
-# 4. Run the script, it might take a long time, you can use tmux to avoid the script stop if disconnection happens.
-# 5. The final result will be saved in RESULT file.
-
-
-# Example use cases
-# 1. Given input_len=1800, output_len=20, what's the best max_num_seqs and max_num_batched_tokens to get highest throughput?
-# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=0, MAX_LATENCY_ALLOWED_MS=100000000000
-# 2. If we have latency requirement to be lower than 500ms, what's the best server parameter?
-# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=0, MAX_LATENCY_ALLOWED_MS=500
-# 3. If we want to reach 60% prefix cache, what's the best server parameter?
-# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=60, MAX_LATENCY_ALLOWED_MS=500
+# See details in README.

 TAG=$(date +"%Y_%m_%d_%H_%M")
 BASE=""
