# Automated vLLM Server Parameter Tuning

This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as end-to-end (E2E) latency and prefix cache hit rate.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Configuration](#configuration)
- [How to Run](#how-to-run)
- [Example Use Cases](#example-use-cases)
- [Output](#output)
- [How It Works](#how-it-works)

## Prerequisites

Before running the script, please ensure the following steps are completed:

1. **Clone vLLM & Set Up Branch**: Clone the vLLM repository and check out your desired branch.

   ```bash
   git clone https://github.com/vllm-project/vllm.git
   cd vllm
   # git checkout <your-branch>
   ```

2. **Install Environment**: Install or update the required runtime environment. For TPU usage, activate your `conda` environment and install the matching `torch` and `torch_xla` versions.

3. **Model Configuration**: If you are using a customized model, ensure its configuration files are correctly placed and accessible.

## Configuration

You must set the following variables at the top of the script before execution.

| Variable | Description | Example Value |
| --- | --- | --- |
| `BASE` | **Required.** The absolute path to the parent directory of your vLLM repository directory. | `"$HOME"` |
| `MODEL` | **Required.** The Hugging Face model identifier to be served by vLLM. | `"meta-llama/Llama-3.1-8B-Instruct"` |
| `SYSTEM` | **Required.** The hardware you are running on. Choices: `TPU` or `GPU`. (Profile saving may not be supported on other systems.) | `"TPU"` |
| `TP` | **Required.** The tensor-parallelism size. | `1` |
| `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) |
| `INPUT_LEN` | **Required.** Request input length. | `4000` |
| `OUTPUT_LEN` | **Required.** Request output length. | `16` |
| `MIN_CACHE_HIT_PCT` | Prefix cache hit rate in percentage (0-100). Set to `0` to disable. | `60` |
| `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. | `500` |
| `NUM_SEQS_LIST` | A space-separated string of `max-num-seqs` values to test. | `"128 256"` |
| `NUM_BATCHED_TOKENS_LIST` | A space-separated string of `max-num-batched-tokens` values to test. | `"1024 2048 4096"` |

**Note**: The default `NUM_SEQS_LIST` and `NUM_BATCHED_TOKENS_LIST` are tuned for medium-sized inputs/outputs. For very short contexts (e.g., 20 input and 20 output tokens), you may need to test larger `max-num-seqs` values.
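
As a concrete illustration, the top of the script could be filled in like this. The values are examples drawn from the table above, not recommendations; adjust them for your model and hardware:

```shell
# Example configuration block (illustrative values from the table above)
BASE="$HOME"                              # parent directory of the vllm repo
MODEL="meta-llama/Llama-3.1-8B-Instruct"  # Hugging Face model identifier
SYSTEM="TPU"                              # TPU or GPU
TP=1                                      # tensor-parallelism size
DOWNLOAD_DIR=""                           # empty string = default download path
INPUT_LEN=4000                            # request input length
OUTPUT_LEN=16                             # request output length
MIN_CACHE_HIT_PCT=0                       # 0 disables the prefix-cache constraint
MAX_LATENCY_ALLOWED_MS=100000000000       # effectively no latency constraint
NUM_SEQS_LIST="128 256"                   # max-num-seqs values to test
NUM_BATCHED_TOKENS_LIST="1024 2048 4096"  # max-num-batched-tokens values to test
```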

## How to Run

1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section.
2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost.

   ```bash
   cd <FOLDER_OF_THIS_SCRIPT>
   bash auto_tune.sh
   ```

   Please note that the path used to invoke `bash auto_tune.sh` must not contain the keyword `vllm`, in full or in part; otherwise the script's own `pkill -f vllm` cleanup command will also kill the script itself.
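
   Because `pkill -f vllm` matches any process whose full command line contains the substring `vllm`, a small pre-flight guard can catch an unsafe path early. This check is not part of `auto_tune.sh`; it is just a sketch of the idea:

   ```shell
   # Hypothetical guard: report whether a working-directory path is safe
   # to run from, given that `pkill -f vllm` matches any command line
   # containing the substring "vllm".
   path_is_safe() {
       case "$1" in
           *vllm*) return 1 ;;  # unsafe: pkill -f vllm would match this path
           *)      return 0 ;;
       esac
   }

   path_is_safe "/home/user/tuning" && echo "safe to run here"
   path_is_safe "/home/user/vllm/benchmarks" || echo "unsafe: move the script elsewhere"
   ```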

## Example Use Cases

Here are a few examples of how to configure the script for different goals:

#### 1. Maximize Throughput (No Latency Constraint)

* **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens.
* **Configuration**:

  ```bash
  INPUT_LEN=1800
  OUTPUT_LEN=20
  MIN_CACHE_HIT_PCT=0
  MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
  ```

#### 2. Maximize Throughput with a Latency Requirement

* **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500 ms.
* **Configuration**:

  ```bash
  INPUT_LEN=1800
  OUTPUT_LEN=20
  MIN_CACHE_HIT_PCT=0
  MAX_LATENCY_ALLOWED_MS=500
  ```

#### 3. Maximize Throughput with Prefix Caching and a Latency Requirement

* **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500 ms.
* **Configuration**:

  ```bash
  INPUT_LEN=1800
  OUTPUT_LEN=20
  MIN_CACHE_HIT_PCT=60
  MAX_LATENCY_ALLOWED_MS=500
  ```

## Output

After the script finishes, you will find the results in a new timestamped directory created inside `$BASE/auto-benchmark/`.

- **Log Files**: The directory (`$BASE/auto-benchmark/YYYY_MM_DD_HH_MM/`) contains detailed logs for each run:
  - `vllm_log_...txt`: The log output from the vLLM server for each parameter combination.
  - `bm_log_...txt`: The log output from the `benchmark_serving.py` script for each benchmark run.

- **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found.

  ```
  # Example result.txt content
  hash:a1b2c3d4...
  max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8
  max_num_seqs: 128, max_num_batched_tokens: 4096 does not meet latency requirement 500
  ...
  best_max_num_seqs: 256, best_num_batched_tokens: 2048, best_throughput: 12.5, profile saved in: /home/user/vllm/auto-benchmark/2024_08_01_10_30/profile
  ```

  If no valid parameters are found, the final row will be `best_max_num_seqs: 0, best_num_batched_tokens: 0, best_throughput: 0`. This can happen either because the server failed to start properly or because the latency requirement is too strict.

- **Profiler Trace**: A directory named `profile` is created inside the log directory. It contains the profiler trace file (e.g., `.xplane.pb` for TPU or a `.json` trace for GPU) from the single best-performing run.
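
Because `result.txt` is plain text with a fixed line format, the winning configuration can be pulled out with standard tools. A small sketch follows; the sample content and temp-file path are made up to mirror the format shown above:

```shell
# Create a sample result.txt mirroring the format shown above (made-up data).
RESULT=$(mktemp)
cat > "$RESULT" <<'EOF'
hash:a1b2c3d4
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8
best_max_num_seqs: 256, best_num_batched_tokens: 2048, best_throughput: 12.5, profile saved in: /home/user/profile
EOF

# Grab the summary line and extract the best throughput from it.
best_line=$(grep '^best_max_num_seqs' "$RESULT")
best_tput=$(printf '%s\n' "$best_line" | sed 's/.*best_throughput: \([0-9.]*\).*/\1/')
echo "best throughput: $best_tput"
```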

## How It Works

The script follows a systematic process to find the optimal parameters:

1. **Find Max GPU Memory Utilization**: The script first determines the highest safe `gpu-memory-utilization` (starting from 0.98 and decreasing) that does not cause an Out-Of-Memory (OOM) error when launching the server. This ensures the benchmark runs use the maximum available memory without crashing.

2. **Iterate and Benchmark**: It then enters a nested loop, iterating through every combination of `max-num-seqs` and `max-num-batched-tokens` provided in the configuration lists.

3. **Latency-Aware Throughput Search**: For each parameter combination:
    * The vLLM server is started.
    * A benchmark is first run with an infinite request rate (`--request-rate inf`).
    * If the resulting P99 E2E latency is within the `MAX_LATENCY_ALLOWED_MS` limit, this throughput is considered the maximum for this configuration.
    * If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement.

4. **Track Best Result**: Throughout the process, the script tracks the parameter combination that has yielded the highest valid throughput so far.

5. **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard.
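
Steps 2-4 can be condensed into a runnable sketch. Here, `start_server` and `run_benchmark` are hypothetical mocks standing in for the real server launch and `benchmark_serving.py` invocation, and the request-rate back-off from step 3 is omitted for brevity:

```shell
# Simplified outline of the tuning loop (steps 2-4). start_server and
# run_benchmark are mocks; the real script launches vLLM and parses
# benchmark output instead.
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="1024 2048"
MAX_LATENCY_ALLOWED_MS=500

start_server() { return 0; }   # mock: pretend the server always starts

run_benchmark() {
    # mock: print "throughput latency_ms"; in this toy model larger
    # settings give more throughput but also more latency
    echo "$(( $1 / 32 + $2 / 512 )) $(( $2 / 4 ))"
}

best_throughput=0; best_seqs=0; best_tokens=0
for seqs in $NUM_SEQS_LIST; do
    for tokens in $NUM_BATCHED_TOKENS_LIST; do
        start_server "$seqs" "$tokens" || continue
        result=$(run_benchmark "$seqs" "$tokens")
        tput=${result% *}      # first field: throughput
        latency=${result#* }   # second field: latency in ms
        # Keep the combination only if it meets the latency constraint
        # and beats the best throughput seen so far.
        if [ "$latency" -le "$MAX_LATENCY_ALLOWED_MS" ] && [ "$tput" -gt "$best_throughput" ]; then
            best_throughput=$tput; best_seqs=$seqs; best_tokens=$tokens
        fi
    done
done
echo "best_max_num_seqs: $best_seqs, best_num_batched_tokens: $best_tokens, best_throughput: $best_throughput"
```

With the toy numbers above, the largest-batch settings are rejected for exceeding the 500 ms limit, mirroring how the real script discards combinations that fail the latency constraint.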