|`MIN_CACHE_HIT_PCT`| Prefix cache hit rate in percentage (0-100). Set to `0` to disable. |`60`|
43
-
|`MAX_LATENCY_ALLOWED_MS`| The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `1000000000`) to effectively ignore the latency constraint. |`500`|
43
+
|`MAX_LATENCY_ALLOWED_MS`| The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. |`500`|
44
44
|`NUM_SEQS_LIST`| A space-separated string of `max-num-seqs` values to test. |`"128 256"`|
45
45
|`NUM_BATCHED_TOKENS_LIST`| A space-separated string of `max-num-batched-tokens` values to test. |`"1024 2048 4096"`|
46
46
@@ -54,7 +54,7 @@ You must set the following variables at the top of the script before execution.
54
54
cd <FOLDER_OF_THIS_SCRIPT>
55
55
bash auto_tune.sh
56
56
```
57
-
Please note that the `bash auto_tune.sh` command cannot contain full or paritial path with keyword `vllm`, otherwise `pkill -f vllm` command will also kill this script itself.
57
+
Please note that the `bash auto_tune.sh` command cannot contain full or partial path with keyword `vllm`, otherwise `pkill -f vllm` command will also kill this script itself.
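The reason is that `pkill -f` matches its pattern against the full command line of every process, so an invocation whose path contains `vllm` matches as well. A minimal sketch of a pre-flight check follows; the helper name `cmdline_is_safe` and the example path are hypothetical and are not part of `auto_tune.sh`.

```shell
# Hypothetical guard, not part of auto_tune.sh: succeeds (returns 0) only
# when the given command line does not contain the substring "vllm".
cmdline_is_safe() {
  case "$1" in
    *vllm*) return 1 ;;
    *) return 0 ;;
  esac
}

# Launched from inside the script's folder: no "vllm" in the command line.
cmdline_is_safe "bash auto_tune.sh" && echo "safe to run"

# Launched via a path that contains "vllm" (example path is made up).
cmdline_is_safe "bash /home/user/vllm/benchmarks/auto_tune.sh" \
  || echo "would be matched by pkill -f vllm"
```

This is why the instructions above `cd` into the script's folder first and then run `bash auto_tune.sh` with no path.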
58
58
59
59
60
60
## Example Use Cases
@@ -68,7 +68,7 @@ Here are a few examples of how to configure the script for different goals:
68
68
INPUT_LEN=1800
69
69
OUTPUT_LEN=20
70
70
MIN_CACHE_HIT_PCT=0
71
-
MAX_LATENCY_ALLOWED_MS=1000000000 # A very large number
71
+
MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
72
72
```
73
73
74
74
#### 2. Maximize Throughput with a Latency Requirement
@@ -108,7 +108,7 @@ After the script finishes, you will find the results in a new, timestamped direc
If it cannot find the best parameters, the final row will be `best_max_num_seqs: 0, best_num_batched_tokens: 0, best_throughput: 0`, it can due to either server didn't start properly, or the latency requirement too strict.
111
+
If it cannot find the best parameters, the final row will be `best_max_num_seqs: 0, best_num_batched_tokens: 0, best_throughput: 0`. This can be due to either the server not starting properly, or the latency requirement being too strict.
112
112
113
113
- **Profiler Trace**: A directory named `profile` is created inside the log directory. It contains the profiler trace file (e.g., `.xplane.pb` for TPU or a `.json` trace for GPU) from the single best-performing run.
114
114
@@ -128,4 +128,4 @@ The script follows a systematic process to find the optimal parameters:
128
128
129
129
4. **Track Best Result**: Throughout the process, the script tracks the parameter combination that has yielded the highest valid throughput so far.
130
130
131
-
5. **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard.
131
+
5. **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard.
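The best-result tracking in step 4 can be sketched as a nested loop over the two parameter lists. This is an illustrative sketch only, not the script's actual code: `fake_benchmark` is a hypothetical stand-in that prints made-up throughput and P99 latency values so the loop runs without a server. The variable names mirror the configuration table and the result row described above.

```shell
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="1024 2048 4096"
MAX_LATENCY_ALLOWED_MS=500

# Hypothetical stand-in for a real benchmark run.
# Prints "<throughput> <p99_latency_ms>"; the formulas are invented.
fake_benchmark() {
  echo "$(( $1 / 16 + $2 / 512 )) $(( $1 + $2 / 8 ))"
}

# All zeros means "no valid combination found".
best_max_num_seqs=0
best_num_batched_tokens=0
best_throughput=0

for num_seqs in $NUM_SEQS_LIST; do
  for num_batched_tokens in $NUM_BATCHED_TOKENS_LIST; do
    # Split the two printed numbers into $1 and $2.
    set -- $(fake_benchmark "$num_seqs" "$num_batched_tokens")
    throughput=$1
    latency_ms=$2
    # A run is valid only if it meets the latency limit.
    if [ "$latency_ms" -le "$MAX_LATENCY_ALLOWED_MS" ] &&
       [ "$throughput" -gt "$best_throughput" ]; then
      best_max_num_seqs=$num_seqs
      best_num_batched_tokens=$num_batched_tokens
      best_throughput=$throughput
    fi
  done
done

echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput"
```

Because the best values start at zero, a run in which every combination fails the latency check (or the server never starts) leaves the final row at all zeros.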