--amend

Chenyaaang · Chenyaaang · commit 1d4656972e1c · 2025-07-14T18:00:51.000Z
Signed-off-by: Chenyaaang <chenyangli@google.com>
diff --git a/benchmarks/auto_tune/README.md b/benchmarks/auto_tune/README.md
@@ -32,15 +32,15 @@ You must set the following variables at the top of the script before execution.
 
 | Variable | Description | Example Value |
 | --- | --- | --- |
-| `BASE` | **Required.** The absolute path to your vLLM repository directory. | `"$HOME"` |
+| `BASE` | **Required.** The absolute path to the parent directory of your vLLM repository directory. | `"$HOME"` |
 | `MODEL` | **Required.** The Hugging Face model identifier to be served by vllm. | `"meta-llama/Llama-3.1-8B-Instruct"` |
 | `SYSTEM`| **Required.** The hardware you are running on. Choices: `TPU` or `GPU`. (For other systems, it might not support saving profiles) | `"TPU"` |
 | `TP` | **Required.** The tensor-parallelism size. | `1` |
 | `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) |
 | `INPUT_LEN` | **Required.** Request input length. | `4000` |
 | `OUTPUT_LEN` | **Required.** Request output length. | `16` |
 | `MIN_CACHE_HIT_PCT` | Prefix cache hit rate in percentage (0-100). Set to `0` to disable. | `60` |
-| `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `1000000000`) to effectively ignore the latency constraint. | `500` |
+| `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. | `500` |
 | `NUM_SEQS_LIST` | A space-separated string of `max-num-seqs` values to test. | `"128 256"` |
 | `NUM_BATCHED_TOKENS_LIST` | A space-separated string of `max-num-batched-tokens` values to test. | `"1024 2048 4096"` |
 
@@ -54,7 +54,7 @@ You must set the following variables at the top of the script before execution.
     cd <FOLDER_OF_THIS_SCRIPT>
     bash auto_tune.sh
     ```
-    Please note that the `bash auto_tune.sh` command cannot contain full or paritial path with keyword `vllm`, otherwise `pkill -f vllm` command will also kill this script itself. 
+    Please note that the `bash auto_tune.sh` command cannot contain full or partial path with keyword `vllm`, otherwise `pkill -f vllm` command will also kill this script itself.
 
 
 ## Example Use Cases
@@ -68,7 +68,7 @@ Here are a few examples of how to configure the script for different goals:
     INPUT_LEN=1800
     OUTPUT_LEN=20
     MIN_CACHE_HIT_PCT=0
-    MAX_LATENCY_ALLOWED_MS=1000000000 # A very large number
+    MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
     ```
 
 #### 2. Maximize Throughput with a Latency Requirement
@@ -108,7 +108,7 @@ After the script finishes, you will find the results in a new, timestamped direc
     ...
     best_max_num_seqs: 256, best_num_batched_tokens: 2048, best_throughput: 12.5, profile saved in: /home/user/vllm/auto-benchmark/2024_08_01_10_30/profile
     ```
-    If it cannot find the best parameters, the final row will be `best_max_num_seqs: 0, best_num_batched_tokens: 0, best_throughput: 0`, it can due to either server didn't start properly, or the latency requirement too strict. 
+    If it cannot find the best parameters, the final row will be `best_max_num_seqs: 0, best_num_batched_tokens: 0, best_throughput: 0`. This can be due to either the server not starting properly, or the latency requirement being too strict.
 
 -   **Profiler Trace**: A directory named `profile` is created inside the log directory. It contains the profiler trace file (e.g., `.xplane.pb` for TPU or a `.json` trace for GPU) from the single best-performing run.
 
@@ -128,4 +128,4 @@ The script follows a systematic process to find the optimal parameters:
 
 4.  **Track Best Result**: Throughout the process, the script tracks the parameter combination that has yielded the highest valid throughput so far.
 
-5.  **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard.
+5.  **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard.