From 86fb9bd33af25f01cf0bb5efa4f4c78691bef983 Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 12:07:10 +0800 Subject: [PATCH 1/5] Optimal Deployment --- README.md | 1 + .../ERNIE-4.5-VL-28B-A3B-Paddle.md | 108 ++++++++++++++++++ docs/optimal_deployment/README.md | 3 + 3 files changed, 112 insertions(+) create mode 100644 docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md create mode 100644 docs/optimal_deployment/README.md diff --git a/README.md b/README.md index b48c17a99e..ab7e17ffa3 100644 --- a/README.md +++ b/README.md @@ -61,6 +61,7 @@ Learn how to use FastDeploy through our documentation: - [Offline Inference Development](./docs/offline_inference.md) - [Online Service Deployment](./docs/online_serving/README.md) - [Full Supported Models List](./docs/supported_models.md) +- [Optimal Deployment](./docs/optimal_deployment/README.md) ## Supported Models diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md new file mode 100644 index 0000000000..37c47a2b43 --- /dev/null +++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -0,0 +1,108 @@ + +# ERNIE-4.5-VL-28B-A3B-Paddle + +**Note**: To enable multi-modal support, add the `--enable-mm` flag to your configuration. + +## Performance Optimization Guide + +To help you achieve the **best performance** with our model, here are several important parameters you may want to adjust. Please read through the following recommendations and tips: + +### **Context Length** +- **Parameter**: `--max-model-len` +- **Description**: Controls the maximum context length the model can process. +- **Recommendation**: We suggest setting this to **32k tokens** for balanced performance and memory usage. +- **Advanced**: If your hardware allows and you need even longer contexts, you can set it up to **128k tokens**. + + ⚠️ Note: Longer contexts require significantly more GPU memory. Please ensure your hardware is sufficient before increasing this value. + +### **Multi-Image & Multi-Video Input** +- **Parameter**: `--limit-mm-per-prompt` +- **Description**: Our model supports multiple images and videos per prompt. Use this parameter to limit the number of images and videos per request, ensuring efficient resource utilization. +- **Recommendation**: We suggest setting this to **100 images** and **100 videos** per prompt for balanced performance and memory usage. + +### **Optimization Recommendations** +> **chunked prefill** +- **Parameter**: `--enable-chunked-prefill` +- **Why enable?** + + Enabling chunked prefill can **reduce peak memory usage** and **increase inference speed**. +- **Additional options**: + - --max-num-batched-tokens + - --max-num-partial-prefills + - --max-long-partial-prefills +- **Tip**: + + The detailed workings of these auxiliary parameters are complex—feel free to use the example values provided in our documentation or scripts. + +> **prefix caching** + +⚠️ Note: Prefix caching is currently not supported in multi-modal mode. + +### **Quantization Precision**: +- **Parameter**: `--quantization` + +- **Supported Types**: + - wint4 (recommended for most users) + - wint8 + - Default: bfloat16 (if no quantization parameter is set) + +- **Recommendation**: +Unless you have extremely strict precision requirements, we strongly recommend using wint4 quantization. This will dramatically reduce memory footprint and improve throughput. +If you need slightly higher precision, try wint8. 
+Only use bfloat16 if your use case demands the highest possible accuracy, as it requires much more memory. + +- **Verified Devices and Performance** + +| Devices | Runnable Quantization | TPS(t/s) | Latency(ms) | +|:----------:|:----------:|:------:|:------:| +| A30 | wint4 | 432.99 | 17396.92 | +| L20 | wint4
wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
+| H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
+| A100| wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
+| H800| wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59
54309.99 | + +> Devices not verified can still run if their RAM/VRAM meets the requirements. + + +### **Example**: Single-card wint4 with 32K context length +```shell +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 1 \ + --max-model-len 32768 \ + --max-num-seqs 256 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --kv-cache-ratio 0.8 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 1024 \ + --max-num-partial-prefills 3 \ + --max-long-partial-prefills 3 \ + --quantization wint4 \ + --enable-mm \ +``` +### **Example**: Dual-GPU Wint8 with 128K Context Length Configuration +```shell +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 2 \ + --max-model-len 131072 \ + --max-num-seqs 256 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --kv-cache-ratio 0.8 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 1024 \ + --max-num-partial-prefills 3 \ + --max-long-partial-prefills 3 \ + --quantization wint8 \ + --enable-mm \ +``` \ No newline at end of file diff --git a/docs/optimal_deployment/README.md b/docs/optimal_deployment/README.md new file mode 100644 index 0000000000..05dd760d67 --- /dev/null +++ b/docs/optimal_deployment/README.md @@ -0,0 +1,3 @@ +# Optimal Deployment + +- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md) \ No newline at end of file From 1fbaa59a69c564ead9c33b7a3388f037fc6d22db Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 14:51:56 +0800 Subject: [PATCH 2/5] add zh doc --- .../ERNIE-4.5-VL-28B-A3B-Paddle.md | 124 ++++++++++++++++++ docs/zh/optimal_deployment/README.md | 3 + 2 files changed, 127 insertions(+) create mode 100644 docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md create mode 100644 docs/zh/optimal_deployment/README.md diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md new file mode 100644 index 0000000000..882fe9a8ea --- /dev/null +++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -0,0 +1,124 @@ + +# ERNIE-4.5-VL-28B-A3B-Paddle + +**注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。 + +## 性能优化指南 + +为了帮助您在使用本模型时达到**最佳性能**,以下是一些需要调整参数以及我们的建议。请仔细阅读以下推荐和小贴士: + +### **上下文长度** +- **参数:** `--max-model-len` +- **描述:** 控制模型可处理的最大上下文长度。 +- **推荐:** 考虑到性能和内存占用的平衡,我们建议设置为**32k**(32768)。 +- **进阶:** 如果您的硬件允许,并且您需要更长的上下文长度,我们也支持**128k**(131072)长度的上下文。 + + ⚠️ 注:更长的上下文会显著增加GPU显存需求,设置更长的上下文之前确保硬件资源是满足的。 + +### **最大序列数量** +- **参数:** `--max-num-seqs` +- **描述:** 控制服务可以处理的最大序列数量,支持1~256。 +- **推荐:** 如果您不知道实际应用场景中请求的平均序列数量是多少,我们建议设置为**256**。如果您的应用场景中请求的平均序列数量明显少于256,我们建议设置为一个略大于平均值的较小值,以进一步优化显存占用以优化服务性能。 + +### **多图、多视频输入** +- **参数**:`--limit-mm-per-prompt` +- **描述**:我们的模型支持单次提示词(prompt)中输入多张图片和视频。请使用此参数限制每次请求的图片/视频数量,以确保资源高效利用。 +- **推荐**:我们建议将单次提示词(prompt)中的图片和视频数量均设置为100个,以平衡性能与内存占用。 + +### **性能调优** +> **chunked prefill** +- **参数:**:`--enable-chunked-prefill` +- **用处:** + + 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 + +- **其他配置**: + - `--max-num-batched-tokens`: + - `--max-num-partial-prefills`: + - `--max-long-partial-prefills`: + +> **上下文缓存** + +⚠️ 目前上下缓存的功能在多模态上还未支持。 + +### **量化精度** +- 
**参数:** `--quantization` + +- **已支持的精度类型:** + - wint4 (适合大多数用户) + - wint8 + - bfloat16 (未设置 `--quantization` 参数时,默认使用bfloat16) + +- **推荐:** + - 除非您有极其严格的精度要求,否则我们强烈建议使用wint4量化。这将显著降低内存占用并提升吞吐量。 + - 若需要稍高的精度,可尝试wint8。 + - 仅当您的应用场景对精度有极致要求时候才尝试使用bfloat16,因为它需要更多显存。 + +- **验证过的设备和部分测试数据:** + + | 设备 | 可运行的量化精度 | TPS(t/s) | Latency(ms) | + |:----------:|:----------:|:------:|:------:| + | A30 | wint4 | 432.99 | 17396.92 | + | L20 | wint4
wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
+ | H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
+ | A100| wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
+ | H800| wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59
54309.99 | + +> 没有验证过的设备和精度组合,只要内存和显存满足要求也是可以运行的。 + +### **其他配置** +> **gpu-memory-utilization** +- **参数:**`--gpu-memory-utilization` +- **用处:** + + 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 + +> **kv-cache-ratio** +- **参数:**`--kv-cache-ratio` +- **用处:** + + 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 + + +### **示例:** 单卡、wint4、32K上下文部署命令 +```shell +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 1 \ + --max-model-len 32768 \ + --max-num-seqs 256 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --kv-cache-ratio 0.8 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 1024 \ + --max-num-partial-prefills 3 \ + --max-long-partial-prefills 3 \ + --quantization wint4 \ + --enable-mm \ +``` +### **示例:** 双卡、wint8、128K上下文部署命令 +```shell +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 2 \ + --max-model-len 131072 \ + --max-num-seqs 256 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --kv-cache-ratio 0.8 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 1024 \ + --max-num-partial-prefills 3 \ + --max-long-partial-prefills 3 \ + --quantization wint8 \ + --enable-mm \ +``` \ No newline at end of file diff --git a/docs/zh/optimal_deployment/README.md b/docs/zh/optimal_deployment/README.md new file mode 100644 index 0000000000..1cb368d70c --- /dev/null +++ b/docs/zh/optimal_deployment/README.md @@ -0,0 +1,3 @@ +# 最佳实践 + +- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md) \ No newline at end of file From 5ff7765170d1b3dfbfdd23d3392ac9bd6f36f665 Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 16:22:13 +0800 Subject: [PATCH 3/5] modify context --- .../ERNIE-4.5-VL-28B-A3B-Paddle.md | 43 +++++++++++-------- .../ERNIE-4.5-VL-28B-A3B-Paddle.md | 32 ++++++-------- 2 files changed, 38 insertions(+), 37 deletions(-) diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 37c47a2b43..2cc436b2e6 100644 --- a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -10,11 +10,16 @@ To help you achieve the **best performance** with our model, here are several im ### **Context Length** - **Parameter**: `--max-model-len` - **Description**: Controls the maximum context length the model can process. -- **Recommendation**: We suggest setting this to **32k tokens** for balanced performance and memory usage. -- **Advanced**: If your hardware allows and you need even longer contexts, you can set it up to **128k tokens**. +- **Recommendation**: We suggest setting this to **32k tokens** (32768) for balanced performance and memory usage. +- **Advanced**: If your hardware allows and you need even longer contexts, you can set it up to **128k tokens** (131072). ⚠️ Note: Longer contexts require significantly more GPU memory. Please ensure your hardware is sufficient before increasing this value. +### **Maximum Sequence Number** +- **Parameter**: `--max-num-seqs` +- **Description**: Controls the maximum number of sequences the service can handle, supports 1~256. 
+- **Recommendation**: If you don't know the average number of sequences in your actual application scenario, we recommend setting it to **256**. If the average number of sequences in your application scenario is significantly less than 256, we recommend setting it to a slightly higher value than the average to further optimize memory usage and service performance. + ### **Multi-Image & Multi-Video Input** - **Parameter**: `--limit-mm-per-prompt` - **Description**: Our model supports multiple images and videos per prompt. Use this parameter to limit the number of images and videos per request, ensuring efficient resource utilization. @@ -25,18 +30,13 @@ To help you achieve the **best performance** with our model, here are several im - **Parameter**: `--enable-chunked-prefill` - **Why enable?** - Enabling chunked prefill can **reduce peak memory usage** and **increase inference speed**. + Enabling chunked prefill can **reduce peak memory usage** and **increase throughput**. - **Additional options**: - - --max-num-batched-tokens - - --max-num-partial-prefills - - --max-long-partial-prefills -- **Tip**: - - The detailed workings of these auxiliary parameters are complex—feel free to use the example values provided in our documentation or scripts. + - `--max-num-batched-tokens`: Limit the maximum token count per chunk, with a recommended setting of 1,024. > **prefix caching** -⚠️ Note: Prefix caching is currently not supported in multi-modal mode. +⚠️ Prefix caching is currently not supported in multi-modal mode. ### **Quantization Precision**: - **Parameter**: `--quantization` @@ -53,7 +53,7 @@ Only use bfloat16 if your use case demands the highest possible accuracy, as it - **Verified Devices and Performance** -| Devices | Runnable Quantization | TPS(t/s) | Latency(ms) | +| Devices | Runnable Quantization | TPS(tok/s) | Latency(ms) | |:----------:|:----------:|:------:|:------:| | A30 | wint4 | 432.99 | 17396.92 | | L20 | wint4
wint8 | 3311.34<br>2423.36 | 46566.81
60790.91 | @@ -61,9 +61,20 @@ Only use bfloat16 if your use case demands the highest possible accuracy, as it | A100| wint4
wint8
bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78
87448.57 | | H800| wint4
wint8
bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59
54309.99 | -> Devices not verified can still run if their RAM/VRAM meets the requirements. +> ⚠️ Note: Devices not verified can still run if their CPU/GPU memory meets the requirements. + +### **Other Configurations** +> **gpu-memory-utilization** +- **Parameter**: `--gpu-memory-utilization` +- **Usage**: Controls the available GPU memory allocated for FastDeploy service initialization, with a default value of 0.9 (reserving 10% of GPU memory as buffer). +- **Recommendation**: It is recommended to set it to 0.9 on Nvidia Ampere GPUs, and to 0.8–0.9 on Hopper GPUs. If you encounter an out-of-memory error during service stress testing, you can try lowering this value. +> **kv-cache-ratio** +- **Parameter**: `--kv-cache-ratio` +- **Usage**: It is used to control the allocation ratio of GPU memory for the kv cache. The default value is 0.75, meaning that 75% of the kv cache memory is allocated to the input. +- **Recommendation**: Theoretically, the optimal value should be set to $\frac{average\ input\ length}{average\ input\ length+average\ output\ length}$ for your application scenario. If you are unsure, you can keep the default value. + ### **Example**: Single-card wint4 with 32K context length ```shell python -m fastdeploy.entrypoints.openai.api_server \ @@ -77,11 +88,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ --limit-mm-per-prompt '{"image": 100, "video": 100}' \ --reasoning-parser ernie-45-vl \ --gpu-memory-utilization 0.9 \ - --kv-cache-ratio 0.8 \ + --kv-cache-ratio 0.75 \ --enable-chunked-prefill \ --max-num-batched-tokens 1024 \ - --max-num-partial-prefills 3 \ - --max-long-partial-prefills 3 \ --quantization wint4 \ --enable-mm \ ``` @@ -98,11 +107,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ --limit-mm-per-prompt '{"image": 100, "video": 100}' \ --reasoning-parser ernie-45-vl \ --gpu-memory-utilization 0.9 \ - --kv-cache-ratio 0.8 \ + --kv-cache-ratio 0.75 \ --enable-chunked-prefill \ --max-num-batched-tokens 1024 \ - --max-num-partial-prefills 3 \ - --max-long-partial-prefills 3 \ --quantization wint8 \ --enable-mm \ ``` \ No newline at end of file diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 882fe9a8ea..768394e5cb 100644 --- a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -30,12 +30,10 @@ - **参数:**:`--enable-chunked-prefill` - **用处:** - 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 + 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**。 - **其他配置**: - - `--max-num-batched-tokens`: - - `--max-num-partial-prefills`: - - `--max-long-partial-prefills`: + - `--max-num-batched-tokens`:限制每个chunk的最大token数量,推荐设置为1024。 > **上下文缓存** @@ -56,7 +54,7 @@ - **验证过的设备和部分测试数据:** - | 设备 | 可运行的量化精度 | TPS(t/s) | Latency(ms) | + | 设备 | 可运行的量化精度 | TPS(tok/s) | Latency(ms) | |:----------:|:----------:|:------:|:------:| | A30 | wint4 | 432.99 | 17396.92 | | L20 | wint4
wint8 | 3311.34<br>2423.36 | 46566.81
60790.91 | @@ -64,20 +62,20 @@ | A100| wint4
wint8
bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78
87448.57 | | H800| wint4
wint8
bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59
54309.99 | -> 没有验证过的设备和精度组合,只要内存和显存满足要求也是可以运行的。 +> ⚠️ 注:没有验证过的设备和精度组合,只要内存和显存满足要求也是可以运行的。 ### **其他配置** > **gpu-memory-utilization** -- **参数:**`--gpu-memory-utilization` -- **用处:** - - 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 +- **参数:** `--gpu-memory-utilization` +- **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。 +- **推荐:** A卡上推荐0.9,H卡上推荐0.8~0.9。如果服务压测时提示显存不足,可以尝试调低该值。 + > **kv-cache-ratio** - **参数:**`--kv-cache-ratio` -- **用处:** - - 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 +- **用处:** 用于控制 kv cache 显存的分配比例,默认0.75,即75%的 kv cache 显存给输入。 +- **推荐:** 理论最佳值应设置为应用场景的$\frac{平均输入长度}{平均输入长度+平均输出长度}$,如果您无法确定,保持默认值即可。 + ### **示例:** 单卡、wint4、32K上下文部署命令 @@ -93,11 +91,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ --limit-mm-per-prompt '{"image": 100, "video": 100}' \ --reasoning-parser ernie-45-vl \ --gpu-memory-utilization 0.9 \ - --kv-cache-ratio 0.8 \ + --kv-cache-ratio 0.75 \ --enable-chunked-prefill \ --max-num-batched-tokens 1024 \ - --max-num-partial-prefills 3 \ - --max-long-partial-prefills 3 \ --quantization wint4 \ --enable-mm \ ``` @@ -114,11 +110,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ --limit-mm-per-prompt '{"image": 100, "video": 100}' \ --reasoning-parser ernie-45-vl \ --gpu-memory-utilization 0.9 \ - --kv-cache-ratio 0.8 \ + --kv-cache-ratio 0.75 \ --enable-chunked-prefill \ --max-num-batched-tokens 1024 \ - --max-num-partial-prefills 3 \ - --max-long-partial-prefills 3 \ --quantization wint8 \ --enable-mm \ ``` \ No newline at end of file From bd7705d59049c5b4ee7e16919945269cdbfa39f3 Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 16:24:44 +0800 Subject: [PATCH 4/5] modify context --- docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md | 1 - docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md | 3 --- 2 files changed, 4 deletions(-) diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 2cc436b2e6..69982602bd 100644 --- a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -69,7 +69,6 @@ Only use bfloat16 if your use case demands the highest possible accuracy, as it - **Usage**: Controls the available GPU memory allocated for FastDeploy service initialization, with a default value of 0.9 (reserving 10% of GPU memory as buffer). - **Recommendation**: It is recommended to set it to 0.9 on Nvidia Ampere GPUs, and to 0.8–0.9 on Hopper GPUs. If you encounter an out-of-memory error during service stress testing, you can try lowering this value. - > **kv-cache-ratio** - **Parameter**: `--kv-cache-ratio` - **Usage**: It is used to control the allocation ratio of GPU memory for the kv cache. The default value is 0.75, meaning that 75% of the kv cache memory is allocated to the input. 
diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 768394e5cb..56a001977e 100644 --- a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -70,14 +70,11 @@ - **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。 - **推荐:** A卡上推荐0.9,H卡上推荐0.8~0.9。如果服务压测时提示显存不足,可以尝试调低该值。 - > **kv-cache-ratio** - **参数:**`--kv-cache-ratio` - **用处:** 用于控制 kv cache 显存的分配比例,默认0.75,即75%的 kv cache 显存给输入。 - **推荐:** 理论最佳值应设置为应用场景的$\frac{平均输入长度}{平均输入长度+平均输出长度}$,如果您无法确定,保持默认值即可。 - - ### **示例:** 单卡、wint4、32K上下文部署命令 ```shell python -m fastdeploy.entrypoints.openai.api_server \ From c24a9dfdc792170306cf04f9425b5bba8fe9d358 Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 16:37:20 +0800 Subject: [PATCH 5/5] modify context --- docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md | 4 ++-- docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 69982602bd..34ad118a41 100644 --- a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -67,7 +67,7 @@ Only use bfloat16 if your use case demands the highest possible accuracy, as it > **gpu-memory-utilization** - **Parameter**: `--gpu-memory-utilization` - **Usage**: Controls the available GPU memory allocated for FastDeploy service initialization, with a default value of 0.9 (reserving 10% of GPU memory as buffer). -- **Recommendation**: It is recommended to set it to 0.9 on Nvidia Ampere GPUs, and to 0.8–0.9 on Hopper GPUs. If you encounter an out-of-memory error during service stress testing, you can try lowering this value. +- **Recommendation**: It is recommended to set it to 0.9 (default). If you encounter an out-of-memory error during service stress testing, you can try lowering this value. > **kv-cache-ratio** - **Parameter**: `--kv-cache-ratio` @@ -93,7 +93,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --quantization wint4 \ --enable-mm \ ``` -### **Example**: Dual-GPU Wint8 with 128K Context Length Configuration +### **Example**: Dual-GPU wint8 with 128K context length ```shell python -m fastdeploy.entrypoints.openai.api_server \ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 56a001977e..0641f2adc1 100644 --- a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -68,7 +68,7 @@ > **gpu-memory-utilization** - **参数:** `--gpu-memory-utilization` - **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。 -- **推荐:** A卡上推荐0.9,H卡上推荐0.8~0.9。如果服务压测时提示显存不足,可以尝试调低该值。 +- **推荐:** 推荐使用默认值0.9。如果服务压测时提示显存不足,可以尝试调低该值。 > **kv-cache-ratio** - **参数:**`--kv-cache-ratio`