diff --git a/README.md b/README.md
index b48c17a99e..ab7e17ffa3 100644
--- a/README.md
+++ b/README.md
@@ -61,6 +61,7 @@ Learn how to use FastDeploy through our documentation:
- [Offline Inference Development](./docs/offline_inference.md)
- [Online Service Deployment](./docs/online_serving/README.md)
- [Full Supported Models List](./docs/supported_models.md)
+- [Optimal Deployment](./docs/optimal_deployment/README.md)
## Supported Models
diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
new file mode 100644
index 0000000000..34ad118a41
--- /dev/null
+++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -0,0 +1,114 @@
+
+# ERNIE-4.5-VL-28B-A3B-Paddle
+
+**Note**: To enable multi-modal support, add the `--enable-mm` flag to your configuration.
+
+## Performance Optimization Guide
+
+To help you achieve the **best performance** with our model, here are several important parameters you may want to adjust. Please read through the following recommendations and tips:
+
+### **Context Length**
+- **Parameter**: `--max-model-len`
+- **Description**: Controls the maximum context length the model can process.
+- **Recommendation**: We suggest setting this to **32k tokens** (32768) for balanced performance and memory usage.
+- **Advanced**: If your hardware allows and you need even longer contexts, you can set it up to **128k tokens** (131072).
+
+  ⚠️ Note: Longer contexts require significantly more GPU memory. Make sure your hardware has enough memory before increasing this value.
+
+### **Maximum Sequence Number**
+- **Parameter**: `--max-num-seqs`
+- **Description**: Controls the maximum number of sequences the service can handle concurrently. Supported range: 1–256.
+- **Recommendation**: If you don't know the average number of concurrent sequences in your workload, we recommend setting it to **256**. If the average in your workload is significantly lower than 256, set it to a value slightly above that average to further reduce memory usage and improve service performance.
+
+### **Multi-Image & Multi-Video Input**
+- **Parameter**: `--limit-mm-per-prompt`
+- **Description**: Our model supports multiple images and videos per prompt. Use this parameter to limit the number of images and videos per request, ensuring efficient resource utilization.
+- **Recommendation**: We suggest setting this to **100 images** and **100 videos** per prompt for balanced performance and memory usage.
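+
+The limit is passed as a JSON object that maps each modality to its per-request cap. Below is a minimal launch sketch using the recommended values (other flags are omitted here for brevity; see the full deployment commands at the end of this guide):
+
+```shell
+# Cap every request at 100 images and 100 videos
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+  --enable-mm \
+  --limit-mm-per-prompt '{"image": 100, "video": 100}'
+```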
+
+### **Optimization Recommendations**
+> **chunked prefill**
+- **Parameter**: `--enable-chunked-prefill`
+- **Why enable?**
+
+ Enabling chunked prefill can **reduce peak memory usage** and **increase throughput**.
+- **Additional options**:
+  - `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended setting is 1024.
+
+> **prefix caching**
+
+⚠️ Prefix caching is currently not supported in multi-modal mode.
+
+### **Quantization Precision**
+- **Parameter**: `--quantization`
+
+- **Supported Types**:
+ - wint4 (recommended for most users)
+ - wint8
+  - bfloat16 (used by default when `--quantization` is not set)
+
+- **Recommendation**:
+Unless you have extremely strict precision requirements, we strongly recommend using wint4 quantization. This will dramatically reduce memory footprint and improve throughput.
+If you need slightly higher precision, try wint8.
+Only use bfloat16 if your use case demands the highest possible accuracy, as it requires much more memory.
+
+- **Verified Devices and Performance**
+
+| Device | Verified Quantization | TPS (tokens/s) | Latency (ms) |
+|:----------:|:----------:|:------:|:------:|
+| A30 | wint4 | 432.99 | 17396.92 |
+| L20 | wint4<br>wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
+| H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
+| A100 | wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
+| H800 | wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59<br>54309.99 |
+
+> ⚠️ Note: Device/precision combinations that have not been verified can still run the model as long as CPU/GPU memory requirements are met.
+
+### **Other Configurations**
+> **gpu-memory-utilization**
+- **Parameter**: `--gpu-memory-utilization`
+- **Usage**: Controls the fraction of GPU memory made available to the FastDeploy service during initialization. The default is 0.9, which reserves 10% of GPU memory as a buffer.
+- **Recommendation**: Keep the default value of 0.9. If you encounter out-of-memory errors during service stress testing, try lowering this value.
+
+> **kv-cache-ratio**
+- **Parameter**: `--kv-cache-ratio`
+- **Usage**: Controls how the GPU memory reserved for the KV cache is allocated. The default is 0.75, meaning 75% of the KV cache memory is allocated to the input.
+- **Recommendation**: Theoretically, the optimal value should be set to $\frac{average\ input\ length}{average\ input\ length+average\ output\ length}$ for your application scenario. If you are unsure, you can keep the default value.
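+  For example, if requests in your workload average 3,000 input tokens and 1,000 output tokens, the theoretical optimum is $\frac{3000}{3000+1000}=0.75$, matching the default.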
+
+### **Example**: Single-GPU wint4 with 32K context length
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 1 \
+ --max-model-len 32768 \
+ --max-num-seqs 256 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.9 \
+ --kv-cache-ratio 0.75 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 1024 \
+ --quantization wint4 \
+  --enable-mm
+```
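+
+Once the service is running, you can send it an OpenAI-compatible chat completion request with image input. The sketch below assumes the standard chat completions content-part schema; the image URL is a placeholder to replace with your own:
+
+```shell
+# Smoke test against the service launched above (port 8180)
+curl -s http://localhost:8180/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
+        "messages": [
+          {
+            "role": "user",
+            "content": [
+              {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
+              {"type": "text", "text": "Describe this image."}
+            ]
+          }
+        ]
+      }'
+```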
+### **Example**: Dual-GPU wint8 with 128K context length
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 2 \
+ --max-model-len 131072 \
+ --max-num-seqs 256 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.9 \
+ --kv-cache-ratio 0.75 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 1024 \
+ --quantization wint8 \
+  --enable-mm
+```
\ No newline at end of file
diff --git a/docs/optimal_deployment/README.md b/docs/optimal_deployment/README.md
new file mode 100644
index 0000000000..05dd760d67
--- /dev/null
+++ b/docs/optimal_deployment/README.md
@@ -0,0 +1,3 @@
+# Optimal Deployment
+
+- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
\ No newline at end of file
diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
new file mode 100644
index 0000000000..0641f2adc1
--- /dev/null
+++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -0,0 +1,115 @@
+
+# ERNIE-4.5-VL-28B-A3B-Paddle
+
+**Note**: Multi-modal service deployment requires adding the `--enable-mm` flag to your configuration.
+
+## Performance Optimization Guide
+
+To help you achieve the **best performance** with this model, here are the parameters you may need to adjust, along with our recommendations. Please read the following tips carefully:
+
+### **Context Length**
+- **Parameter**: `--max-model-len`
+- **Description**: Controls the maximum context length the model can process.
+- **Recommendation**: For a good balance between performance and memory usage, we suggest setting this to **32k** (32768).
+- **Advanced**: If your hardware allows and you need longer contexts, a **128k** (131072) context length is also supported.
+
+  ⚠️ Note: Longer contexts significantly increase GPU memory requirements. Make sure your hardware has enough memory before increasing this value.
+
+### **Maximum Sequence Number**
+- **Parameter**: `--max-num-seqs`
+- **Description**: Controls the maximum number of sequences the service can handle concurrently. Supported range: 1–256.
+- **Recommendation**: If you don't know the average number of concurrent sequences in your workload, we recommend setting it to **256**. If the average in your workload is significantly lower than 256, set it to a value slightly above that average to further reduce memory usage and improve service performance.
+
+### **Multi-Image & Multi-Video Input**
+- **Parameter**: `--limit-mm-per-prompt`
+- **Description**: The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images and videos per request to ensure efficient resource utilization.
+- **Recommendation**: We suggest limiting both images and videos to **100** per prompt to balance performance and memory usage.
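+
+The limit is passed as a JSON object mapping each modality to its per-request cap. A minimal launch sketch with the recommended values (remaining flags omitted; see the full deployment commands at the end of this guide):
+
+```shell
+# Allow at most 100 images and 100 videos per request
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+  --enable-mm \
+  --limit-mm-per-prompt '{"image": 100, "video": 100}'
+```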
+
+### **Performance Tuning**
+> **chunked prefill**
+- **Parameter**: `--enable-chunked-prefill`
+- **Why enable?**
+
+  Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
+
+- **Additional options**:
+  - `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended setting is 1024.
+
+> **prefix caching**
+
+⚠️ Prefix caching is currently not supported in multi-modal mode.
+
+### **Quantization Precision**
+- **Parameter**: `--quantization`
+
+- **Supported precision types**:
+  - wint4 (suitable for most users)
+  - wint8
+  - bfloat16 (used by default when `--quantization` is not set)
+
+- **Recommendation**:
+  - Unless you have extremely strict precision requirements, we strongly recommend wint4 quantization. It significantly reduces memory usage and improves throughput.
+  - If you need slightly higher precision, try wint8.
+  - Only use bfloat16 when your use case demands the highest possible accuracy, as it requires considerably more GPU memory.
+
+- **Verified devices and sample benchmark results**:
+
+  | Device | Verified Quantization | TPS (tokens/s) | Latency (ms) |
+  |:----------:|:----------:|:------:|:------:|
+  | A30 | wint4 | 432.99 | 17396.92 |
+  | L20 | wint4<br>wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
+  | H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
+  | A100 | wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
+  | H800 | wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59<br>54309.99 |
+
+> ⚠️ Note: Device/precision combinations that have not been verified can still run the model as long as CPU/GPU memory requirements are met.
+
+### **Other Configurations**
+> **gpu-memory-utilization**
+- **Parameter**: `--gpu-memory-utilization`
+- **Usage**: Controls the fraction of GPU memory made available to the FastDeploy service during initialization. The default is 0.9, which reserves 10% of GPU memory as a buffer.
+- **Recommendation**: Keep the default value of 0.9. If the service reports insufficient GPU memory during stress testing, try lowering this value.
+
+> **kv-cache-ratio**
+- **Parameter**: `--kv-cache-ratio`
+- **Usage**: Controls how the GPU memory reserved for the KV cache is allocated. The default is 0.75, meaning 75% of the KV cache memory is allocated to the input.
+- **Recommendation**: Theoretically, the optimal value is $\frac{average\ input\ length}{average\ input\ length+average\ output\ length}$ for your workload. If you are unsure, keep the default value.
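+  For example, with an average of 3,000 input tokens and 1,000 output tokens per request, the theoretical optimum is $\frac{3000}{3000+1000}=0.75$, which is exactly the default.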
+
+### **Example**: Single-GPU wint4 deployment command with a 32K context length
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 1 \
+ --max-model-len 32768 \
+ --max-num-seqs 256 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.9 \
+ --kv-cache-ratio 0.75 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 1024 \
+ --quantization wint4 \
+  --enable-mm
+```
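+
+After the service starts, you can verify multi-modal input with a request to the OpenAI-compatible endpoint. This is only a minimal sketch assuming the standard chat completions content-part schema; replace the placeholder image URL with your own:
+
+```shell
+# Quick end-to-end check against the service launched above (port 8180)
+curl -s http://localhost:8180/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
+        "messages": [
+          {
+            "role": "user",
+            "content": [
+              {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
+              {"type": "text", "text": "Describe this image."}
+            ]
+          }
+        ]
+      }'
+```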
+### **Example**: Dual-GPU wint8 deployment command with a 128K context length
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 2 \
+ --max-model-len 131072 \
+ --max-num-seqs 256 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.9 \
+ --kv-cache-ratio 0.75 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 1024 \
+ --quantization wint8 \
+  --enable-mm
+```
\ No newline at end of file
diff --git a/docs/zh/optimal_deployment/README.md b/docs/zh/optimal_deployment/README.md
new file mode 100644
index 0000000000..1cb368d70c
--- /dev/null
+++ b/docs/zh/optimal_deployment/README.md
@@ -0,0 +1,3 @@
+# Best Practices
+
+- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
\ No newline at end of file