
[Docs] Optimal Deployment #2768


Open · wants to merge 5 commits into base: develop
1 change: 1 addition & 0 deletions README.md
@@ -61,6 +61,7 @@ Learn how to use FastDeploy through our documentation:
- [Offline Inference Development](./docs/offline_inference.md)
- [Online Service Deployment](./docs/online_serving/README.md)
- [Full Supported Models List](./docs/supported_models.md)
- [Optimal Deployment](./docs/optimal_deployment/README.md)

## Supported Models

114 changes: 114 additions & 0 deletions docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -0,0 +1,114 @@

# ERNIE-4.5-VL-28B-A3B-Paddle

**Note**: To enable multi-modal support, add the `--enable-mm` flag to your configuration.

## Performance Optimization Guide

To help you achieve the **best performance** with our model, here are several important parameters you may want to adjust. Please read through the following recommendations and tips:

### **Context Length**
- **Parameter**: `--max-model-len`
- **Description**: Controls the maximum context length the model can process.
- **Recommendation**: We suggest setting this to **32k tokens** (32768) for balanced performance and memory usage.
- **Advanced**: If your hardware allows and you need even longer contexts, you can set it up to **128k tokens** (131072).

⚠️ Note: Longer contexts require significantly more GPU memory. Please ensure your hardware is sufficient before increasing this value.

### **Maximum Sequence Number**
- **Parameter**: `--max-num-seqs`
- **Description**: Controls the maximum number of sequences the service can handle; valid range is 1–256.
- **Recommendation**: If you are unsure of the average number of concurrent sequences in your workload, we recommend setting it to **256**. If your workload averages significantly fewer than 256 sequences, set it slightly above that average to further reduce memory usage and improve service performance.

### **Multi-Image & Multi-Video Input**
- **Parameter**: `--limit-mm-per-prompt`
- **Description**: Our model supports multiple images and videos per prompt. Use this parameter to limit the number of images and videos per request, ensuring efficient resource utilization.
- **Recommendation**: We suggest setting this to **100 images** and **100 videos** per prompt for balanced performance and memory usage.
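Once the service is up (see the launch examples at the end of this document), multi-image requests go through the OpenAI-compatible chat endpoint. The sketch below is illustrative only: the payload layout follows the usual OpenAI-style multi-modal schema, and the image URLs and prompt are placeholders, not part of this PR.

```shell
# Illustrative multi-image request to a locally launched service (port 8180, as in
# the launch examples below). Payload layout and image URLs are assumptions for
# demonstration; adjust them to your own data.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
        {"type": "text", "text": "Compare these two images."}
      ]
    }]
  }'
```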

### **Optimization Recommendations**
> **chunked prefill**
- **Parameter**: `--enable-chunked-prefill`
- **Why enable?**

Enabling chunked prefill can **reduce peak memory usage** and **increase throughput**.
- **Additional options**:
- `--max-num-batched-tokens`: Limit the maximum token count per chunk, with a recommended setting of 1,024.
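For instance, with `--max-num-batched-tokens 1024`, the prefill of a 32K-token prompt (32768 tokens) is split into roughly 32768 / 1024 = 32 chunks rather than being processed in a single pass.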

> **prefix caching**

⚠️ Prefix caching is currently not supported in multi-modal mode.

### **Quantization Precision**
- **Parameter**: `--quantization`

- **Supported Types**:
- wint4 (recommended for most users)
- wint8
- bfloat16 (default if no quantization parameter is set)

- **Recommendation**:
  - Unless you have extremely strict precision requirements, we strongly recommend wint4 quantization, which dramatically reduces memory footprint and improves throughput.
  - If you need slightly higher precision, try wint8.
  - Use bfloat16 only if your use case demands the highest possible accuracy, as it requires much more memory.

- **Verified Devices and Performance**

| Device | Supported Quantization | TPS (tok/s) | Latency (ms) |
|:----------:|:----------:|:------:|:------:|
| A30 | wint4 | 432.99 | 17396.92 |
| L20 | wint4<br>wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
| H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
| A100 | wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
| H800 | wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59<br>54309.99 |

> **Reviewer comment (Collaborator), on the H20 row:** Why is bfloat16 the fastest here?

> ⚠️ Note: Devices that have not been verified can still run the model as long as their CPU and GPU memory meet the requirements.

### **Other Configurations**
> **gpu-memory-utilization**
- **Parameter**: `--gpu-memory-utilization`
- **Usage**: Controls the fraction of GPU memory made available to the FastDeploy service at initialization; the default is 0.9 (reserving 10% of GPU memory as a buffer).
- **Recommendation**: It is recommended to set it to 0.9 (default). If you encounter an out-of-memory error during service stress testing, you can try lowering this value.
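For example, on an 80 GB GPU the default of 0.9 makes roughly 72 GB available to the service and leaves about 8 GB as a buffer.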

> **kv-cache-ratio**
- **Parameter**: `--kv-cache-ratio`
- **Usage**: Controls how GPU memory for the KV cache is split; the default value of 0.75 means 75% of the KV cache memory is allocated to the input.
- **Recommendation**: Theoretically, the optimal value should be set to $\frac{average\ input\ length}{average\ input\ length+average\ output\ length}$ for your application scenario. If you are unsure, you can keep the default value.
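For example, if requests in your scenario average 3,000 input tokens and 1,000 output tokens, the suggested value would be $\frac{3000}{3000+1000} = 0.75$, which matches the default.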

### **Example**: Single-card wint4 with 32K context length
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 256 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--kv-cache-ratio 0.75 \
--enable-chunked-prefill \
--max-num-batched-tokens 1024 \
--quantization wint4 \
--enable-mm
```
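
Once this service is running, a quick text-only request can confirm it is responding. This is a minimal sketch assuming the standard OpenAI-compatible chat schema; the prompt is a placeholder.

```shell
# Minimal smoke test against the single-card deployment above (port 8180).
# Assumes the standard OpenAI-compatible /v1/chat/completions schema.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    "messages": [{"role": "user", "content": "Describe what you can do."}]
  }'
```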
### **Example**: Dual-GPU wint8 with 128K context length
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 256 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--kv-cache-ratio 0.75 \
--enable-chunked-prefill \
--max-num-batched-tokens 1024 \
--quantization wint8 \
--enable-mm
```
3 changes: 3 additions & 0 deletions docs/optimal_deployment/README.md
@@ -0,0 +1,3 @@
# Optimal Deployment

- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
115 changes: 115 additions & 0 deletions docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -0,0 +1,115 @@

# ERNIE-4.5-VL-28B-A3B-Paddle

**Note:** To deploy the multi-modal service, add the `--enable-mm` flag to your configuration.

## Performance Optimization Guide

To help you achieve the **best performance** with this model, here are the parameters worth adjusting along with our recommendations. Please read the following tips carefully:

### **Context Length**
- **Parameter:** `--max-model-len`
- **Description:** Controls the maximum context length the model can process.
- **Recommendation:** For a good balance of performance and memory usage, we suggest setting this to **32k** (32768).
- **Advanced:** If your hardware allows and you need longer contexts, lengths of up to **128k** (131072) are also supported.

⚠️ Note: Longer contexts significantly increase GPU memory requirements. Make sure your hardware is sufficient before increasing this value.

### **Maximum Sequence Count**
- **Parameter:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can handle; valid range is 1–256.
- **Recommendation:** If you are unsure of the average number of concurrent sequences in your workload, we recommend setting it to **256**. If your workload averages significantly fewer than 256 sequences, set it slightly above that average to further reduce memory usage and improve service performance.

### **Multi-Image & Multi-Video Input**
- **Parameter:** `--limit-mm-per-prompt`
- **Description:** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request to ensure efficient resource utilization.
- **Recommendation:** We suggest limiting both images and videos to 100 per prompt to balance performance and memory usage.

### **Performance Tuning**
> **chunked prefill**
- **Parameter:** `--enable-chunked-prefill`
- **Why enable?** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **increase service throughput**.
- **Additional options:**
  - `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; a setting of 1024 is recommended.

> **prefix caching**

⚠️ Prefix caching is currently not supported in multi-modal mode.

### **Quantization Precision**
- **Parameter:** `--quantization`

- **Supported types:**
  - wint4 (suitable for most users)
  - wint8
  - bfloat16 (used by default when `--quantization` is not set)

- **Recommendation:**
  - Unless you have extremely strict precision requirements, we strongly recommend wint4 quantization, which significantly reduces memory usage and improves throughput.
  - If you need slightly higher precision, try wint8.
  - Use bfloat16 only if your use case demands the highest possible accuracy, as it requires much more GPU memory.

- **Verified devices and sample test data:**

| Device | Supported Quantization | TPS (tok/s) | Latency (ms) |
|:----------:|:----------:|:------:|:------:|
| A30 | wint4 | 432.99 | 17396.92 |
| L20 | wint4<br>wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
| H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
| A100 | wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
| H800 | wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59<br>54309.99 |

> ⚠️ Note: Device/precision combinations that have not been verified can still run the model as long as CPU and GPU memory requirements are met.

### **Other Configurations**
> **gpu-memory-utilization**
- **Parameter:** `--gpu-memory-utilization`
- **Usage:** Controls the fraction of GPU memory available to the FastDeploy service at initialization; the default is 0.9 (reserving 10% of GPU memory as a buffer).
- **Recommendation:** Use the default of 0.9. If you encounter out-of-memory errors during service stress testing, try lowering this value.

> **kv-cache-ratio**
- **Parameter:** `--kv-cache-ratio`
- **Usage:** Controls how GPU memory for the KV cache is split; the default of 0.75 means 75% of the KV cache memory is allocated to the input.
- **Recommendation:** Theoretically, the optimal value is $\frac{average\ input\ length}{average\ input\ length+average\ output\ length}$ for your workload. If you are unsure, keep the default.

### **Example:** Single-GPU, wint4, 32K-context launch command
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 256 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--kv-cache-ratio 0.75 \
--enable-chunked-prefill \
--max-num-batched-tokens 1024 \
--quantization wint4 \
--enable-mm
```
### **Example:** Dual-GPU, wint8, 128K-context launch command
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 256 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--kv-cache-ratio 0.75 \
--enable-chunked-prefill \
--max-num-batched-tokens 1024 \
--quantization wint8 \
--enable-mm
```
3 changes: 3 additions & 0 deletions docs/zh/optimal_deployment/README.md
@@ -0,0 +1,3 @@
# Best Practices

- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)