[Docs] Optimal Deployment #2768
ming1753 wants to merge 5 commits into PaddlePaddle:develop from ming1753:docs
@@ -0,0 +1,114 @@

# ERNIE-4.5-VL-28B-A3B-Paddle

**Note**: To enable multi-modal support, add the `--enable-mm` flag to your configuration.

## Performance Optimization Guide

To help you get the **best performance** out of this model, here are the key parameters you may want to adjust, along with our recommendations and tips:

### **Context Length**
- **Parameter**: `--max-model-len`
- **Description**: Controls the maximum context length the model can process.
- **Recommendation**: We suggest **32k tokens** (32768) for balanced performance and memory usage.
- **Advanced**: If your hardware allows and you need even longer contexts, you can set it up to **128k tokens** (131072).

⚠️ Note: Longer contexts require significantly more GPU memory. Make sure your hardware is sufficient before increasing this value.

### **Maximum Sequence Count**
- **Parameter**: `--max-num-seqs`
- **Description**: Controls the maximum number of sequences the service can handle concurrently; the supported range is 1~256.
- **Recommendation**: If you don't know the average number of concurrent sequences in your application, we recommend setting it to **256**. If the average in your scenario is well below 256, set it slightly above the average to further reduce memory usage and improve service performance.

### **Multi-Image and Multi-Video Input**
- **Parameter**: `--limit-mm-per-prompt`
- **Description**: The model supports multiple images and videos per prompt. Use this parameter to limit the number of images and videos per request, ensuring efficient resource utilization.
- **Recommendation**: We suggest allowing up to **100 images** and **100 videos** per prompt for balanced performance and memory usage; each image or video attached to a request such as the one sketched below counts against this limit.
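
A minimal sketch of a multi-image request, assuming the standard OpenAI-compatible `/v1/chat/completions` route and the port used in the example commands at the end of this page; the image URLs and prompt are placeholders:

```shell
# Hypothetical two-image request; each attached image counts against
# the per-prompt limit configured via --limit-mm-per-prompt.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
        {"type": "text", "text": "Compare these two images."}
      ]
    }]
  }'
```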

### **Optimization Recommendations**
> **Chunked prefill**
- **Parameter**: `--enable-chunked-prefill`
- **Why enable?**

  Enabling chunked prefill can **reduce peak memory usage** and **increase throughput**.
- **Additional options**:
  - `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended setting is 1024.

> **Prefix caching**

⚠️ Prefix caching is currently not supported in multi-modal mode.

### **Quantization Precision**
- **Parameter**: `--quantization`

- **Supported types**:
  - wint4 (recommended for most users)
  - wint8
  - bfloat16 (the default when `--quantization` is not set)

- **Recommendation**:
  Unless you have extremely strict precision requirements, we strongly recommend wint4 quantization: it dramatically reduces the memory footprint and improves throughput.
  If you need slightly higher precision, try wint8.
  Only use bfloat16 if your use case demands the highest possible accuracy, as it requires much more memory.

- **Verified devices and performance**

| Device | Runnable Quantization | TPS (tok/s) | Latency (ms) |
|:----------:|:----------:|:------:|:------:|
| A30 | wint4 | 432.99 | 17396.92 |
| L20 | wint4<br>wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
| H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
| A100 | wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
| H800 | wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59<br>54309.99 |

> ⚠️ Note: Unverified devices can still run the model as long as their CPU/GPU memory meets the requirements.

### **Other Configurations**
> **gpu-memory-utilization**
- **Parameter**: `--gpu-memory-utilization`
- **Usage**: Controls the fraction of GPU memory available to FastDeploy for service initialization. The default is 0.9, which reserves 10% of GPU memory as a buffer.
- **Recommendation**: Keep the default of 0.9. If you encounter out-of-memory errors during service stress testing, try lowering this value.

> **kv-cache-ratio**
- **Parameter**: `--kv-cache-ratio`
- **Usage**: Controls how KV-cache memory is split between input and output. The default is 0.75, meaning 75% of the KV-cache memory is allocated to the input.
- **Recommendation**: Theoretically, the optimal value is $\frac{average\ input\ length}{average\ input\ length+average\ output\ length}$ for your application scenario, as in the sketch below. If you are unsure, keep the default.
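
For instance, a back-of-the-envelope calculation (the 3000-input/1000-output token averages are hypothetical):

```shell
# Suppose requests average 3000 input tokens and 1000 output tokens:
# kv-cache-ratio = 3000 / (3000 + 1000) = 0.75
awk 'BEGIN { avg_in = 3000; avg_out = 1000;
             printf "--kv-cache-ratio %.2f\n", avg_in / (avg_in + avg_out) }'
```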

### **Example**: Single-GPU wint4 with 32K context length
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 1024 \
  --quantization wint4 \
  --enable-mm
```
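
Once the service is up, a quick request can confirm it is serving traffic. A minimal sketch, assuming the standard OpenAI-compatible route on the port configured above:

```shell
# Minimal text-only smoke test for the deployment above
# (adjust the port if you changed --port).
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```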

### **Example**: Dual-GPU wint8 with 128K context length
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 1024 \
  --quantization wint8 \
  --enable-mm
```

@@ -0,0 +1,3 @@
# Optimal Deployment

- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)

docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md (115 additions, 0 deletions)

@@ -0,0 +1,115 @@

# ERNIE-4.5-VL-28B-A3B-Paddle

**Note:** Deploying the multi-modal service requires adding the `--enable-mm` flag to your configuration.

## Performance Optimization Guide

To help you achieve the **best performance** with this model, here are the parameters you may need to tune, together with our recommendations. Please read the following tips carefully:

### **Context Length**
- **Parameter:** `--max-model-len`
- **Description:** Controls the maximum context length the model can process.
- **Recommendation:** To balance performance and memory usage, we suggest setting this to **32k** (32768).
- **Advanced:** If your hardware allows and you need a longer context, **128k** (131072) contexts are also supported.

⚠️ Note: Longer contexts significantly increase GPU memory requirements; make sure your hardware resources are sufficient before raising this value.

### **Maximum Sequence Count**
- **Parameter:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can handle; the supported range is 1~256.
- **Recommendation:** If you don't know the average number of sequences per request in your application, we recommend setting this to **256**. If the average in your scenario is clearly below 256, set it to a smaller value slightly above the average to further reduce memory usage and improve service performance.

### **Multi-Image and Multi-Video Input**
- **Parameter:** `--limit-mm-per-prompt`
- **Description:** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request, ensuring efficient resource utilization.
- **Recommendation:** We suggest setting both the image and video limits to 100 per prompt to balance performance and memory usage; each attachment in a request like the sketch below counts toward this limit.
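
A minimal sketch of a multi-image request, assuming the standard OpenAI-compatible `/v1/chat/completions` route and the port used in the example commands below; the image URLs and prompt are placeholders:

```shell
# Hypothetical two-image request; each attached image counts toward
# the per-prompt limit configured via --limit-mm-per-prompt.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
        {"type": "text", "text": "Compare these two images."}
      ]
    }]
  }'
```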

### **Performance Tuning**
> **Chunked prefill**
- **Parameter:** `--enable-chunked-prefill`
- **Why enable?**

  Enabling `chunked prefill` can **reduce peak GPU memory usage** and **increase service throughput**.

- **Additional options:**
  - `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended setting is 1024.

> **Prefix caching**

⚠️ Prefix caching is not yet supported in multi-modal mode.

### **Quantization Precision**
- **Parameter:** `--quantization`

- **Supported precision types:**
  - wint4 (suitable for most users)
  - wint8
  - bfloat16 (the default when `--quantization` is not set)

- **Recommendation:**
  - Unless you have extremely strict precision requirements, we strongly recommend wint4 quantization: it significantly reduces memory usage and improves throughput.
  - If you need slightly higher precision, try wint8.
  - Only try bfloat16 when your application has the most demanding accuracy requirements, as it needs much more GPU memory.

- **Verified devices and selected benchmark data:**

| Device | Runnable Quantization | TPS (tok/s) | Latency (ms) |
|:----------:|:----------:|:------:|:------:|
| A30 | wint4 | 432.99 | 17396.92 |
| L20 | wint4<br>wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
| H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
| A100 | wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
| H800 | wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59<br>54309.99 |

> ⚠️ Note: Device/precision combinations that have not been verified can still run as long as CPU and GPU memory requirements are met.

### **Other Configurations**
> **gpu-memory-utilization**
- **Parameter:** `--gpu-memory-utilization`
- **Usage:** Controls the GPU memory available to FastDeploy for service initialization. The default is 0.9, i.e. 10% of GPU memory is held in reserve.
- **Recommendation:** We recommend the default of 0.9. If stress testing reports insufficient GPU memory, try lowering this value.

> **kv-cache-ratio**
- **Parameter:** `--kv-cache-ratio`
- **Usage:** Controls the allocation ratio of KV-cache memory. The default is 0.75, meaning 75% of the KV-cache memory is allocated to the input.
- **Recommendation:** The theoretically optimal value is $\frac{average\ input\ length}{average\ input\ length+average\ output\ length}$ for your application scenario, as in the sketch below. If you cannot determine it, keep the default.
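
For instance, a back-of-the-envelope calculation (the 3000-input/1000-output token averages are hypothetical):

```shell
# Suppose requests average 3000 input tokens and 1000 output tokens:
# kv-cache-ratio = 3000 / (3000 + 1000) = 0.75
awk 'BEGIN { avg_in = 3000; avg_out = 1000;
             printf "--kv-cache-ratio %.2f\n", avg_in / (avg_in + avg_out) }'
```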

### **Example:** Single-GPU, wint4, 32K-context deployment command
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 1024 \
  --quantization wint4 \
  --enable-mm
```
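
Once the service is up, a quick request can confirm it is serving traffic. A minimal sketch, assuming the standard OpenAI-compatible route on the port configured above:

```shell
# Minimal text-only smoke test for the deployment above
# (adjust the port if you changed --port).
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```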

### **Example:** Dual-GPU, wint8, 128K-context deployment command
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 1024 \
  --quantization wint8 \
  --enable-mm
```

@@ -0,0 +1,3 @@
# Best Practices

- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)

Review comment: Why is bf16 the best?