From 86fb9bd33af25f01cf0bb5efa4f4c78691bef983 Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 12:07:10 +0800 Subject: [PATCH 1/5] Optimal Deployment --- README.md | 1 + .../ERNIE-4.5-VL-28B-A3B-Paddle.md | 108 ++++++++++++++++++ docs/optimal_deployment/README.md | 3 + 3 files changed, 112 insertions(+) create mode 100644 docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md create mode 100644 docs/optimal_deployment/README.md diff --git a/README.md b/README.md index b48c17a99e..ab7e17ffa3 100644 --- a/README.md +++ b/README.md @@ -61,6 +61,7 @@ Learn how to use FastDeploy through our documentation: - [Offline Inference Development](./docs/offline_inference.md) - [Online Service Deployment](./docs/online_serving/README.md) - [Full Supported Models List](./docs/supported_models.md) +- [Optimal Deployment](./docs/optimal_deployment/README.md) ## Supported Models diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md new file mode 100644 index 0000000000..37c47a2b43 --- /dev/null +++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -0,0 +1,108 @@ + +# ERNIE-4.5-VL-28B-A3B-Paddle + +**Note**: To enable multi-modal support, add the `--enable-mm` flag to your configuration. + +## Performance Optimization Guide + +To help you achieve the **best performance** with our model, here are several important parameters you may want to adjust. Please read through the following recommendations and tips: + +### **Context Length** +- **Parameter**: `--max-model-len` +- **Description**: Controls the maximum context length the model can process. +- **Recommendation**: We suggest setting this to **32k tokens** for balanced performance and memory usage. +- **Advanced**: If your hardware allows and you need even longer contexts, you can set it up to **128k tokens**. + + ⚠️ Note: Longer contexts require significantly more GPU memory. Please ensure your hardware is sufficient before increasing this value. + +### **Multi-Image & Multi-Video Input** +- **Parameter**: `--limit-mm-per-prompt` +- **Description**: Our model supports multiple images and videos per prompt. Use this parameter to limit the number of images and videos per request, ensuring efficient resource utilization. +- **Recommendation**: We suggest setting this to **100 images** and **100 videos** per prompt for balanced performance and memory usage. + +### **Optimization Recommendations** +> **chunked prefill** +- **Parameter**: `--enable-chunked-prefill` +- **Why enable?** + + Enabling chunked prefill can **reduce peak memory usage** and **increase inference speed**. +- **Additional options**: + - --max-num-batched-tokens + - --max-num-partial-prefills + - --max-long-partial-prefills +- **Tip**: + + The detailed workings of these auxiliary parameters are complex—feel free to use the example values provided in our documentation or scripts. + +> **prefix caching** + +⚠️ Note: Prefix caching is currently not supported in multi-modal mode. + +### **Quantization Precision**: +- **Parameter**: `--quantization` + +- **Supported Types**: + - wint4 (recommended for most users) + - wint8 + - Default: bfloat16 (if no quantization parameter is set) + +- **Recommendation**: +Unless you have extremely strict precision requirements, we strongly recommend using wint4 quantization. This will dramatically reduce memory footprint and improve throughput. +If you need slightly higher precision, try wint8. 
+Only use bfloat16 if your use case demands the highest possible accuracy, as it requires much more memory. + +- **Verified Devices and Performance** + +| Devices | Runnable Quantization | TPS(t/s) | Latency(ms) | +|:----------:|:----------:|:------:|:------:| +| A30 | wint4 | 432.99 | 17396.92 | +| L20 | wint4
wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
+| H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
+| A100| wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
+| H800| wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59
54309.99 | + +> Devices not verified can still run if their RAM/VRAM meets the requirements. + + +### **Example**: Single-card wint4 with 32K context length +```shell +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 1 \ + --max-model-len 32768 \ + --max-num-seqs 256 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --kv-cache-ratio 0.8 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 1024 \ + --max-num-partial-prefills 3 \ + --max-long-partial-prefills 3 \ + --quantization wint4 \ + --enable-mm \ +``` +### **Example**: Dual-GPU Wint8 with 128K Context Length Configuration +```shell +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 2 \ + --max-model-len 131072 \ + --max-num-seqs 256 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --kv-cache-ratio 0.8 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 1024 \ + --max-num-partial-prefills 3 \ + --max-long-partial-prefills 3 \ + --quantization wint8 \ + --enable-mm \ +``` \ No newline at end of file diff --git a/docs/optimal_deployment/README.md b/docs/optimal_deployment/README.md new file mode 100644 index 0000000000..05dd760d67 --- /dev/null +++ b/docs/optimal_deployment/README.md @@ -0,0 +1,3 @@ +# Optimal Deployment + +- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md) \ No newline at end of file From 1fbaa59a69c564ead9c33b7a3388f037fc6d22db Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 14:51:56 +0800 Subject: [PATCH 2/5] add zh doc --- .../ERNIE-4.5-VL-28B-A3B-Paddle.md | 124 ++++++++++++++++++ docs/zh/optimal_deployment/README.md | 3 + 2 files changed, 127 insertions(+) create mode 100644 docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md create mode 100644 docs/zh/optimal_deployment/README.md diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md new file mode 100644 index 0000000000..882fe9a8ea --- /dev/null +++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -0,0 +1,124 @@ + +# ERNIE-4.5-VL-28B-A3B-Paddle + +**注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。 + +## 性能优化指南 + +为了帮助您在使用本模型时达到**最佳性能**,以下是一些需要调整参数以及我们的建议。请仔细阅读以下推荐和小贴士: + +### **上下文长度** +- **参数:** `--max-model-len` +- **描述:** 控制模型可处理的最大上下文长度。 +- **推荐:** 考虑到性能和内存占用的平衡,我们建议设置为**32k**(32768)。 +- **进阶:** 如果您的硬件允许,并且您需要更长的上下文长度,我们也支持**128k**(131072)长度的上下文。 + + ⚠️ 注:更长的上下文会显著增加GPU显存需求,设置更长的上下文之前确保硬件资源是满足的。 + +### **最大序列数量** +- **参数:** `--max-num-seqs` +- **描述:** 控制服务可以处理的最大序列数量,支持1~256。 +- **推荐:** 如果您不知道实际应用场景中请求的平均序列数量是多少,我们建议设置为**256**。如果您的应用场景中请求的平均序列数量明显少于256,我们建议设置为一个略大于平均值的较小值,以进一步优化显存占用以优化服务性能。 + +### **多图、多视频输入** +- **参数**:`--limit-mm-per-prompt` +- **描述**:我们的模型支持单次提示词(prompt)中输入多张图片和视频。请使用此参数限制每次请求的图片/视频数量,以确保资源高效利用。 +- **推荐**:我们建议将单次提示词(prompt)中的图片和视频数量均设置为100个,以平衡性能与内存占用。 + +### **性能调优** +> **chunked prefill** +- **参数:**:`--enable-chunked-prefill` +- **用处:** + + 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 + +- **其他配置**: + - `--max-num-batched-tokens`: + - `--max-num-partial-prefills`: + - `--max-long-partial-prefills`: + +> **上下文缓存** + +⚠️ 目前上下缓存的功能在多模态上还未支持。 + +### **量化精度** +- 
**参数:** `--quantization` + +- **已支持的精度类型:** + - wint4 (适合大多数用户) + - wint8 + - bfloat16 (未设置 `--quantization` 参数时,默认使用bfloat16) + +- **推荐:** + - 除非您有极其严格的精度要求,否则我们强烈建议使用wint4量化。这将显著降低内存占用并提升吞吐量。 + - 若需要稍高的精度,可尝试wint8。 + - 仅当您的应用场景对精度有极致要求时候才尝试使用bfloat16,因为它需要更多显存。 + +- **验证过的设备和部分测试数据:** + + | 设备 | 可运行的量化精度 | TPS(t/s) | Latency(ms) | + |:----------:|:----------:|:------:|:------:| + | A30 | wint4 | 432.99 | 17396.92 | + | L20 | wint4
wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
+ | H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
+ | A100| wint4<br>wint8<br>bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78<br>87448.57 |
+ | H800| wint4<br>wint8<br>bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59
54309.99 | + +> 没有验证过的设备和精度组合,只要内存和显存满足要求也是可以运行的。 + +### **其他配置** +> **gpu-memory-utilization** +- **参数:**`--gpu-memory-utilization` +- **用处:** + + 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 + +> **kv-cache-ratio** +- **参数:**`--kv-cache-ratio` +- **用处:** + + 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 + + +### **示例:** 单卡、wint4、32K上下文部署命令 +```shell +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 1 \ + --max-model-len 32768 \ + --max-num-seqs 256 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --kv-cache-ratio 0.8 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 1024 \ + --max-num-partial-prefills 3 \ + --max-long-partial-prefills 3 \ + --quantization wint4 \ + --enable-mm \ +``` +### **示例:** 双卡、wint8、128K上下文部署命令 +```shell +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 2 \ + --max-model-len 131072 \ + --max-num-seqs 256 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --kv-cache-ratio 0.8 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 1024 \ + --max-num-partial-prefills 3 \ + --max-long-partial-prefills 3 \ + --quantization wint8 \ + --enable-mm \ +``` \ No newline at end of file diff --git a/docs/zh/optimal_deployment/README.md b/docs/zh/optimal_deployment/README.md new file mode 100644 index 0000000000..1cb368d70c --- /dev/null +++ b/docs/zh/optimal_deployment/README.md @@ -0,0 +1,3 @@ +# 最佳实践 + +- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md) \ No newline at end of file From 5ff7765170d1b3dfbfdd23d3392ac9bd6f36f665 Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 16:22:13 +0800 Subject: [PATCH 3/5] modify context --- .../ERNIE-4.5-VL-28B-A3B-Paddle.md | 43 +++++++++++-------- .../ERNIE-4.5-VL-28B-A3B-Paddle.md | 32 ++++++-------- 2 files changed, 38 insertions(+), 37 deletions(-) diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 37c47a2b43..2cc436b2e6 100644 --- a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -10,11 +10,16 @@ To help you achieve the **best performance** with our model, here are several im ### **Context Length** - **Parameter**: `--max-model-len` - **Description**: Controls the maximum context length the model can process. -- **Recommendation**: We suggest setting this to **32k tokens** for balanced performance and memory usage. -- **Advanced**: If your hardware allows and you need even longer contexts, you can set it up to **128k tokens**. +- **Recommendation**: We suggest setting this to **32k tokens** (32768) for balanced performance and memory usage. +- **Advanced**: If your hardware allows and you need even longer contexts, you can set it up to **128k tokens** (131072). ⚠️ Note: Longer contexts require significantly more GPU memory. Please ensure your hardware is sufficient before increasing this value. +### **Maximum Sequence Number** +- **Parameter**: `--max-num-seqs` +- **Description**: Controls the maximum number of sequences the service can handle, supports 1~256. 
+- **Recommendation**: If you don't know the average number of sequences in your actual application scenario, we recommend setting it to **256**. If the average number of sequences in your application scenario is significantly less than 256, we recommend setting it to a slightly higher value than the average to further optimize memory usage and service performance. + ### **Multi-Image & Multi-Video Input** - **Parameter**: `--limit-mm-per-prompt` - **Description**: Our model supports multiple images and videos per prompt. Use this parameter to limit the number of images and videos per request, ensuring efficient resource utilization. @@ -25,18 +30,13 @@ To help you achieve the **best performance** with our model, here are several im - **Parameter**: `--enable-chunked-prefill` - **Why enable?** - Enabling chunked prefill can **reduce peak memory usage** and **increase inference speed**. + Enabling chunked prefill can **reduce peak memory usage** and **increase throughput**. - **Additional options**: - - --max-num-batched-tokens - - --max-num-partial-prefills - - --max-long-partial-prefills -- **Tip**: - - The detailed workings of these auxiliary parameters are complex—feel free to use the example values provided in our documentation or scripts. + - `--max-num-batched-tokens`: Limit the maximum token count per chunk, with a recommended setting of 1,024. > **prefix caching** -⚠️ Note: Prefix caching is currently not supported in multi-modal mode. +⚠️ Prefix caching is currently not supported in multi-modal mode. ### **Quantization Precision**: - **Parameter**: `--quantization` @@ -53,7 +53,7 @@ Only use bfloat16 if your use case demands the highest possible accuracy, as it - **Verified Devices and Performance** -| Devices | Runnable Quantization | TPS(t/s) | Latency(ms) | +| Devices | Runnable Quantization | TPS(tok/s) | Latency(ms) | |:----------:|:----------:|:------:|:------:| | A30 | wint4 | 432.99 | 17396.92 | | L20 | wint4
wint8 | 3311.34<br>2423.36 | 46566.81
60790.91 | @@ -61,9 +61,20 @@ Only use bfloat16 if your use case demands the highest possible accuracy, as it | A100| wint4
wint8
bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78
87448.57 | | H800| wint4
wint8
bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59
54309.99 | -> Devices not verified can still run if their RAM/VRAM meets the requirements. +> ⚠️ Note: Devices not verified can still run if their CPU/GPU memory meets the requirements. + +### **Other Configurations** +> **gpu-memory-utilization** +- **Parameter**: `--gpu-memory-utilization` +- **Usage**: Controls the available GPU memory allocated for FastDeploy service initialization, with a default value of 0.9 (reserving 10% of GPU memory as buffer). +- **Recommendation**: It is recommended to set it to 0.9 on Nvidia Ampere GPUs, and to 0.8–0.9 on Hopper GPUs. If you encounter an out-of-memory error during service stress testing, you can try lowering this value. +> **kv-cache-ratio** +- **Parameter**: `--kv-cache-ratio` +- **Usage**: It is used to control the allocation ratio of GPU memory for the kv cache. The default value is 0.75, meaning that 75% of the kv cache memory is allocated to the input. +- **Recommendation**: Theoretically, the optimal value should be set to $\frac{average\ input\ length}{average\ input\ length+average\ output\ length}$ for your application scenario. If you are unsure, you can keep the default value. + ### **Example**: Single-card wint4 with 32K context length ```shell python -m fastdeploy.entrypoints.openai.api_server \ @@ -77,11 +88,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ --limit-mm-per-prompt '{"image": 100, "video": 100}' \ --reasoning-parser ernie-45-vl \ --gpu-memory-utilization 0.9 \ - --kv-cache-ratio 0.8 \ + --kv-cache-ratio 0.75 \ --enable-chunked-prefill \ --max-num-batched-tokens 1024 \ - --max-num-partial-prefills 3 \ - --max-long-partial-prefills 3 \ --quantization wint4 \ --enable-mm \ ``` @@ -98,11 +107,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ --limit-mm-per-prompt '{"image": 100, "video": 100}' \ --reasoning-parser ernie-45-vl \ --gpu-memory-utilization 0.9 \ - --kv-cache-ratio 0.8 \ + --kv-cache-ratio 0.75 \ --enable-chunked-prefill \ --max-num-batched-tokens 1024 \ - --max-num-partial-prefills 3 \ - --max-long-partial-prefills 3 \ --quantization wint8 \ --enable-mm \ ``` \ No newline at end of file diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 882fe9a8ea..768394e5cb 100644 --- a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -30,12 +30,10 @@ - **参数:**:`--enable-chunked-prefill` - **用处:** - 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 + 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**。 - **其他配置**: - - `--max-num-batched-tokens`: - - `--max-num-partial-prefills`: - - `--max-long-partial-prefills`: + - `--max-num-batched-tokens`:限制每个chunk的最大token数量,推荐设置为1024。 > **上下文缓存** @@ -56,7 +54,7 @@ - **验证过的设备和部分测试数据:** - | 设备 | 可运行的量化精度 | TPS(t/s) | Latency(ms) | + | 设备 | 可运行的量化精度 | TPS(tok/s) | Latency(ms) | |:----------:|:----------:|:------:|:------:| | A30 | wint4 | 432.99 | 17396.92 | | L20 | wint4
wint8 | 3311.34<br>2423.36 | 46566.81
60790.91 | @@ -64,20 +62,20 @@ | A100| wint4
wint8
bfloat16 | 4970.15<br>4842.86<br>3946.32 | 68316.08<br>78518.78
87448.57 | | H800| wint4
wint8
bfloat16 | 7450.01<br>7455.76<br>6351.90 | 49076.18<br>49253.59
54309.99 | -> 没有验证过的设备和精度组合,只要内存和显存满足要求也是可以运行的。 +> ⚠️ 注:没有验证过的设备和精度组合,只要内存和显存满足要求也是可以运行的。 ### **其他配置** > **gpu-memory-utilization** -- **参数:**`--gpu-memory-utilization` -- **用处:** - - 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 +- **参数:** `--gpu-memory-utilization` +- **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。 +- **推荐:** A卡上推荐0.9,H卡上推荐0.8~0.9。如果服务压测时提示显存不足,可以尝试调低该值。 + > **kv-cache-ratio** - **参数:**`--kv-cache-ratio` -- **用处:** - - 开启 `chunked prefill` 可**降低峰值内存占用**并**提升推理速度**。 +- **用处:** 用于控制 kv cache 显存的分配比例,默认0.75,即75%的 kv cache 显存给输入。 +- **推荐:** 理论最佳值应设置为应用场景的$\frac{平均输入长度}{平均输入长度+平均输出长度}$,如果您无法确定,保持默认值即可。 + ### **示例:** 单卡、wint4、32K上下文部署命令 @@ -93,11 +91,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ --limit-mm-per-prompt '{"image": 100, "video": 100}' \ --reasoning-parser ernie-45-vl \ --gpu-memory-utilization 0.9 \ - --kv-cache-ratio 0.8 \ + --kv-cache-ratio 0.75 \ --enable-chunked-prefill \ --max-num-batched-tokens 1024 \ - --max-num-partial-prefills 3 \ - --max-long-partial-prefills 3 \ --quantization wint4 \ --enable-mm \ ``` @@ -114,11 +110,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ --limit-mm-per-prompt '{"image": 100, "video": 100}' \ --reasoning-parser ernie-45-vl \ --gpu-memory-utilization 0.9 \ - --kv-cache-ratio 0.8 \ + --kv-cache-ratio 0.75 \ --enable-chunked-prefill \ --max-num-batched-tokens 1024 \ - --max-num-partial-prefills 3 \ - --max-long-partial-prefills 3 \ --quantization wint8 \ --enable-mm \ ``` \ No newline at end of file From bd7705d59049c5b4ee7e16919945269cdbfa39f3 Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 16:24:44 +0800 Subject: [PATCH 4/5] modify context --- docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md | 1 - docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md | 3 --- 2 files changed, 4 deletions(-) diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 2cc436b2e6..69982602bd 100644 --- a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -69,7 +69,6 @@ Only use bfloat16 if your use case demands the highest possible accuracy, as it - **Usage**: Controls the available GPU memory allocated for FastDeploy service initialization, with a default value of 0.9 (reserving 10% of GPU memory as buffer). - **Recommendation**: It is recommended to set it to 0.9 on Nvidia Ampere GPUs, and to 0.8–0.9 on Hopper GPUs. If you encounter an out-of-memory error during service stress testing, you can try lowering this value. - > **kv-cache-ratio** - **Parameter**: `--kv-cache-ratio` - **Usage**: It is used to control the allocation ratio of GPU memory for the kv cache. The default value is 0.75, meaning that 75% of the kv cache memory is allocated to the input. 
diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 768394e5cb..56a001977e 100644 --- a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -70,14 +70,11 @@ - **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。 - **推荐:** A卡上推荐0.9,H卡上推荐0.8~0.9。如果服务压测时提示显存不足,可以尝试调低该值。 - > **kv-cache-ratio** - **参数:**`--kv-cache-ratio` - **用处:** 用于控制 kv cache 显存的分配比例,默认0.75,即75%的 kv cache 显存给输入。 - **推荐:** 理论最佳值应设置为应用场景的$\frac{平均输入长度}{平均输入长度+平均输出长度}$,如果您无法确定,保持默认值即可。 - - ### **示例:** 单卡、wint4、32K上下文部署命令 ```shell python -m fastdeploy.entrypoints.openai.api_server \ From c24a9dfdc792170306cf04f9425b5bba8fe9d358 Mon Sep 17 00:00:00 2001 From: ming1753 Date: Wed, 9 Jul 2025 16:37:20 +0800 Subject: [PATCH 5/5] modify context --- docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md | 4 ++-- docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 69982602bd..34ad118a41 100644 --- a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -67,7 +67,7 @@ Only use bfloat16 if your use case demands the highest possible accuracy, as it > **gpu-memory-utilization** - **Parameter**: `--gpu-memory-utilization` - **Usage**: Controls the available GPU memory allocated for FastDeploy service initialization, with a default value of 0.9 (reserving 10% of GPU memory as buffer). -- **Recommendation**: It is recommended to set it to 0.9 on Nvidia Ampere GPUs, and to 0.8–0.9 on Hopper GPUs. If you encounter an out-of-memory error during service stress testing, you can try lowering this value. +- **Recommendation**: It is recommended to set it to 0.9 (default). If you encounter an out-of-memory error during service stress testing, you can try lowering this value. > **kv-cache-ratio** - **Parameter**: `--kv-cache-ratio` @@ -93,7 +93,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --quantization wint4 \ --enable-mm \ ``` -### **Example**: Dual-GPU Wint8 with 128K Context Length Configuration +### **Example**: Dual-GPU wint8 with 128K context length ```shell python -m fastdeploy.entrypoints.openai.api_server \ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md index 56a001977e..0641f2adc1 100644 --- a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md +++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -68,7 +68,7 @@ > **gpu-memory-utilization** - **参数:** `--gpu-memory-utilization` - **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。 -- **推荐:** A卡上推荐0.9,H卡上推荐0.8~0.9。如果服务压测时提示显存不足,可以尝试调低该值。 +- **推荐:** 推荐使用默认值0.9。如果服务压测时提示显存不足,可以尝试调低该值。 > **kv-cache-ratio** - **参数:**`--kv-cache-ratio`