From 9e53f61e83a4fed487220dd24fb4f806f9defeb0 Mon Sep 17 00:00:00 2001
From: Lu Fang
Date: Mon, 7 Apr 2025 22:58:50 -0700
Subject: [PATCH] remove attn_temperature_tuning in default user guide

Signed-off-by: Lu Fang
---
 _posts/2025-04-05-llama4.md | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/_posts/2025-04-05-llama4.md b/_posts/2025-04-05-llama4.md
index 45c603b..0613809 100644
--- a/_posts/2025-04-05-llama4.md
+++ b/_posts/2025-04-05-llama4.md
@@ -35,7 +35,7 @@ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruc
 ```
 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
   --tensor-parallel-size 8 \
-  --max-model-len 430000 --override-generation-config='{"attn_temperature_tuning": true}'
+  --max-model-len 430000
 ```
 
 On 8x H200 GPUs:
@@ -45,7 +45,7 @@
 ```
 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
   --tensor-parallel-size 8 \
-  --max-model-len 3600000 --override-generation-config='{"attn_temperature_tuning": true}'
+  --max-model-len 3600000
 ```
 
 * Maverick (up to 1M context):
@@ -53,11 +53,9 @@ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruc
 ```
 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
   --tensor-parallel-size 8
-  --max-model-len 1000000 --override-generation-config='{"attn_temperature_tuning": true}'
+  --max-model-len 1000000
 ```
 
-Note: we highly recommend to turn on attn_temperature_tuning to improve accuracy for long contexts longer than 32K tokens, and VLLM_DISABLE_COMPILE_CACHE=1 is required.
-
 **Multimodality:**
 
 The Llama 4 models excel at image understanding up to 8-10 images. By default, vLLM server accepts 1 image per request. Please pass `--limit-mm-per-prompt image=10` to serve up to 10 images per request with OpenAI-compatible API. We also recommend checking out our multi-image offline inference example with Llama-4 [here](https://github.com/vllm-project/vllm/blob/v0.8.3/examples/offline_inference/vision_language_multi_image.py).
@@ -74,6 +72,7 @@ While more performance enhancements are on the way, we believe the Llama 4 model
 
 * **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting.
 * **Maximize Context Window (up to 10M):** To fully utilize the maximum context windows (up to 10M for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our distributed inference guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
+* **Improve Long Context Accuracy (\>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens.
 
 **Other Hardware Support & Quantizations:**
 
@@ -108,4 +107,3 @@ We extend our sincere thanks to the Meta team for their implementation of the mo
 We also thank the AMD team for their support in enabling these models on MI300X: [Hongxia Yang](https://github.com/hongxiayang) and Weijun Jiang.
 
 The vLLM team’s performance benchmarks were run on hardware generously provided by Nebius and NVIDIA.
-
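Note for reviewers: after this change, `attn_temperature_tuning` only appears in the tips list rather than in the default commands. A minimal sketch of the opt-in invocation that the new tip describes, reusing the Maverick command and flags already shown in the post (the 430K `--max-model-len` value is just the example value from above):

```
# Opt-in long-context accuracy tuning for contexts longer than 32K tokens,
# per the "Improve Long Context Accuracy" tip added by this patch.
# VLLM_DISABLE_COMPILE_CACHE=1 is kept from the original serve commands.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 430000 \
  --override-generation-config='{"attn_temperature_tuning": true}'
```

Keeping the override out of the default quick-start commands keeps them short to copy-paste, while the tip preserves the long-context recommendation for users who need it.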