From e56b92589c8e504c45245fd47dffca553b6eac49 Mon Sep 17 00:00:00 2001
From: Lucia Fang <116399278+luccafong@users.noreply.github.com>
Date: Tue, 15 Apr 2025 10:40:09 -0700
Subject: [PATCH] Remove tip for attn_temperature_tuning in llama4 blog

Since attn_temperature_tuning is now auto-enabled when max-model-len > 32K
(see PR https://github.com/vllm-project/vllm/pull/16439), this tip can be
removed to avoid confusion.
---
 _posts/2025-04-05-llama4.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/_posts/2025-04-05-llama4.md b/_posts/2025-04-05-llama4.md
index 0613809..42aca6a 100644
--- a/_posts/2025-04-05-llama4.md
+++ b/_posts/2025-04-05-llama4.md
@@ -72,7 +72,6 @@ While more performance enhancements are on the way, we believe the Llama 4 model
 
 * **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting.
 * **Maximize Context Window (up to 10M):** To fully utilize the maximum context windows (up to 10M for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our distributed inference guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
-* **Improve Long Context Accuracy (\>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens.
 
 **Other Hardware Support & Quantizations:**
 