
Commit abf1faa

whx-sjtu and hw_whx authored
[ModelRunnerV1] Adapt kv_cache quant in v1. (#685)
Set the kv_cache_spec dtype from self.kv_cache_dtype in model_runner_v1 in order to support kv_cache quant in v1.

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
1 parent 2204e4d commit abf1faa

File tree

1 file changed: +1 −1 lines changed


vllm_ascend/worker/model_runner_v1.py

Lines changed: 1 addition & 1 deletion

@@ -880,7 +880,7 @@ def get_kv_cache_spec(self) -> KVCacheSpec:
                     block_size=block_size,
                     num_kv_heads=attn_module.num_kv_heads,
                     head_size=attn_module.head_size,
-                    dtype=attn_module.dtype)
+                    dtype=self.kv_cache_dtype)
            elif attn_module.attn_type in (AttentionType.ENCODER,
                                           AttentionType.ENCODER_ONLY):
                # encoder-only attention does not need KV cache.
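The one-line change above swaps the KV cache spec's dtype from the attention module's compute dtype to the runner's configured KV cache dtype, so a quantized cache setting actually takes effect. A minimal self-contained sketch of that pattern follows; the class and field names (`AttnModule`, `KVCacheSpec`, `ModelRunner`) are simplified stand-ins, not vLLM's actual API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttnModule:
    """Stand-in for an attention layer; dtype is its compute dtype."""
    num_kv_heads: int
    head_size: int
    dtype: str  # e.g. "bfloat16"


@dataclass
class KVCacheSpec:
    """Stand-in for the KV cache spec built per attention layer."""
    block_size: int
    num_kv_heads: int
    head_size: int
    dtype: str


class ModelRunner:
    def __init__(self, model_dtype: str, kv_cache_dtype: Optional[str]):
        # If no quantized KV cache dtype is configured, fall back to the
        # model's compute dtype (hypothetical fallback for illustration).
        self.kv_cache_dtype = kv_cache_dtype or model_dtype

    def get_kv_cache_spec(self, attn_module: AttnModule,
                          block_size: int) -> KVCacheSpec:
        return KVCacheSpec(
            block_size=block_size,
            num_kv_heads=attn_module.num_kv_heads,
            head_size=attn_module.head_size,
            # Before the patch: dtype=attn_module.dtype, which silently
            # ignored any quantized KV cache setting. After: take the
            # runner's configured KV cache dtype.
            dtype=self.kv_cache_dtype,
        )


runner = ModelRunner(model_dtype="bfloat16", kv_cache_dtype="int8")
spec = runner.get_kv_cache_spec(AttnModule(8, 128, "bfloat16"), block_size=16)
print(spec.dtype)  # int8, not the attention module's bfloat16
```

With the pre-patch behavior, `spec.dtype` would have been `"bfloat16"` regardless of the configured quantization, which is exactly the bug this commit fixes.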
