[Refactor] Use tuple as kv cache instead of tensor #1594
base: main
Conversation
[num_tokens, self.num_heads, self.kv_lora_rank],
dtype=q.dtype,
device=q.device)
k_cache = torch.cat(
I am worried about the extra NPU memory consumption this will bring
Yes, it will use extra NPU memory in torch.cat. Do you have any better suggestion?
I noticed that in prefill we also use cat for this kind of thing (e.g. line 830), so the consumption should already have been accounted for in the warmup stage.
And for deepseek-r1, the extra consumption should be one layer's KV cache size, roughly 15G / 61 ≈ 251M, which is probably affordable.
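A quick back-of-the-envelope check of that estimate, taking the 15G total KV cache and 61 layers straight from the comment above (illustrative numbers, not a measurement):

```python
# Per-layer estimate quoted above: total KV cache split evenly across layers.
total_kv_cache_gib = 15   # assumed deepseek-r1 total KV cache size (from the comment)
num_layers = 61           # deepseek-r1 layer count (from the comment)

per_layer_mib = total_kv_cache_gib * 1024 / num_layers
print(f"~{per_layer_mib:.0f} MiB per layer")  # ~252 MiB, matching the ~251M figure
```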
Actually, I have no good suggestion for it either; it seems we must do this concat. But I think it can be removed once ring attention is enabled, right? If so, I think this change is acceptable. Also cc @ganyi1996ppo
This line can be wiped out once ringmla is publicly accessible, and it seems most of the changes in this PR are already contained in PR #950. Can you refactor this after that PR merges? Otherwise there will be lots of conflicts, which may add more barriers to getting #950 merged. cc @wangxiyuan @Yikun
@ganyi1996ppo It's OK. If all the changes in this PR have already been made in #950, this PR can be closed. If so, I hope it can be merged ASAP, since other work that relies on the KV cache will depend on it.
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Force-pushed from d893a91 to c848786
vllm_ascend/ops/attention.py
Outdated
# cached_kv_c: [batch_size, max_context_len, latent_kv]
# cached_k_pe: [batch_size, max_context_len, rope_dim]
cache_kv_c = cache_kv_c_pe[:, :, :latent_kv_dim]
cache_k_pe = cache_kv_c_pe[:, :, latent_kv_dim:]
batch_size = query_lens.size(0)
redundant with L154
batch_size = query_lens.size(0)
done
num_kv_heads, nope_dim)
rope_cache_shape = (num_blocks, block_size,
num_kv_heads, rope_dim)
nope_cache = torch.zeros(
Let's keep the naming unified: either all latent_kv_cache or all nope_cache.
Yes, modified to use kv_cache_nope and kv_cache_pe here. In mla_v1.py, kv_c and k_pe are used for short, as before.
Pull Request Overview
This PR standardizes the KV cache representation by switching from Tensor to tuple-based caches across eager and graph modes, enabling unified storage for downstream zero-copy operations.
- Unifies KV cache initialization in model_runner_v1.py to produce separate "nope" and "rope" tensors as tuple entries.
- Updates attention operators and MLA modules (ops/attention.py, attention/mla_v1.py, attention/attention_v1.py) to accept and destructure tuple caches (see the sketch after this list).
- Adjusts cache-length checks from .numel() to len(...) and refactors cache reshaping to work with per-block lists/tuples.
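A minimal sketch of the consumer-side pattern this describes; the function and variable names are illustrative, not the repository's exact identifiers:

```python
from typing import Tuple

import torch


def run_attention(kv_cache: Tuple[torch.Tensor, torch.Tensor]) -> None:
    """Illustrative consumer of a tuple-based KV cache."""
    # The cache is now a tuple of two tensors instead of one packed tensor,
    # so length checks use len(...) rather than tensor.numel().
    assert len(kv_cache) == 2

    # Destructure the "nope" (latent kv) part and the "rope" part.
    nope_cache, rope_cache = kv_cache

    # Both parts share the (num_blocks, block_size, num_kv_heads, ...) layout,
    # differing only in the last dimension (latent_kv vs rope_dim).
    assert nope_cache.shape[:3] == rope_cache.shape[:3]
```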
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
vllm_ascend/worker/model_runner_v1.py | Generates tuple-based caches, iterates per block in eager mode |
vllm_ascend/ops/attention.py | Changed vanilla_chunked_prefill_mla to accept tuple caches |
vllm_ascend/attention/mla_v1.py | Updated multiple methods to destructure Tuple[Tensor] caches |
vllm_ascend/attention/attention_v1.py | Switched cache checks to len(...) and tuple-based handling |
Comments suppressed due to low confidence (2)
vllm_ascend/ops/attention.py:141
- [nitpick] The annotation tuple[torch.Tensor] doesn't specify the expected element count. For maintainability, consider tuple[torch.Tensor, torch.Tensor] to clearly indicate the two cache tensors.
kv_c_and_k_pe_cache: tuple[torch.Tensor], # (num_blocks, block_size, latent_kv/rope_dim)
vllm_ascend/attention/mla_v1.py:662
- Ensure Tuple is imported from the typing module at the top of the file, or this annotation will raise a NameError.
kv_c_and_k_pe_cache: Tuple[torch.Tensor],
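A minimal sketch of the fix being suggested; the surrounding signature is a hypothetical stand-in, not the actual method at mla_v1.py:662:

```python
# At the top of the file: make the Tuple name available for annotations.
from typing import Tuple

import torch


def forward(kv_c_and_k_pe_cache: Tuple[torch.Tensor, torch.Tensor]) -> int:
    # Without the import above, evaluating this annotation raises NameError when the module loads.
    return len(kv_c_and_k_pe_cache)
```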
kv_cache = torch_npu.npu_format_cast(kv_cache,
acl_format)
kv_cache_list.append(kv_cache)
kv_caches[layer_name] = kv_cache_list
kv_cache_list is a list but other parts of the code expect a tuple of (nope_cache, rope_cache). Consider converting to a tuple: kv_caches[layer_name] = tuple(kv_cache_list).
- kv_caches[layer_name] = kv_cache_list
+ kv_caches[layer_name] = tuple(kv_cache_list)
rope_dim: int,
attn_metadata: AscendMLAMetadata,
prefix_output: torch.Tensor,
prefix_lse: torch.Tensor,
):
assert len(kv_c_and_k_pe_cache) > 1
[nitpick] This assertion allows any length ≥ 2. To enforce exactly two cache parts, consider assert len(kv_c_and_k_pe_cache) == 2 for clearer intent.
- assert len(kv_c_and_k_pe_cache) > 1
+ assert len(kv_c_and_k_pe_cache) == 2
vllm_ascend/attention/mla_v1.py
Outdated
block_table=attn_metadata.decode.
block_table,  # type:ignore
[nitpick] The block_table argument is split across lines, which reduces readability. Merge it into one line: block_table=attn_metadata.decode.block_table.
- block_table=attn_metadata.decode.
- block_table,  # type:ignore
+ block_table=attn_metadata.decode.block_table,  # type:ignore
rope_cache_shape, dtype=dtype, device=self.device,
pin_memory=True)
[nitpick] Specifying pin_memory=True on a non-CPU device tensor has no effect. Consider setting pin_memory conditionally, only for CPU allocations, or removing it for NPU/GPU tensors.
- rope_cache_shape, dtype=dtype, device=self.device,
- pin_memory=True)
+ rope_cache_shape, dtype=dtype, device=self.device,
+ pin_memory=(self.device.type == 'cpu'))
If I choose eager mode, does this change matter?
No, it won't have any influence on either mode.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Previously, graph mode stored the KV cache as a Tuple while eager mode used a Tensor. To enable the PD-separated zero-copy mode in the later PR #950, where both P and D nodes must share the same storage format, this PR unifies the KV cache type to Tuple for both eager and graph modes.
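A minimal sketch of the allocation-side shape of this change; the helper name, shapes, and arguments are illustrative assumptions, not the exact code in model_runner_v1.py:

```python
from typing import Tuple

import torch


def allocate_layer_kv_cache(num_blocks: int, block_size: int, num_kv_heads: int,
                            nope_dim: int, rope_dim: int, dtype: torch.dtype,
                            device: torch.device) -> Tuple[torch.Tensor, torch.Tensor]:
    """Allocate one layer's KV cache as a (nope, rope) tuple, shared by eager and graph modes."""
    nope_cache = torch.zeros((num_blocks, block_size, num_kv_heads, nope_dim),
                             dtype=dtype, device=device)
    rope_cache = torch.zeros((num_blocks, block_size, num_kv_heads, rope_dim),
                             dtype=dtype, device=device)
    # Both modes now see the same tuple layout, so P and D nodes can share storage directly.
    return nope_cache, rope_cache
```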
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI passed.