[Refactor] Use tuple as kv cache instead of tensor #1594

Open · wants to merge 2 commits into main

Conversation


@lidenghui1110 commented Jul 2, 2025

What this PR does / why we need it?

Previously, graph mode stored the KV cache as a Tuple while eager mode used a Tensor. To enable the PD-separated zero-copy mode in the follow-up PR #950, where both P and D nodes must share the same storage format, this PR unifies the KV cache type to Tuple for both eager and graph modes.
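A minimal sketch of what that unification means per layer. The shapes and the packed layout shown for the old eager path are illustrative assumptions, not taken from the diff:

```python
import torch

# Illustrative shapes only; the real values come from the model config.
num_blocks, block_size, num_kv_heads = 4, 16, 1
nope_dim, rope_dim = 512, 64

# Before (eager mode): one packed tensor per layer.
packed_kv_cache = torch.zeros(num_blocks, block_size, num_kv_heads,
                              nope_dim + rope_dim)

# After this PR (both modes): a tuple of the two parts, so P and D nodes
# in the later zero-copy work can rely on one storage format.
kv_cache = (
    torch.zeros(num_blocks, block_size, num_kv_heads, nope_dim),  # latent ("nope") part
    torch.zeros(num_blocks, block_size, num_kv_heads, rope_dim),  # rope part
)
```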

Does this PR introduce any user-facing change?

No.

How was this patch tested?

CI passed.

[num_tokens, self.num_heads, self.kv_lora_rank],
dtype=q.dtype,
device=q.device)
k_cache = torch.cat(
Collaborator

I am worried about the extra NPU memory consumption this will bring

@lidenghui1110 (Author) Jul 3, 2025

Yes, it will use extra NPU memory in torch.cat. Do you have any better suggestion?
I noticed that in prefill we also use cat for this (e.g. line 830), so the consumption should already have been accounted for in the warmup stage.
And in DeepSeek-R1 the extra consumption should be the per-layer KV cache size, roughly 15G/61 = 251M, which is probably affordable.
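A hedged sketch of both points: the concat as I read it from the diff fragment above (placeholder shapes), plus the back-of-the-envelope estimate quoted in this reply (assumed figures, not measured):

```python
import torch

# The concat in question: the two tuple halves are joined along the last dim,
# which allocates a fresh buffer -- the extra NPU memory being discussed.
num_tokens, num_heads, kv_lora_rank, rope_dim = 8, 1, 512, 64
kv_c = torch.zeros(num_tokens, num_heads, kv_lora_rank)
k_pe = torch.zeros(num_tokens, num_heads, rope_dim)
k_cache = torch.cat([kv_c, k_pe], dim=-1)  # shape: (num_tokens, num_heads, 576)

# Rough per-layer estimate from the comment: ~15 GiB of KV cache over 61 layers.
per_layer_mib = 15 * 1024 / 61
print(f"~{per_layer_mib:.0f} MiB of extra scratch per layer")  # ~252 MiB
```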

Collaborator

Actually, I have no good suggestions on it either. It seems we must do this concat. But I think we can remove it once ring attention is enabled, right? If so, I think this change is acceptable. Also cc @ganyi1996ppo

@ganyi1996ppo (Collaborator) Jul 4, 2025

This line can be wiped out once ringmla is publicly accessible. Also, most of the changes in this PR seem to already be included in PR #950; can you refactor this after that PR merges? Otherwise there will be lots of conflicts, which may add more barriers to getting #950 merged. cc @wangxiyuan @Yikun

@lidenghui1110 (Author) Jul 4, 2025

@ganyi1996ppo It's OK. If all the changes in this PR have already been made in #950, this PR can be closed. If so, I hope it can be merged as soon as possible; other work that relies on the KV cache will depend on it.

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
# cached_kv_c: [batch_size, max_context_len, latent_kv]
# cached_k_pe: [batch_size, max_context_len, rope_dim]
cache_kv_c = cache_kv_c_pe[:, :, :latent_kv_dim]
cache_k_pe = cache_kv_c_pe[:, :, latent_kv_dim:]
batch_size = query_lens.size(0)
Collaborator

redundant with L154

Suggested change
batch_size = query_lens.size(0)

Author

done

num_kv_heads, nope_dim)
rope_cache_shape = (num_blocks, block_size,
num_kv_heads, rope_dim)
nope_cache = torch.zeros(
Collaborator

Let's keep the naming unified: either latent_kv_cache or nope_cache everywhere.

Author

Yes, modified to use kv_cache_nope and kv_cache_pe here. In mla_v1.py, the short names kv_c and k_pe are kept as before.
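Purely illustrative sketch of how those names line up across the two files after the rename (shapes are placeholders):

```python
import torch

# model_runner_v1.py side: allocate the pair under the new names.
kv_cache_nope = torch.zeros(2, 4, 1, 512)   # latent ("nope") cache
kv_cache_pe = torch.zeros(2, 4, 1, 64)      # rope cache
kv_c_and_k_pe_cache = (kv_cache_nope, kv_cache_pe)

# mla_v1.py side: keep the short names when unpacking, as before.
kv_c, k_pe = kv_c_and_k_pe_cache
```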

@jianzs requested a review from Copilot July 2, 2025 13:48
Copilot AI left a comment

Pull Request Overview

This PR standardizes the KV cache representation by switching from Tensor to tuple-based caches across eager and graph modes, enabling unified storage for downstream zero-copy operations.

  • Unifies KV cache initialization in model_runner_v1.py to produce separate “nope” and “rope” tensors as tuple entries.
  • Updates attention operators and MLA modules (ops/attention.py, attention/mla_v1.py, attention/attention_v1.py) to accept and destructure tuple caches.
  • Adjusts cache-length checks from .numel() to len(...) and refactors cache reshaping to work with per-block lists/tuples (see the sketch after this list).
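A hedged sketch of the tuple-based handling those bullets describe; the dictionary, layer name, and shapes are placeholders, not the exact code from the changed files:

```python
import torch
from typing import Dict, Tuple

# Per-layer cache is now a tuple of (nope_cache, rope_cache).
kv_caches: Dict[str, Tuple[torch.Tensor, torch.Tensor]] = {
    "model.layers.0.attn": (
        torch.zeros(4, 16, 1, 512),   # nope / latent cache
        torch.zeros(4, 16, 1, 64),    # rope cache
    ),
}

cache = kv_caches["model.layers.0.attn"]
# Presence checks move from tensor.numel() to len(tuple):
if len(cache) > 0:
    nope_cache, rope_cache = cache
```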

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File | Description
vllm_ascend/worker/model_runner_v1.py | Generates tuple-based caches, iterates per block in eager mode
vllm_ascend/ops/attention.py | Changed vanilla_chunked_prefill_mla to accept tuple caches
vllm_ascend/attention/mla_v1.py | Updated multiple methods to destructure Tuple[Tensor] caches
vllm_ascend/attention/attention_v1.py | Switched cache checks to len(...) and tuple-based handling

Comments suppressed due to low confidence (2)

vllm_ascend/ops/attention.py:141

  • [nitpick] The annotation tuple[torch.Tensor] describes a one-element tuple. Consider tuple[torch.Tensor, torch.Tensor] to clearly indicate the two cache tensors.
        kv_c_and_k_pe_cache: tuple[torch.Tensor], # (num_blocks, block_size, latent_kv/rope_dim)

vllm_ascend/attention/mla_v1.py:662

  • Ensure Tuple is imported from the typing module at the top of the file, or this annotation will raise a NameError.
        kv_c_and_k_pe_cache: Tuple[torch.Tensor],
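A minimal sketch of the two annotation fixes flagged above; the function names are placeholders, not the real signatures from the changed files:

```python
from typing import Tuple  # import needed at the top of mla_v1.py for the Tuple[...] form

import torch

# Spelling out both elements makes it clear the tuple is (kv_c cache, k_pe cache).
def takes_builtin_generic(
        kv_c_and_k_pe_cache: tuple[torch.Tensor, torch.Tensor]) -> None:  # Python 3.9+
    kv_c_cache, k_pe_cache = kv_c_and_k_pe_cache

def takes_typing_tuple(
        kv_c_and_k_pe_cache: Tuple[torch.Tensor, torch.Tensor]) -> None:
    kv_c_cache, k_pe_cache = kv_c_and_k_pe_cache
```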

kv_cache = torch_npu.npu_format_cast(kv_cache,
acl_format)
kv_cache_list.append(kv_cache)
kv_caches[layer_name] = kv_cache_list
Copilot AI Jul 2, 2025

kv_cache_list is a list but other parts of the code expect a tuple of (nope_cache, rope_cache). Consider converting to a tuple: kv_caches[layer_name] = tuple(kv_cache_list).

Suggested change
kv_caches[layer_name] = kv_cache_list
kv_caches[layer_name] = tuple(kv_cache_list)

rope_dim: int,
attn_metadata: AscendMLAMetadata,
prefix_output: torch.Tensor,
prefix_lse: torch.Tensor,
):
assert len(kv_c_and_k_pe_cache) > 1
Copilot AI Jul 2, 2025

[nitpick] This assertion allows any length ≥2. To enforce exactly two cache parts, consider assert len(kv_c_and_k_pe_cache) == 2 for clearer intent.

Suggested change
assert len(kv_c_and_k_pe_cache) > 1
assert len(kv_c_and_k_pe_cache) == 2

Comment on lines 1023 to 1024
block_table=attn_metadata.decode.
block_table, # type:ignore
Copilot AI Jul 2, 2025

[nitpick] The block_table argument is split across lines, which reduces readability. Merge into one line: block_table=attn_metadata.decode.block_table.

Suggested change
block_table=attn_metadata.decode.
block_table, # type:ignore
block_table=attn_metadata.decode.block_table, # type:ignore

Comment on lines 2149 to 2150
rope_cache_shape, dtype=dtype, device=self.device,
pin_memory=True)
Copilot AI Jul 2, 2025

[nitpick] Specifying pin_memory=True on a non-CPU device tensor has no effect. Consider conditionally setting pin_memory only for CPU allocations or remove it for NPU/GPU tensors.

Suggested change
rope_cache_shape, dtype=dtype, device=self.device,
pin_memory=True)
rope_cache_shape, dtype=dtype, device=self.device,
pin_memory=(self.device.type == 'cpu'))

@wtr0504 commented Jul 3, 2025

If I choose eager mode, does this change matter?

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
@lidenghui1110 (Author) commented Jul 3, 2025

If I choose eager mode, does this change matter?

No, it won't affect behavior in either mode.

@wtr0504 commented Jul 3, 2025

If I choose eager mode, does this change matter?

No, it won't affect behavior in either mode.
Thanks a lot.

github-actions bot commented Jul 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.
