[Refactor] Use tuple as kv cache instead of tensor #1594
base: main
Conversation
[num_tokens, self.num_heads, self.kv_lora_rank],
dtype=q.dtype,
device=q.device)
k_cache = torch.cat(
I am worried about the extra NPU memory consumption this will bring
Yes, it will use extra NPU memory in torch.cat. Do you have any better suggestion?
I noticed that in prefill we also use cat for this kind of thing (e.g. line 830), so the consumption should already have been accounted for in the warmup stage.
And for deepseek-r1, the extra consumption should be one layer's KV cache size, roughly 15G / 61 ≈ 251M, which is probably affordable.
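A quick back-of-the-envelope check of that estimate, taking the 15G total KV cache and 61 layers straight from the comment above (illustrative numbers, not a measurement):

```python
# Per-layer estimate quoted above: total KV cache split evenly across layers.
total_kv_cache_gib = 15   # assumed deepseek-r1 total KV cache size (from the comment)
num_layers = 61           # deepseek-r1 layer count (from the comment)

per_layer_mib = total_kv_cache_gib * 1024 / num_layers
print(f"~{per_layer_mib:.0f} MiB per layer")  # ~252 MiB, matching the ~251M figure
```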
Actually, I have no good suggestion for it either; it seems we must do this concat. But I think it can be removed once ring attention is enabled, right? If so, I think this change is acceptable. Also cc @ganyi1996ppo
This line can be wiped out once ringmla is publicly accessible, and it seems most of the changes in this PR are already contained in PR #950. Can you refactor this after that PR merges? Otherwise there will be lots of conflicts, which may add more barriers to getting #950 merged. cc @wangxiyuan @Yikun
@ganyi1996ppo It's OK. If all the changes in this PR have already been made in #950, this PR can be closed. If so, I hope it can be merged ASAP, since other work that relies on the KV cache will depend on it.
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Force-pushed from d893a91 to c848786
vllm_ascend/ops/attention.py
Outdated
# cached_kv_c: [batch_size, max_context_len, latent_kv]
# cached_k_pe: [batch_size, max_context_len, rope_dim]
cache_kv_c = cache_kv_c_pe[:, :, :latent_kv_dim]
cache_k_pe = cache_kv_c_pe[:, :, latent_kv_dim:]
batch_size = query_lens.size(0)
redundant with L154
batch_size = query_lens.size(0)
done
num_kv_heads, nope_dim)
rope_cache_shape = (num_blocks, block_size,
num_kv_heads, rope_dim)
nope_cache = torch.zeros(
Let's keep the naming unified: either all latent_kv_cache or all nope_cache.
Yes, modified to use kv_cache_nope and kv_cache_pe here. In mla_v1.py, kv_c and k_pe are used for short, as before.
Pull Request Overview
This PR standardizes the KV cache representation by switching from Tensor to tuple-based caches across eager and graph modes, enabling unified storage for downstream zero-copy operations.
- Unifies KV cache initialization in model_runner_v1.py to produce separate "nope" and "rope" tensors as tuple entries.
- Updates attention operators and MLA modules (ops/attention.py, attention/mla_v1.py, attention/attention_v1.py) to accept and destructure tuple caches (see the sketch after this list).
- Adjusts cache-length checks from .numel() to len(...) and refactors cache reshaping to work with per-block lists/tuples.
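A minimal sketch of the consumer-side pattern this describes; the function and variable names are illustrative, not the repository's exact identifiers:

```python
from typing import Tuple

import torch


def run_attention(kv_cache: Tuple[torch.Tensor, torch.Tensor]) -> None:
    """Illustrative consumer of a tuple-based KV cache."""
    # The cache is now a tuple of two tensors instead of one packed tensor,
    # so length checks use len(...) rather than tensor.numel().
    assert len(kv_cache) == 2

    # Destructure the "nope" (latent kv) part and the "rope" part.
    nope_cache, rope_cache = kv_cache

    # Both parts share the (num_blocks, block_size, num_kv_heads, ...) layout,
    # differing only in the last dimension (latent_kv vs rope_dim).
    assert nope_cache.shape[:3] == rope_cache.shape[:3]
```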
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
vllm_ascend/worker/model_runner_v1.py | Generates tuple-based caches, iterates per block in eager mode |
vllm_ascend/ops/attention.py | Changed vanilla_chunked_prefill_mla to accept tuple caches |
vllm_ascend/attention/mla_v1.py | Updated multiple methods to destructure Tuple[Tensor] caches |
vllm_ascend/attention/attention_v1.py | Switched cache checks to len(...) and tuple-based handling |
Comments suppressed due to low confidence (2)
vllm_ascend/ops/attention.py:141
- [nitpick] The annotation tuple[torch.Tensor] doesn't specify the expected element count. For maintainability, consider tuple[torch.Tensor, torch.Tensor] to clearly indicate the two cache tensors.
kv_c_and_k_pe_cache: tuple[torch.Tensor], # (num_blocks, block_size, latent_kv/rope_dim)
vllm_ascend/attention/mla_v1.py:662
- Ensure Tuple is imported from the typing module at the top of the file, or this annotation will raise a NameError.
kv_c_and_k_pe_cache: Tuple[torch.Tensor],
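A minimal sketch of the fix being suggested; the surrounding signature is a hypothetical stand-in, not the actual method at mla_v1.py:662:

```python
# At the top of the file: make the Tuple name available for annotations.
from typing import Tuple

import torch


def forward(kv_c_and_k_pe_cache: Tuple[torch.Tensor, torch.Tensor]) -> int:
    # Without the import above, evaluating this annotation raises NameError when the module loads.
    return len(kv_c_and_k_pe_cache)
```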
kv_cache = torch_npu.npu_format_cast(kv_cache,
acl_format)
kv_cache_list.append(kv_cache)
kv_caches[layer_name] = kv_cache_list
kv_cache_list is a list but other parts of the code expect a tuple of (nope_cache, rope_cache). Consider converting to a tuple: kv_caches[layer_name] = tuple(kv_cache_list).
- kv_caches[layer_name] = kv_cache_list
+ kv_caches[layer_name] = tuple(kv_cache_list)
rope_dim: int,
attn_metadata: AscendMLAMetadata,
prefix_output: torch.Tensor,
prefix_lse: torch.Tensor,
):
assert len(kv_c_and_k_pe_cache) > 1
[nitpick] This assertion allows any length ≥ 2. To enforce exactly two cache parts, consider assert len(kv_c_and_k_pe_cache) == 2 for clearer intent.
- assert len(kv_c_and_k_pe_cache) > 1
+ assert len(kv_c_and_k_pe_cache) == 2
vllm_ascend/attention/mla_v1.py
Outdated
block_table=attn_metadata.decode.
block_table,  # type:ignore
[nitpick] The block_table argument is split across lines, which reduces readability. Merge it into one line: block_table=attn_metadata.decode.block_table.
- block_table=attn_metadata.decode.
- block_table,  # type:ignore
+ block_table=attn_metadata.decode.block_table,  # type:ignore
rope_cache_shape, dtype=dtype, device=self.device,
pin_memory=True)
[nitpick] Specifying pin_memory=True on a non-CPU device tensor has no effect. Consider setting pin_memory conditionally, only for CPU allocations, or removing it for NPU/GPU tensors.
- rope_cache_shape, dtype=dtype, device=self.device,
- pin_memory=True)
+ rope_cache_shape, dtype=dtype, device=self.device,
+ pin_memory=(self.device.type == 'cpu'))
If I choose eager mode, does this change matter?
No, it won't have any influence on either mode.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Previously, graph mode stored the KV cache as a Tuple while eager mode used a Tensor. To enable the PD-separated zero-copy mode in the later PR #950, where both P and D nodes must share the same storage format, this PR unifies the KV cache type to Tuple for both eager and graph modes.
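A minimal sketch of the allocation-side shape of this change; the helper name, shapes, and arguments are illustrative assumptions, not the exact code in model_runner_v1.py:

```python
from typing import Tuple

import torch


def allocate_layer_kv_cache(num_blocks: int, block_size: int, num_kv_heads: int,
                            nope_dim: int, rope_dim: int, dtype: torch.dtype,
                            device: torch.device) -> Tuple[torch.Tensor, torch.Tensor]:
    """Allocate one layer's KV cache as a (nope, rope) tuple, shared by eager and graph modes."""
    nope_cache = torch.zeros((num_blocks, block_size, num_kv_heads, nope_dim),
                             dtype=dtype, device=device)
    rope_cache = torch.zeros((num_blocks, block_size, num_kv_heads, rope_dim),
                             dtype=dtype, device=device)
    # Both modes now see the same tuple layout, so P and D nodes can share storage directly.
    return nope_cache, rope_cache
```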
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI passed.