[BugFix] Fix full cuda graph slot_mapping #21228
Conversation
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request correctly fixes a critical bug in the CUDA graph capture logic for `slot_mapping`. The change from slicing by `num_reqs` to `num_tokens` is accurate, as `slot_mapping` is a per-token data structure. This prevents potential memory corruption and incorrect model outputs when using full CUDA graph capture. The fix is well-reasoned and I approve of the change.
```diff
      slot_mapping=self.input_batch.
-     block_table[kv_cache_group_id].slot_mapping[:num_reqs])
+     block_table[kv_cache_group_id].slot_mapping[:num_tokens])
```
This is a great catch and a critical fix. The `slot_mapping` tensor maps tokens to their slots in the KV cache, so its length must be proportional to the number of tokens, not the number of requests.

Previously, slicing with `[:num_reqs]` was incorrect, especially when `num_tokens` (from `cudagraph_capture_sizes`) was greater than `max_num_seqs`. In that scenario, `num_reqs` would be capped at `max_num_seqs`, leading to an undersized `slot_mapping` being captured in the CUDA graph. This could cause memory corruption or incorrect outputs during graph replay, as you've pointed out.

Using `[:num_tokens]` correctly sizes the `slot_mapping` for the dummy run, ensuring the CUDA graph is captured with the correct metadata. This prevents the weird output artifacts and ensures correctness.
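To make the sizing failure concrete, here is a minimal sketch (made-up numbers and buffer names, not the vLLM implementation):

```python
# Sketch of the capture-time sizing described above, using assumed values.
import torch

max_num_seqs = 8                           # scheduler cap on concurrent requests
cudagraph_capture_sizes = [1, 2, 4, 8, 16, 32]

num_tokens = max(cudagraph_capture_sizes)  # dummy run padded to 32 tokens
num_reqs = min(num_tokens, max_num_seqs)   # capped at 8 requests

# Persistent per-token buffer; the CUDA graph must be captured against a view
# large enough to cover every padded token.
slot_mapping_buffer = torch.zeros(4096, dtype=torch.int64)

undersized = slot_mapping_buffer[:num_reqs]    # 8 entries, too short for 32 tokens
correct = slot_mapping_buffer[:num_tokens]     # 32 entries, one per token

print(undersized.numel(), correct.numel())     # 8 32
```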
CC @mgoin
Great find, let's get this in asap!
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Essential Elements of an Effective PR Description Checklist
(Optional) Documentation update, such as supported_models.md and examples for a new model.

Purpose
Fix a bug related to #20466 and #21196, where full cudagraph may capture a `slot_mapping` of insufficient length when max_num_seqs < max_capture_size. This could lead to weird output contents that may be irrelevant to the prompts.

Test Plan
No further test plan. It works fine locally after the fix.
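For reference, a hedged sketch of how the condition above (max_num_seqs smaller than the largest capture size) might be exercised locally; this is not from the PR, and the model choice, `compilation_config` keys, and backend support for full CUDA graphs are assumptions:

```python
# Hypothetical local check, assuming the attention backend in use supports
# full CUDA graph capture and that CompilationConfig exposes these fields.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",   # placeholder model
    max_num_seqs=8,              # smaller than the largest capture size below
    compilation_config={
        "full_cuda_graph": True,
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32],
    },
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
# Before the fix, replayed graphs with an undersized slot_mapping could produce
# text unrelated to the prompt; after the fix, the output should be coherent.
print(out[0].outputs[0].text)
```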
Test Result
(Optional) Documentation Update