[BugFix] Fix potential cuda-graph IMA #21196
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs initially. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review

This PR addresses a bug in full CUDA graph mode related to incorrect padding of `slot_mapping`. The removal of the ineffective `__post_init__` method is correct. However, the new logic targets an incorrect tensor; the critical comment suggests a fix to target the correct `slot_mapping` tensors within `self.input_batch.block_table`.
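The review's point can be illustrated with a minimal sketch. The names below (`BlockTable`, `block_tables`) are illustrative, not vLLM's actual structures: the idea is that each KV-cache group keeps its own `slot_mapping` tensor inside its block table, and the padding region of *each* of those tensors must be filled, since those are what the attention kernels actually read.

```python
import torch

# Hypothetical stand-in for a per-KV-cache-group block table holding
# a persistent slot_mapping buffer (names are illustrative only).
class BlockTable:
    def __init__(self, max_tokens: int):
        self.slot_mapping = torch.zeros(max_tokens, dtype=torch.int64)

# Two KV-cache groups, each with its own slot_mapping buffer.
block_tables = [BlockTable(8), BlockTable(8)]

num_tokens = 3  # tokens actually scheduled this step
for bt in block_tables:
    # Fill everything past the real tokens with -1 in *each* group's
    # slot_mapping, so CUDA-graph padding never reads stale slot ids.
    bt.slot_mapping[num_tokens:].fill_(-1)
```

The key design point is that the fill happens on the full persistent buffers, not on a single tensor elsewhere, so every group's padded tail holds the -1 sentinel.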
Seems reasonable to me; I guess there shouldn't be a performance concern, aside from this potentially being moved out of the cudagraph?
Force-pushed from ec2e999 to 5d65c32
Basically we just fill more during input prep; when it was in CommonAttentionMetadata it would only fill the already-sliced tensor (we want to fill the whole thing in case we pad for cuda-graphs).
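A minimal sketch of that distinction, with assumed names (`MAX_NUM_TOKENS`, `prepare_inputs` are illustrative, not vLLM's actual API): input preparation writes real slot ids for the scheduled tokens, then fills the rest of the *whole* persistent buffer with -1, rather than only touching the already-sliced view.

```python
import torch

# Persistent buffer sized for the maximum batch; only the first
# `num_tokens` entries are meaningful on any given step.
MAX_NUM_TOKENS = 8
slot_mapping = torch.empty(MAX_NUM_TOKENS, dtype=torch.int64)

def prepare_inputs(slots: torch.Tensor, num_tokens: int) -> None:
    # Write the real slot ids for the scheduled tokens...
    slot_mapping[:num_tokens] = slots
    # ...then fill the rest of the whole buffer with -1, so any extra
    # tokens introduced later by CUDA-graph padding read a harmless
    # sentinel instead of stale slot ids (the potential IMA).
    slot_mapping[num_tokens:].fill_(-1)

prepare_inputs(torch.tensor([3, 7, 1]), num_tokens=3)
```

Filling only a sliced tensor of length `num_tokens` would leave the padded tail holding whatever the previous step wrote, which is exactly the stale data this PR guards against.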
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Himanshu Jaju <hj@mistral.ai>
Essential Elements of an Effective PR Description Checklist
- (Optional) Update `supported_models.md` and `examples` for a new model.

Purpose
Cuda-graph padding happens after input preparation, so it's safer to fill with -1 here (this is also closer to the behavior before #20466). No errors have been reported yet; this is just preventative.
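The ordering concern can be sketched in a few lines. The capture sizes and helper below are assumptions for illustration, not vLLM's actual configuration: CUDA-graph capture rounds the token count up to a fixed capture size, so entries between the real token count and the padded count are read by the graph even though input preparation never wrote them this step.

```python
# Hypothetical capture sizes; real deployments configure their own.
CUDAGRAPH_CAPTURE_SIZES = [1, 2, 4, 8, 16]

def padded_num_tokens(num_tokens: int) -> int:
    # Round up to the smallest capture size that fits this batch;
    # the gap between num_tokens and the result is the padding region
    # that must hold a -1 sentinel rather than stale slot ids.
    return next(s for s in CUDAGRAPH_CAPTURE_SIZES if s >= num_tokens)
```

For example, a batch of 3 tokens replays a graph captured for 4, so one padded entry is read; filling during input prep guarantees it is -1.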
Test Plan
lm_eval
Test Result
(Optional) Documentation Update