[WIP] Run eagle with full cudagraph #20190
base: main
```diff
@@ -1860,7 +1860,7 @@ def maybe_randomize_inputs(self, input_ids: torch.Tensor):
         Randomize input_ids if VLLM_RANDOMIZE_DP_DUMMY_INPUTS is set.
         This is to help balance expert-selection
-         - during profile_run
+         - during DP rank dummy run
         """
         dp_size = self.vllm_config.parallel_config.data_parallel_size
         randomize_inputs = envs.VLLM_RANDOMIZE_DP_DUMMY_INPUTS and dp_size > 1
```
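For context, here is a minimal sketch of what the randomization described in that docstring could look like. The helper signature, the `vocab_size` parameter, and the env-var parsing below are illustrative assumptions, not vLLM's actual implementation; only the `VLLM_RANDOMIZE_DP_DUMMY_INPUTS` flag and the `dp_size > 1` gate come from the diff.

```python
# Illustrative sketch only: randomize dummy-run token ids so that each DP
# rank routes to a different mix of experts during the dummy run.
import os
import torch

RANDOMIZE = bool(int(os.getenv("VLLM_RANDOMIZE_DP_DUMMY_INPUTS", "0")))

def maybe_randomize_inputs(input_ids: torch.Tensor,
                           dp_size: int,
                           vocab_size: int) -> torch.Tensor:
    """Return uniformly random token ids when the flag is set and DP > 1."""
    if RANDOMIZE and dp_size > 1:
        # Uniform random ids spread expert selection across DP ranks.
        return torch.randint_like(input_ids, low=0, high=vocab_size)
    return input_ids
```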
```diff
@@ -1982,7 +1982,7 @@ def _dummy_run(
         if self.speculative_config and self.speculative_config.use_eagle():
             assert isinstance(self.drafter, EagleProposer)
-            self.drafter.dummy_run(num_tokens)
+            self.drafter.dummy_run(num_tokens, attn_metadata)
```
Here's my hypothesis: does the eagle forward pass use the tensors in the attn_metadata? If so, every time we invoke the eagle head, we may need to copy data into the tensors in the attn_metadata.

You are right, this is partially the reason for the numerical gap. As an experiment, I copied the attn_metadata constructed for eager mode into the captured attn_metadata in the latest commit. As a result I got better numerics, but there is still a gap compared with piecewise mode. So it seems there might still be some discrepancy in attention computation between eager mode and cudagraph mode. I will try to investigate more, and I would also appreciate any suggestions on what to check from the torch.compile perspective.
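The thread above suggests that, with full cudagraph capture, any attention-metadata tensors baked into the graph keep fixed storage, so fresh per-step values have to be copied into those captured buffers in place before each replay. Below is a rough sketch of that idea; the field names and the `copy_into_captured` helper are assumptions for illustration, not vLLM's actual metadata schema.

```python
def copy_into_captured(captured_meta, eager_meta) -> None:
    """Copy per-step values from eagerly built attention metadata into the
    persistent tensors that were captured by the CUDA graph.

    Field names below are hypothetical; a real implementation would copy
    whichever tensors the attention backend reads during the forward pass.
    """
    for name in ("query_start_loc", "seq_lens", "slot_mapping"):
        src = getattr(eager_meta, name, None)
        dst = getattr(captured_meta, name, None)
        if src is None or dst is None:
            continue
        # In-place copy preserves the captured tensor's storage address,
        # which is what the replayed graph actually reads from.
        dst[: src.shape[0]].copy_(src, non_blocking=True)
```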
Further context from the same diff:

```python
        logit_indices = np.cumsum(num_scheduled_tokens) - 1
        return hidden_states, hidden_states[logit_indices]
```
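For reference, `np.cumsum(num_scheduled_tokens) - 1` computes, for each request, the flattened index of its last scheduled token, i.e. the positions whose hidden states are gathered and returned. A small worked example:

```python
import numpy as np

# Three requests with 3, 1 and 4 scheduled tokens, packed back to back
# into a flattened token dimension of length 8.
num_scheduled_tokens = np.array([3, 1, 4])

# Cumulative sums give exclusive end offsets [3, 4, 8]; subtracting 1
# yields the index of the last token of each request.
logit_indices = np.cumsum(num_scheduled_tokens) - 1
print(logit_indices)  # [2 3 7]
```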
The direct call to `json.loads` can cause the script to crash with a `json.JSONDecodeError` if an invalid JSON string is passed to the `--compilation_config` argument. Consider adding a try-except block to handle potential parsing errors gracefully.
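A minimal sketch of the suggested guard; the argparse setup around it is assumed, since the surrounding script is not shown here.

```python
import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--compilation_config", type=str, default=None)
args = parser.parse_args()

compilation_config = None
if args.compilation_config is not None:
    try:
        compilation_config = json.loads(args.compilation_config)
    except json.JSONDecodeError as exc:
        # Fail with a readable message instead of an uncaught traceback.
        sys.exit(f"--compilation_config is not valid JSON: {exc}")
```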