[Not for merge] Unshift eagle prefill #21008
Conversation
Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: morgendave <morgendave@gmail.com>
rope change rope change rebase
Signed-off-by: morgendave <morgendave@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck CI runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
@morgendave has imported this pull request. If you are a Meta employee, you can view this in D78360911.
Code Review
This pull request introduces a new prefill mechanism for Eagle speculative decoding, along with support for multimodal models and KV-cache sharing. The changes are extensive and touch core components of the engine.
My review identified a critical issue in the test parameterization that will prevent the new tests from running correctly. I also found two significant performance concerns in hot paths, where Python loops iterate over batch items instead of using vectorized operations. Addressing these points is important for the correctness and performance of this new feature.
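The loop-versus-vectorized concern can be sketched as follows. This is an illustrative example, not code from the PR: plain Python lists stand in for tensors, and the helper name `flat_gather_index` is hypothetical. In the actual model runner, the equivalent would be a single `torch.index_select`/`index_copy_` over a precomputed index tensor rather than one slice copy per request.

```python
def flat_gather_index(starts, ends):
    """Concatenate per-request [start, end) ranges into one flat index list."""
    return [i for s, e in zip(starts, ends) for i in range(s, e)]

# Stand-ins for target_positions and per-request start/end offsets.
target = list(range(100, 120))
starts, ends = [0, 8], [5, 12]

index = flat_gather_index(starts, ends)
adjusted = [target[i] for i in index]  # one gather instead of a Python loop of slice copies
print(adjusted)  # → [100, 101, 102, 103, 104, 108, 109, 110, 111]
```

Precomputing the index once per batch moves the per-request bookkeeping out of the hot path and lets a single kernel do all the copying.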
tests/v1/e2e/test_spec_decode.py
Outdated
@pytest.mark.parametrize(
    "model_setup,mm_enabled", [
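For context on the parameterization issue flagged here, a correctly paired setup might look like the sketch below; the case list, model names, and tuple shape are hypothetical, not taken from the PR:

```python
import pytest

# Each model_setup tuple is explicitly paired with its own mm_enabled flag,
# so pytest expands this into one (model_setup, mm_enabled) pair per test case.
SPEC_DECODE_CASES = [
    (("eagle", "meta-llama/Llama-3.1-8B-Instruct", 1), False),
    (("eagle", "llava-hf/llava-1.5-7b-hf", 1), True),
]

@pytest.mark.parametrize("model_setup,mm_enabled", SPEC_DECODE_CASES)
def test_eagle_prefill(model_setup, mm_enabled):
    method, model_name, tp_size = model_setup
    assert method == "eagle"
    assert isinstance(mm_enabled, bool)
```

Listing explicit `(model_setup, mm_enabled)` tuples avoids the cross-product expansion that a pair of stacked `parametrize` decorators would produce.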
    adjusted_positions[current_pos:current_pos + req_len].copy_(
        target_positions[start_idx:end_idx])

    adjusted_slot_mapping[current_pos:current_pos + req_len].copy_(
        target_slot_mapping[start_idx:end_idx])

    # Put the first prefill hidden state in the first position
    # and shift all the others; this matches the sequence, as
    # non-shifting would include the first prefill token.
    adjusted_hidden_states[current_pos + 1:current_pos + req_len].copy_(
        target_hidden_states[start_idx + 1:end_idx])

    adjusted_hidden_states[current_pos] = prefill_first_hiddens[i]
    current_pos += req_len
    cu_num_tokens_index += 1

elif is_full_prefill:
    # Category 2: full prefill with decode:
    # copy tokens and add one position.
    pos = target_positions[end_idx - 1] + 1
    block_number = pos // self.block_size
    block_number = block_table[i][block_number].item()
    block_offset = pos % self.block_size
    adjusted_slot = block_number * self.block_size + block_offset

    if not self.draft_prefill_kv_sharing_from_base:
        # Copy the original and adjust the one additional token
        # for position, slot mapping and hidden state.
        adjusted_token_ids[current_pos:current_pos + req_len].copy_(
            target_token_ids[start_idx:end_idx])

        adjusted_positions[current_pos:current_pos + req_len].copy_(
            target_positions[start_idx:end_idx])
        adjusted_positions[current_pos + req_len] = pos

        adjusted_slot_mapping[current_pos:current_pos + req_len].copy_(
            target_slot_mapping[start_idx:end_idx])
        adjusted_slot_mapping[current_pos + req_len] = adjusted_slot

        adjusted_hidden_states[
            current_pos + 1:current_pos + req_len + 1].copy_(
                target_hidden_states[start_idx:end_idx])

        adjusted_hidden_states[current_pos] = prefill_first_hiddens[i]
        current_pos += req_len + 1
    else:
        # If KV copy is enabled, all of the prefill tokens are
        # skipped; only keep the prefill output token.
        adjusted_positions[current_pos] = pos
        adjusted_slot_mapping[current_pos] = adjusted_slot
        adjusted_hidden_states[current_pos] = (
            target_hidden_states[end_idx - 1])
        current_pos += 1

    cu_num_tokens_index += 1

else:
    # Category 3: full decode - shift tokens.
    # Due to the additional token in full prefill, all the
    # corresponding decode rounds shift by one token.
    adjusted_token_ids[current_pos:current_pos + req_len - 1].copy_(
        target_token_ids[start_idx + 1:end_idx])

    adjusted_positions[current_pos:current_pos + req_len].copy_(
        target_positions[start_idx:end_idx] + 1)

    adjusted_slot_mapping[current_pos:current_pos + req_len - 1].copy_(
        target_slot_mapping[start_idx + 1:end_idx])

    pos = adjusted_positions[current_pos + req_len - 1]
    block_number = pos // self.block_size
    block_number = block_table[i][block_number].item()
    block_offset = pos % self.block_size
    adjusted_slot_mapping[current_pos + req_len - 1] = (
        block_number * self.block_size + block_offset)

    adjusted_hidden_states[current_pos:current_pos + req_len].copy_(
        target_hidden_states[start_idx:end_idx])

    current_pos += req_len
    cu_num_tokens_index += 1

# Update the cumulative token count for this request.
updated_cu_num_tokens[cu_num_tokens_index] = current_pos
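The block-table lookup above, which converts a logical token position into a physical KV-cache slot (used in both Category 2 and Category 3), can be sketched in isolation. All values are hypothetical, and plain Python stands in for the tensor operations:

```python
BLOCK_SIZE = 16  # stand-in for self.block_size

def slot_for(pos, block_table, block_size=BLOCK_SIZE):
    """Map a logical token position to a physical KV-cache slot."""
    logical_block = pos // block_size            # which logical block the position falls in
    physical_block = block_table[logical_block]  # block-table indirection to the physical block
    offset = pos % block_size                    # offset within that block
    return physical_block * block_size + offset

# One request whose logical blocks 0..2 map to physical blocks 7, 3, 9.
block_table = [7, 3, 9]
print(slot_for(17, block_table))  # → 49 (logical block 1 -> physical block 3, offset 1)
```

This is why the extra token appended in the full-prefill and full-decode categories needs its slot recomputed: its position may cross into the next logical block, so the slot cannot be derived by simply adding 1 to the previous slot.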
Signed-off-by: morgendave <morgendave@gmail.com>
Force-pushed from e19a8ee to 2c4e856.
@morgendave has imported this pull request. If you are a Meta employee, you can view this in D78360911.
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.
Purpose
Test Plan
Test Result
(Optional) Documentation Update