
[Not for merge] Unshift eagle prefill #21008


Draft: morgendave wants to merge 9 commits into main from unshift-eagle-prefill

Conversation

@morgendave (Collaborator) commented on Jul 15, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@morgendave changed the title from "Unshift eagle prefill" to "[just for coding sharing] Unshift eagle prefill" on Jul 15, 2025
@mergify bot added labels on Jul 15, 2025: documentation (Improvements or additions to documentation), llama (Related to Llama models), new-model (Requests to new models), speculative-decoding

mergify bot commented Jul 15, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @morgendave.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@facebook-github-bot

@morgendave has imported this pull request. If you are a Meta employee, you can view this in D78360911.

@morgendave changed the title from "[just for coding sharing] Unshift eagle prefill" to "[Not for merge] Unshift eagle prefill" on Jul 15, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a new prefill mechanism for Eagle speculative decoding, which also adds support for multimodal models and KV cache sharing. The changes are extensive and touch core components of the engine.

My review has identified a critical issue in the test parameterization that will prevent the new tests from running correctly. Additionally, I've found a couple of significant concerns in performance-critical paths where Python loops iterate over batch items instead of using more efficient vectorized operations. Addressing these points will be important for the correctness and performance of this new feature.



@pytest.mark.parametrize(
"model_setup,mm_enabled", [

Severity: critical

The pytest.mark.parametrize decorator is missing the prefill_shift parameter. This will cause the test to fail with a fixture 'prefill_shift' not found error.

Suggested change:
-    "model_setup,mm_enabled", [
+    "model_setup,mm_enabled,prefill_shift", [

Comment on lines +228 to +344

adjusted_positions[current_pos:current_pos +
req_len].copy_(
target_positions[start_idx:end_idx])

adjusted_slot_mapping[current_pos:current_pos +
req_len].copy_(target_slot_mapping[
start_idx:end_idx])

# Put the first prefill hidden state in the first position
# and shift all the other ones, this matches the sequence
# as non-shifting will include the first prefill token
adjusted_hidden_states[current_pos + 1:current_pos +
req_len].copy_(
target_hidden_states[start_idx +
1:end_idx])

adjusted_hidden_states[
current_pos] = prefill_first_hiddens[i]
current_pos += req_len
cu_num_tokens_index += 1

elif is_full_prefill:
# Category 2: Full prefill with decode:
# copy tokens and add one position
pos = target_positions[end_idx - 1] + 1
block_number = pos // self.block_size
block_number = block_table[i][block_number].item()
block_offset = pos % self.block_size
adjusted_slot = (block_number * self.block_size + block_offset)

if not self.draft_prefill_kv_sharing_from_base:
# copy the original and adjust the one additional token
# for position, slot mapping and hidden state
adjusted_token_ids[current_pos:current_pos +
req_len].copy_(
target_token_ids[start_idx:end_idx])

adjusted_positions[current_pos:current_pos +
req_len].copy_(
target_positions[start_idx:end_idx])
adjusted_positions[current_pos + req_len] = pos

adjusted_slot_mapping[current_pos:current_pos +
req_len].copy_(target_slot_mapping[
start_idx:end_idx])
adjusted_slot_mapping[current_pos +
req_len] = (adjusted_slot)

adjusted_hidden_states[
current_pos + 1:current_pos + req_len + 1].copy_(
target_hidden_states[start_idx:end_idx])

adjusted_hidden_states[
current_pos] = prefill_first_hiddens[i]
current_pos += req_len + 1
else:
# if we enable copy kv then all of the prefill tokens
# are skipped. Only keep the prefill output token
adjusted_positions[current_pos] = pos
adjusted_slot_mapping[current_pos] = adjusted_slot
adjusted_hidden_states[current_pos] = (
target_hidden_states[end_idx - 1])
current_pos += 1

cu_num_tokens_index += 1

else:
# Category 3: Full decode - shift tokens
# Due to additional token in full prefill already,
# all the corresponding decode rounds will shift one tokens
adjusted_token_ids[current_pos:current_pos + req_len -
1].copy_(target_token_ids[start_idx +
1:end_idx])

adjusted_positions[current_pos:current_pos + req_len].copy_(
target_positions[start_idx:end_idx] + 1)

adjusted_slot_mapping[current_pos:current_pos + req_len -
1].copy_(target_slot_mapping[start_idx +
1:end_idx])

pos = adjusted_positions[current_pos + req_len - 1]
block_number = pos // self.block_size
block_number = block_table[i][block_number].item()
block_offset = pos % self.block_size
adjusted_slot_mapping[current_pos + req_len -
1] = (block_number * self.block_size +
block_offset)

adjusted_hidden_states[current_pos:current_pos +
req_len].copy_(target_hidden_states[
start_idx:end_idx])

current_pos += req_len
cu_num_tokens_index += 1

# Update the cumulative token count for this request
updated_cu_num_tokens[cu_num_tokens_index] = current_pos


Severity: high

This function processes requests in a batch using a Python for loop. In a performance-critical path like this, iterating over requests one by one can introduce significant overhead. Consider refactoring this logic to be fully vectorized using PyTorch operations.
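For context, a minimal sketch of one place such a refactor could start, assuming tensors shaped like those used in the loop above (pos as a per-request position tensor, block_table as a [num_reqs, max_blocks] table). This is not the PR's implementation, only an illustration of replacing the per-request .item() lookups with a batched gather.

import torch

def compute_slots_batched(pos: torch.Tensor,
                          block_table: torch.Tensor,
                          block_size: int) -> torch.Tensor:
    # pos:         [num_reqs] new position to append for each request
    # block_table: [num_reqs, max_blocks_per_req] logical-to-physical blocks
    block_idx = pos // block_size        # logical block index per request
    block_offset = pos % block_size      # offset within that block
    # One batched gather replaces block_table[i][block_number].item()
    # executed once per request inside the Python loop.
    block_numbers = block_table.gather(1, block_idx.unsqueeze(1)).squeeze(1)
    return block_numbers * block_size + block_offset

The variable-length token, position, and hidden-state copies could similarly be expressed with index tensors derived from the cumulative token counts (for example via torch.repeat_interleave), though that is a larger restructuring than this sketch covers.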

Signed-off-by: morgendave <morgendave@gmail.com>
@morgendave force-pushed the unshift-eagle-prefill branch from e19a8ee to 2c4e856 on July 15, 2025 at 21:06
@facebook-github-bot

@morgendave has imported this pull request. If you are a Meta employee, you can view this in D78360911.

Labels: documentation, llama, needs-rebase, new-model, speculative-decoding, v1