[Not for merge] Unshift eagle prefill #21008
Conversation
Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: morgendave <morgendave@gmail.com>
rope change rope change rebase
Signed-off-by: morgendave <morgendave@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck CI runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
@morgendave has imported this pull request. If you are a Meta employee, you can view this in D78360911.
Code Review
This pull request introduces a new prefill mechanism for Eagle speculative decoding, along with support for multimodal models and KV-cache sharing. The changes are extensive and touch core components of the engine.
My review identified a critical issue in the test parameterization that will prevent the new tests from running correctly. I also found two significant performance concerns in hot paths, where Python loops iterate over batch items instead of using vectorized operations. Addressing these points is important for the correctness and performance of this new feature.
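The loop-versus-vectorized concern can be sketched as follows. This is an illustrative example, not code from the PR: plain Python lists stand in for tensors, and the helper name `flat_gather_index` is hypothetical. In the actual model runner, the equivalent would be a single `torch.index_select`/`index_copy_` over a precomputed index tensor rather than one slice copy per request.

```python
def flat_gather_index(starts, ends):
    """Concatenate per-request [start, end) ranges into one flat index list."""
    return [i for s, e in zip(starts, ends) for i in range(s, e)]

# Stand-ins for target_positions and per-request start/end offsets.
target = list(range(100, 120))
starts, ends = [0, 8], [5, 12]

index = flat_gather_index(starts, ends)
adjusted = [target[i] for i in index]  # one gather instead of a Python loop of slice copies
print(adjusted)  # → [100, 101, 102, 103, 104, 108, 109, 110, 111]
```

Precomputing the index once per batch moves the per-request bookkeeping out of the hot path and lets a single kernel do all the copying.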
tests/v1/e2e/test_spec_decode.py
Outdated
@pytest.mark.parametrize(
    "model_setup,mm_enabled", [
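For context on the parameterization issue flagged here, a correctly paired setup might look like the sketch below; the case list, model names, and tuple shape are hypothetical, not taken from the PR:

```python
import pytest

# Each model_setup tuple is explicitly paired with its own mm_enabled flag,
# so pytest expands this into one (model_setup, mm_enabled) pair per test case.
SPEC_DECODE_CASES = [
    (("eagle", "meta-llama/Llama-3.1-8B-Instruct", 1), False),
    (("eagle", "llava-hf/llava-1.5-7b-hf", 1), True),
]

@pytest.mark.parametrize("model_setup,mm_enabled", SPEC_DECODE_CASES)
def test_eagle_prefill(model_setup, mm_enabled):
    method, model_name, tp_size = model_setup
    assert method == "eagle"
    assert isinstance(mm_enabled, bool)
```

Listing explicit `(model_setup, mm_enabled)` tuples avoids the cross-product expansion that a pair of stacked `parametrize` decorators would produce.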
    adjusted_positions[current_pos:current_pos + req_len].copy_(
        target_positions[start_idx:end_idx])

    adjusted_slot_mapping[current_pos:current_pos + req_len].copy_(
        target_slot_mapping[start_idx:end_idx])

    # Put the first prefill hidden state in the first position
    # and shift all the others; this matches the sequence, as
    # non-shifting would include the first prefill token.
    adjusted_hidden_states[current_pos + 1:current_pos + req_len].copy_(
        target_hidden_states[start_idx + 1:end_idx])

    adjusted_hidden_states[current_pos] = prefill_first_hiddens[i]
    current_pos += req_len
    cu_num_tokens_index += 1

elif is_full_prefill:
    # Category 2: full prefill with decode:
    # copy tokens and add one position.
    pos = target_positions[end_idx - 1] + 1
    block_number = pos // self.block_size
    block_number = block_table[i][block_number].item()
    block_offset = pos % self.block_size
    adjusted_slot = block_number * self.block_size + block_offset

    if not self.draft_prefill_kv_sharing_from_base:
        # Copy the original and adjust the one additional token
        # for position, slot mapping and hidden state.
        adjusted_token_ids[current_pos:current_pos + req_len].copy_(
            target_token_ids[start_idx:end_idx])

        adjusted_positions[current_pos:current_pos + req_len].copy_(
            target_positions[start_idx:end_idx])
        adjusted_positions[current_pos + req_len] = pos

        adjusted_slot_mapping[current_pos:current_pos + req_len].copy_(
            target_slot_mapping[start_idx:end_idx])
        adjusted_slot_mapping[current_pos + req_len] = adjusted_slot

        adjusted_hidden_states[
            current_pos + 1:current_pos + req_len + 1].copy_(
                target_hidden_states[start_idx:end_idx])

        adjusted_hidden_states[current_pos] = prefill_first_hiddens[i]
        current_pos += req_len + 1
    else:
        # If KV copy is enabled, all of the prefill tokens are
        # skipped; only keep the prefill output token.
        adjusted_positions[current_pos] = pos
        adjusted_slot_mapping[current_pos] = adjusted_slot
        adjusted_hidden_states[current_pos] = (
            target_hidden_states[end_idx - 1])
        current_pos += 1

    cu_num_tokens_index += 1

else:
    # Category 3: full decode - shift tokens.
    # Due to the additional token in full prefill, all the
    # corresponding decode rounds shift by one token.
    adjusted_token_ids[current_pos:current_pos + req_len - 1].copy_(
        target_token_ids[start_idx + 1:end_idx])

    adjusted_positions[current_pos:current_pos + req_len].copy_(
        target_positions[start_idx:end_idx] + 1)

    adjusted_slot_mapping[current_pos:current_pos + req_len - 1].copy_(
        target_slot_mapping[start_idx + 1:end_idx])

    pos = adjusted_positions[current_pos + req_len - 1]
    block_number = pos // self.block_size
    block_number = block_table[i][block_number].item()
    block_offset = pos % self.block_size
    adjusted_slot_mapping[current_pos + req_len - 1] = (
        block_number * self.block_size + block_offset)

    adjusted_hidden_states[current_pos:current_pos + req_len].copy_(
        target_hidden_states[start_idx:end_idx])

    current_pos += req_len
    cu_num_tokens_index += 1

# Update the cumulative token count for this request.
updated_cu_num_tokens[cu_num_tokens_index] = current_pos
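The block-table lookup above, which converts a logical token position into a physical KV-cache slot (used in both Category 2 and Category 3), can be sketched in isolation. All values are hypothetical, and plain Python stands in for the tensor operations:

```python
BLOCK_SIZE = 16  # stand-in for self.block_size

def slot_for(pos, block_table, block_size=BLOCK_SIZE):
    """Map a logical token position to a physical KV-cache slot."""
    logical_block = pos // block_size            # which logical block the position falls in
    physical_block = block_table[logical_block]  # block-table indirection to the physical block
    offset = pos % block_size                    # offset within that block
    return physical_block * block_size + offset

# One request whose logical blocks 0..2 map to physical blocks 7, 3, 9.
block_table = [7, 3, 9]
print(slot_for(17, block_table))  # → 49 (logical block 1 -> physical block 3, offset 1)
```

This is why the extra token appended in the full-prefill and full-decode categories needs its slot recomputed: its position may cross into the next logical block, so the slot cannot be derived by simply adding 1 to the previous slot.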
Signed-off-by: morgendave <morgendave@gmail.com>
Force-pushed from e19a8ee to 2c4e856.
@morgendave has imported this pull request. If you are a Meta employee, you can view this in D78360911.
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.
Purpose
Test Plan
Test Result
(Optional) Documentation Update