[V1][Neuron] Neuron chunked prefill V1 impl #21490


Open

elaineyz wants to merge 6 commits into base: main

Conversation

elaineyz (Contributor) commented Jul 24, 2025

Purpose

This is the first PR to add support for NxD Inference on the vLLM V1 architecture.

This PR offers a native implementation of Chunked Prefill on Neuron, as described in Change 1 of RFC #21082.
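For context, enabling the feature from the user side looks roughly like the sketch below (the model name and token budgets are illustrative, not taken from this PR; enable_chunked_prefill and max_num_batched_tokens are existing vLLM engine arguments):

from vllm import LLM

# Chunked prefill splits long prompts into chunks of at most
# max_num_batched_tokens tokens, so prefill work can be batched
# together with ongoing decode steps.
llm = LLM(
    model="meta-llama/Llama-3.2-1B",  # illustrative model
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,  # illustrative chunk budget
    max_model_len=2048,
)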

Test Plan

An E2E test for chunked prefill is added. Note that the previous E2E tests, which targeted the vLLM V0 engine, have been removed from the Neuron Buildkite script in anticipation of the V0 deprecation.
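A plausible shape for such an E2E test, sketched here for illustration (the model, prompt, and assertions are assumptions, not the PR's actual test):

from vllm import LLM, SamplingParams

def test_chunked_prefill_e2e():
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
        enable_chunked_prefill=True,
        max_num_batched_tokens=128,  # small budget so prompts actually get chunked
        max_model_len=1024,
    )
    prompts = ["The capital of France is"]  # illustrative prompt
    # Greedy sampling makes the output deterministic, so an exact-match
    # comparison against reference completions is valid.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16)
    outputs = llm.generate(prompts, sampling_params)
    # The real test would compare against expected strings here.
    assert outputs[0].outputs[0].text.strip()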

Test Result

Manually ran the test locally on a Trainium instance, and the test passed.

(Optional) Documentation Update

elaineyz and others added 3 commits July 24, 2025 02:41
Co-authored-by: Aaron Dou <yzdou@amazon.com>
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
…essing

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

Warning

Gemini is unable to generate a review due to a potential policy violation.

@mergify mergify bot added the documentation, ci/build, and v1 labels Jul 24, 2025
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
@elaineyz elaineyz marked this pull request as ready for review July 24, 2025 02:50

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mrinalks mrinalks added the aws-neuron label Jul 24, 2025
@aarondou

"kernel_kv_tile_size": 4096,
},
"skip_warmup": True,
"save_sharded_checkpoint": True,


I would remove this, as it is optional for the purpose of the example; it is off by default.

Contributor Author


ack

"kernel_q_tile_size": 128,
"kernel_kv_tile_size": 4096,
},
"skip_warmup": True,


Could we test whether this can be removed?

Contributor Author


Yes, inference still works correctly with warmup enabled, but I got very verbose error logs and warmup took 7 minutes, so I'm keeping this line for now.

"""
This example is used to illustrate the usage when chunked prefill is enabled.
To run it, you need to set DISABLE_NEURON_CUSTOM_SCHEDULER=1 when with Neuron
plugin installed.


Let's add a comment noting that chunked prefill doesn't support LNC2, so it can only run on trn1, or on trn2 with LNC1.

Contributor Author


Good catch. I added "logical_nc_config": 1 directly to the override_neuron_config with a comment.
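Putting the snippets from these threads together, the example's configuration plausibly looks like the sketch below (the model name is a placeholder, and the exact nesting of the kernel tile-size keys may differ from the real example file):

import os

# Per the example's docstring: run with the Neuron plugin installed but
# its custom scheduler disabled.
os.environ["DISABLE_NEURON_CUSTOM_SCHEDULER"] = "1"

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-1B",  # placeholder model
    enable_chunked_prefill=True,
    override_neuron_config={
        "logical_nc_config": 1,  # chunked prefill does not support LNC2
        "skip_warmup": True,  # warmup works but is slow and very verbose (see above)
        "kernel_q_tile_size": 128,
        "kernel_kv_tile_size": 4096,
    },
)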


outputs = llm.generate(prompts, sampling_params)

expected_outputs = [


Are the outputs deterministic? Don't we need to set the random seed?

Contributor Author


Yes, the outputs should be deterministic since we're using greedy sampling.
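For reference, greedy decoding in vLLM is selected with a zero temperature, which is what makes the exact-match check against expected_outputs sound (a minimal illustration, not necessarily the example's exact parameters):

from vllm import SamplingParams

# temperature=0.0 picks the argmax token at every step, so repeated runs
# yield identical text without fixing a random seed.
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)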

Comment on lines +54 to +60
else:
parallel_config.worker_cls = \
"vllm.worker.neuron_worker.NeuronWorker"
if vllm_config.cache_config and vllm_config.model_config:
# neuron needs block_size = max_model_len
vllm_config.cache_config.block_size = \
vllm_config.model_config.max_model_len # type: ignore


This branch will need to be removed (in this PR or a follow-up), given that the V0 code paths will be gone.

Contributor Author


I'd prefer to leave the V0 deletion to a separate PR for simplicity.
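For readers following the thread, the platform hook under discussion plausibly takes the shape below once both engines are handled; only the else branch is quoted above, and the V1 worker class path here is an assumption for illustration:

import vllm.envs as envs

def check_and_update_config(vllm_config) -> None:
    # Sketch of the Neuron platform hook; not the PR's exact code.
    parallel_config = vllm_config.parallel_config
    if envs.VLLM_USE_V1:
        # Assumed V1 module path, for illustration only.
        parallel_config.worker_cls = \
            "vllm.v1.worker.neuron_worker.NeuronWorker"
    else:
        parallel_config.worker_cls = \
            "vllm.worker.neuron_worker.NeuronWorker"
        if vllm_config.cache_config and vllm_config.model_config:
            # Neuron V0 requires block_size == max_model_len
            vllm_config.cache_config.block_size = \
                vllm_config.model_config.max_model_len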

neuron_config)

override_neuron_config = model_config.override_neuron_config
architecture, num_key_value_heads, head_dim = _get_model_configs(


would fix

Contributor Author


Fixed all mypy warnings/errors.

)
}

def _update_states(self, scheduler_output: "SchedulerOutput") -> bool:


I would add a note that this is identical to the GPU version (the TPU one as well?).
Nonetheless, I feel this could potentially be moved into the base model runner.

Contributor Author


makes sense, will add a note
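For readers unfamiliar with the V1 model runners, here is an illustrative paraphrase of what _update_states is responsible for (heavily simplified, not the PR's actual code; the SchedulerOutput field names shown are the ones the V1 scheduler commonly exposes):

def _update_states(self, scheduler_output) -> bool:
    """Reconcile cached request state with the latest SchedulerOutput.

    Returns True if the persistent batch changed and the model inputs
    need to be rebuilt.
    """
    changed = False
    # Drop requests the scheduler reports as finished.
    for req_id in scheduler_output.finished_req_ids:
        if self.requests.pop(req_id, None) is not None:
            changed = True
    # Register newly scheduled requests.
    for new_req in scheduler_output.scheduled_new_reqs:
        self.requests[new_req.req_id] = new_req
        changed = True
    # Advance per-request token progress for this step.
    for req_id, num_tokens in scheduler_output.num_scheduled_tokens.items():
        self.requests[req_id].num_computed_tokens += num_tokens
    return changed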

elaineyz added 2 commits July 25, 2025 00:12
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Labels
aws-neuron, ci/build, documentation, v1
3 participants