[V1][Neuron] Neuron chunked prefill V1 impl #21490


Open

elaineyz wants to merge 6 commits into base: main

Conversation

elaineyz (Contributor) commented Jul 24, 2025

Purpose

This is the first PR to add support for NxD Inference on the vLLM V1 architecture.

This PR offers a native implementation of Chunked Prefill on Neuron, as described in Change 1 of RFC #21082.
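For context, enabling the feature from the user side looks roughly like the sketch below (the model name and token budgets are illustrative, not taken from this PR; enable_chunked_prefill and max_num_batched_tokens are existing vLLM engine arguments):

from vllm import LLM

# Chunked prefill splits long prompts into chunks of at most
# max_num_batched_tokens tokens, so prefill work can be batched
# together with ongoing decode steps.
llm = LLM(
    model="meta-llama/Llama-3.2-1B",  # illustrative model
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,  # illustrative chunk budget
    max_model_len=2048,
)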

Test Plan

An E2E test for chunked prefill is added. Note that the previous E2E tests, which targeted the vLLM V0 engine, have been removed from the Neuron Buildkite script in anticipation of the V0 deprecation.
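A plausible shape for such an E2E test, sketched here for illustration (the model, prompt, and assertions are assumptions, not the PR's actual test):

from vllm import LLM, SamplingParams

def test_chunked_prefill_e2e():
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
        enable_chunked_prefill=True,
        max_num_batched_tokens=128,  # small budget so prompts actually get chunked
        max_model_len=1024,
    )
    prompts = ["The capital of France is"]  # illustrative prompt
    # Greedy sampling makes the output deterministic, so an exact-match
    # comparison against reference completions is valid.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16)
    outputs = llm.generate(prompts, sampling_params)
    # The real test would compare against expected strings here.
    assert outputs[0].outputs[0].text.strip()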

Test Result

Manually ran the test locally on a Trainium instance, and the test passed.

(Optional) Documentation Update

elaineyz and others added 3 commits July 24, 2025 02:41
Co-authored-by: Aaron Dou <yzdou@amazon.com>
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
…essing

Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>

Warning

Gemini is unable to generate a review due to a potential policy violation.

@mergify mergify bot added the documentation, ci/build, and v1 labels Jul 24, 2025
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
@elaineyz elaineyz marked this pull request as ready for review July 24, 2025 02:50

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mrinalks mrinalks added the aws-neuron label Jul 24, 2025
@aarondou

"kernel_kv_tile_size": 4096,
},
"skip_warmup": True,
"save_sharded_checkpoint": True,


I would remove this, as it is optional for the purpose of the example; it is off by default.

Contributor Author


ack

"kernel_q_tile_size": 128,
"kernel_kv_tile_size": 4096,
},
"skip_warmup": True,


Could we test whether this can be removed?

Contributor Author


Yes, inference still works correctly with warmup enabled, but I got very verbose error logs and warmup took 7 minutes, so I'm keeping this line for now.

"""
This example is used to illustrate the usage when chunked prefill is enabled.
To run it, you need to set DISABLE_NEURON_CUSTOM_SCHEDULER=1 when with Neuron
plugin installed.


Let's add a comment noting that chunked prefill doesn't support LNC2, so it can only run on trn1, or on trn2 with LNC1.

Contributor Author


Good catch. I added "logical_nc_config": 1 directly to the override_neuron_config with a comment.
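Putting the snippets from these threads together, the example's configuration plausibly looks like the sketch below (the model name is a placeholder, and the exact nesting of the kernel tile-size keys may differ from the real example file):

import os

# Per the example's docstring: run with the Neuron plugin installed but
# its custom scheduler disabled.
os.environ["DISABLE_NEURON_CUSTOM_SCHEDULER"] = "1"

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-1B",  # placeholder model
    enable_chunked_prefill=True,
    override_neuron_config={
        "logical_nc_config": 1,  # chunked prefill does not support LNC2
        "skip_warmup": True,  # warmup works but is slow and very verbose (see above)
        "kernel_q_tile_size": 128,
        "kernel_kv_tile_size": 4096,
    },
)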


outputs = llm.generate(prompts, sampling_params)

expected_outputs = [


Are the outputs deterministic? Don't we need to set the random seed?

Contributor Author


Yes, the outputs should be deterministic since we're using greedy sampling.
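For reference, greedy decoding in vLLM is selected with a zero temperature, which is what makes the exact-match check against expected_outputs sound (a minimal illustration, not necessarily the example's exact parameters):

from vllm import SamplingParams

# temperature=0.0 picks the argmax token at every step, so repeated runs
# yield identical text without fixing a random seed.
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)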

Comment on lines +54 to +60
else:
parallel_config.worker_cls = \
"vllm.worker.neuron_worker.NeuronWorker"
if vllm_config.cache_config and vllm_config.model_config:
# neuron needs block_size = max_model_len
vllm_config.cache_config.block_size = \
vllm_config.model_config.max_model_len # type: ignore


This branch will need to be removed (in this PR or a follow-up), given that the V0 code paths will be gone.

Contributor Author


I'd prefer to leave the V0 deletion to a separate PR for simplicity.
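For readers following the thread, the platform hook under discussion plausibly takes the shape below once both engines are handled; only the else branch is quoted above, and the V1 worker class path here is an assumption for illustration:

import vllm.envs as envs

def check_and_update_config(vllm_config) -> None:
    # Sketch of the Neuron platform hook; not the PR's exact code.
    parallel_config = vllm_config.parallel_config
    if envs.VLLM_USE_V1:
        # Assumed V1 module path, for illustration only.
        parallel_config.worker_cls = \
            "vllm.v1.worker.neuron_worker.NeuronWorker"
    else:
        parallel_config.worker_cls = \
            "vllm.worker.neuron_worker.NeuronWorker"
        if vllm_config.cache_config and vllm_config.model_config:
            # Neuron V0 requires block_size == max_model_len
            vllm_config.cache_config.block_size = \
                vllm_config.model_config.max_model_len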

neuron_config)

override_neuron_config = model_config.override_neuron_config
architecture, num_key_value_heads, head_dim = _get_model_configs(


would fix

Contributor Author


Fixed all mypy warnings/errors.

)
}

def _update_states(self, scheduler_output: "SchedulerOutput") -> bool:


I would add a note that this is identical to the GPU version (the TPU one as well?).
Nonetheless, I feel this could potentially be moved into the base model runner.

Contributor Author


makes sense, will add a note
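For readers unfamiliar with the V1 model runners, here is an illustrative paraphrase of what _update_states is responsible for (heavily simplified, not the PR's actual code; the SchedulerOutput field names shown are the ones the V1 scheduler commonly exposes):

def _update_states(self, scheduler_output) -> bool:
    """Reconcile cached request state with the latest SchedulerOutput.

    Returns True if the persistent batch changed and the model inputs
    need to be rebuilt.
    """
    changed = False
    # Drop requests the scheduler reports as finished.
    for req_id in scheduler_output.finished_req_ids:
        if self.requests.pop(req_id, None) is not None:
            changed = True
    # Register newly scheduled requests.
    for new_req in scheduler_output.scheduled_new_reqs:
        self.requests[new_req.req_id] = new_req
        changed = True
    # Advance per-request token progress for this step.
    for req_id, num_tokens in scheduler_output.num_scheduled_tokens.items():
        self.requests[req_id].num_computed_tokens += num_tokens
    return changed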

elaineyz added 2 commits July 25, 2025 00:12
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Labels
aws-neuron, ci/build, documentation, v1
3 participants