[V1][Neuron] Neuron chunked prefill V1 impl #21490
Conversation
"kernel_kv_tile_size": 4096, | ||
}, | ||
"skip_warmup": True, | ||
"save_sharded_checkpoint": True, |
Would remove this, as it is optional for the purpose of the example; it is off by default.
ack
"kernel_q_tile_size": 128, | ||
"kernel_kv_tile_size": 4096, | ||
}, | ||
"skip_warmup": True, |
could we test if this can be removed?
Yes, inference still works correctly with warmup enabled, but I got very verbose error logs and warmup took 7 minutes, so I'm keeping this line for now.
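For reference, a minimal sketch of how the example's override might end up after the two threads above; the enclosing "chunked_prefill_config" key name is an assumption, since the diff fragments only show the nested fields:

```python
override_neuron_config = {
    # Enclosing key name is an assumption; the diff only shows these fields.
    "chunked_prefill_config": {
        "kernel_q_tile_size": 128,
        "kernel_kv_tile_size": 4096,
    },
    # Kept per the thread above: warmup currently emits verbose error logs
    # and takes ~7 minutes, so it stays disabled for now.
    "skip_warmup": True,
    # "save_sharded_checkpoint" dropped per review: optional, off by default.
}
```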
""" | ||
This example is used to illustrate the usage when chunked prefill is enabled. | ||
To run it, you need to set DISABLE_NEURON_CUSTOM_SCHEDULER=1 when with Neuron | ||
plugin installed. |
Let's add a comment noting that chunked prefill doesn't support LNC2, so it can only run on trn1, or on trn2 with LNC1.
Good catch. I added "logical_nc_config": 1 directly to the override_neuron_config with a comment.
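A sketch of the described addition, assuming it goes in the same override_neuron_config dict as the example:

```python
override_neuron_config = {
    # Chunked prefill does not support LNC2, so run on trn1, or on trn2
    # with the logical NeuronCore config pinned to 1.
    "logical_nc_config": 1,
    # ... remaining overrides from the example ...
}
```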
outputs = llm.generate(prompts, sampling_params)

expected_outputs = [
Are the outputs deterministic? Don't we need to set the random seed?
Yes, the outputs should be deterministic since we're using greedy sampling.
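For illustration, a minimal sketch of greedy sampling in vLLM (the max_tokens value is a placeholder):

```python
from vllm import SamplingParams

# temperature=0.0 selects greedy decoding: vLLM takes the argmax token at
# every step, so outputs are reproducible without setting a random seed.
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
```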
else:
    parallel_config.worker_cls = \
        "vllm.worker.neuron_worker.NeuronWorker"
    if vllm_config.cache_config and vllm_config.model_config:
        # neuron needs block_size = max_model_len
        vllm_config.cache_config.block_size = \
            vllm_config.model_config.max_model_len  # type: ignore
This branch will need to be removed (in this PR or a follow-up) given that the v0 code paths will be gone.
Prefer to leave v0 deletion to a separate PR for simplicity
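For context, a hypothetical sketch of what this hook could reduce to once the V0 paths are deleted; the V1 worker module path is an assumption, not code from this PR:

```python
# Hypothetical: after V0 removal only the V1 Neuron worker would remain, and
# the block_size = max_model_len override would be deleted along with the
# V0 branch above.
parallel_config.worker_cls = "vllm.v1.worker.neuron_worker.NeuronWorker"
```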
    neuron_config)

override_neuron_config = model_config.override_neuron_config
architecture, num_key_value_heads, head_dim = _get_model_configs(
Would fix these mypy warnings.
fixed all mypy warnings/errors
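As an illustration of the kind of annotation involved, a sketch of a typed signature for _get_model_configs; the exact signature is assumed from how the call site above unpacks the result:

```python
from vllm.config import ModelConfig

def _get_model_configs(model_config: ModelConfig) -> tuple[str, int, int]:
    """Return (architecture, num_key_value_heads, head_dim) for the model."""
    ...
```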
    )
}

def _update_states(self, scheduler_output: "SchedulerOutput") -> bool:
Would add a note that this is identical to the GPU model runner (TPU as well?). Nonetheless, I feel this could potentially be moved into the base model runner.
makes sense, will add a note
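A sketch of what that note might look like; the wording is illustrative:

```python
def _update_states(self, scheduler_output: "SchedulerOutput") -> bool:
    """Update the cached request states with the scheduler output.

    NOTE: identical to the GPU model runner's _update_states (the TPU runner
    carries a similar copy); a follow-up could hoist this into a shared base
    model runner.
    """
    ...
```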
Purpose
This is the first PR to add support for NxD Inference on the vLLM V1 architecture.
This PR offers a native implementation of Chunked Prefill on Neuron as described in Change 1 of RFC #21082.
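For readers unfamiliar with the feature, chunked prefill is toggled through the standard vLLM engine arguments; a minimal sketch, with a placeholder model name and token budget:

```python
from vllm import LLM, SamplingParams

# Chunked prefill splits long prompt (prefill) work into bounded chunks so it
# can be batched alongside decode steps; max_num_batched_tokens caps each step.
llm = LLM(
    model="meta-llama/Llama-3.1-8B",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # per-step token budget
)
outputs = llm.generate(["Hello, Neuron!"], SamplingParams(temperature=0.0))
```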
Test Plan
An E2E test for chunked prefill is added. Note that the previous E2E tests meant to run on the vLLM V0 engine have been removed from the Neuron Buildkite script in anticipation of V0 deprecation.
Test Result
Manually ran the test locally on a Trainium instance; the test passed.
(Optional) Documentation Update