[BugFix] Fix KVConnector TP worker aggregation #21473
Conversation
Signed-off-by: Nick Hill <nhill@redhat.com>
Code Review
This pull request fixes an issue with KVConnector TP worker aggregation for non-Ray pipeline-decoupled setups. The changes primarily involve adjusting the return logic in gpu_worker.py for non-last pipeline parallel ranks to correctly handle KV cache transfer status. A potential issue was identified where the code could modify a shared mutable global constant; a code suggestion has been provided to address this and make the implementation more robust.
```python
empty_output.finished_sending = output.finished_sending
empty_output.finished_recving = output.finished_recving
output = empty_output
new_output = copy.copy(new_output)
```
this needs to be a deepcopy right?
since we updated the value of EMPTY_MODEL_RUNNER_OUTPUT?
I don't completely follow what you mean, but I'm pretty sure it doesn't need to be a deep copy. We are just copying because we don't want to modify the shared EMPTY_MODEL_RUNNER_OUTPUT itself.
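For context, a minimal standalone sketch of why a shallow copy suffices here (FakeModelRunnerOutput and its fields are simplified stand-ins, not vLLM's actual ModelRunnerOutput): the code rebinds fields on the copy rather than mutating nested objects in place, so the shared constant is never touched.

```python
import copy
from dataclasses import dataclass


@dataclass
class FakeModelRunnerOutput:
    # Hypothetical stand-ins for the finished_sending / finished_recving fields.
    finished_sending: set[str] | None = None
    finished_recving: set[str] | None = None


EMPTY_OUTPUT = FakeModelRunnerOutput()  # stand-in for EMPTY_MODEL_RUNNER_OUTPUT

new_output = copy.copy(EMPTY_OUTPUT)     # shallow copy of the shared constant
new_output.finished_sending = {"req-1"}  # rebinds the field on the copy only
new_output.finished_recving = {"req-2"}

assert EMPTY_OUTPUT.finished_sending is None  # shared constant is unchanged
assert EMPTY_OUTPUT.finished_recving is None
```

A deep copy would only matter if the code mutated a container already referenced by the shared constant in place (e.g. `finished_sending.add(...)`), which it does not.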
```python
output = new_output

assert isinstance(output, ModelRunnerOutput)
# return output only from the driver worker
return output if self.is_driver_worker else None
return output
```
This change modifies the original behavior: if no KV connector is present, non-driver workers in the last PP rank now return output. I'm not certain this has any practical impact, though, since under MultiprocExecutor only the worker with output_rank sends its output back via WorkerProc.
Suggested fix:

```python
        return new_output

    assert isinstance(output, ModelRunnerOutput)
    return_output = self.is_driver_worker or has_kv_transfer_group()
    return output if return_output else None
```
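To make the effect of the suggested condition concrete, a small illustrative sketch (the worker_return helper below is hypothetical, not code from this PR; it reduces is_driver_worker and has_kv_transfer_group() to plain booleans):

```python
def worker_return(output, is_driver_worker: bool, has_kv_transfer_group: bool):
    # The driver worker always returns its output; a non-driver worker returns
    # only when a KV connector is configured, so the executor can aggregate the
    # per-worker finished_sending / finished_recving transfer status.
    return_output = is_driver_worker or has_kv_transfer_group
    return output if return_output else None


# Non-driver worker without a KV connector keeps the original behavior:
assert worker_return("out", False, False) is None
# With a KV connector, the non-driver worker's output is surfaced for aggregation:
assert worker_return("out", False, True) == "out"
```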
The Ray PD (prefill/decode disaggregation) compatibility fix #21072 broke non-Ray TP PD.
Discovered by @robertgshaw2-redhat