-
-
Notifications
You must be signed in to change notification settings - Fork 8.9k
v1/offloading: Add worker-side CPU support #21448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces worker-side support for CPU offloading, a significant feature enhancement. The changes include new abstractions for offloading, a worker queue manager for handling asynchronous transfers, and the necessary CPU-specific logic for tensor creation and data movement. The implementation is accompanied by a comprehensive set of tests that cover both the transfer logic and the asynchronous worker management.
My review has identified a couple of high-severity issues. One is a misleading docstring in the OffloadingManager
abstract class that could lead to incorrect implementations. The other is the use of __del__
for resource cleanup in OffloadingQueueManager
, which is unreliable and could lead to resource leaks. Addressing these points will improve the robustness and maintainability of the new offloading framework. Overall, this is a well-structured contribution.
This commit adds a new offloading component, composed of: 1. A scheduler side OffloadingManager (abstract) which kicks-off KV data transfers and keeps track of offloaded data. 2. A worker side OffloadingQueueManager which asynchronously manages KV transfers. Signed-off-by: Or Ozeri <oro@il.ibm.com>
4a19c04
to
a625bee
Compare
This commit adds worker-side support for CPU offloading. It uses the swap_blocks function to perform the actual copying between GPU and CPU. Supports any CPU block size which is divided by the GPU block size. Signed-off-by: Or Ozeri <oro@il.ibm.com>
a625bee
to
86a96a4
Compare
This PR adds worker-side support for CPU offloading.
It uses the
swap_blocks
function to perform the actual copying between CPU and GPU.Supports any
cpu_block_size
which is divided bygpu_block_size
.Part of the work described in RFC #19854
Depends on #19848