[1/N] Refactor platform API to reduce torch.cuda call #20751
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @jikunshang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request is the first in a series ([1/N] of RFC #20708) aimed at refactoring device-specific operations, particularly those involving torch.cuda, into a unified Platform API. My goal is to enhance multi-platform compatibility by allowing vLLM to seamlessly support different hardware backends like CPU, CUDA, ROCm, and XPU through a common interface, making the codebase more modular and maintainable.
Highlights
- Platform Abstraction: I've introduced new abstract methods (empty_cache, reset_peak_memory_stats, mem_get_info, memory_stats, memory_reserved, synchronize) to the base Platform interface (vllm/platforms/interface.py). This establishes a standardized way to interact with device-specific functionalities (see the sketch after this list).
- Multi-Platform Implementation: I've implemented these new platform-agnostic methods for the CPU, CUDA, ROCm, and XPU platforms. Each implementation encapsulates the respective device-specific calls (e.g., torch.cuda.empty_cache() for CUDA, torch.xpu.empty_cache() for XPU, or pass for CPU where applicable), centralizing device management.
- Reduced Direct Dependency: I've replaced direct torch.cuda and torch.xpu calls with current_platform calls across various core modules, including model execution, memory profiling, and worker initialization, significantly reducing direct dependencies on specific PyTorch backends and improving code portability.
- XPU-Specific Memory Handling: I've refactored XPU-specific memory-information retrieval into the XPUPlatform class, including special logic for handling mem_get_info on client GPUs, which often report memory differently than data center GPUs.
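A minimal sketch of what the resulting interface could look like; the method names follow this PR, but the bodies are simplified assumptions rather than the actual vLLM implementation:

```python
import torch


class Platform:
    """Base interface; concrete platforms override the device-specific parts."""

    @classmethod
    def empty_cache(cls) -> None:
        raise NotImplementedError

    @classmethod
    def synchronize(cls) -> None:
        raise NotImplementedError

    @classmethod
    def mem_get_info(cls) -> tuple[int, int]:
        """Return (free_bytes, total_bytes) for the current device."""
        raise NotImplementedError


class CudaPlatform(Platform):

    @classmethod
    def empty_cache(cls) -> None:
        torch.cuda.empty_cache()

    @classmethod
    def synchronize(cls) -> None:
        torch.cuda.synchronize()

    @classmethod
    def mem_get_info(cls) -> tuple[int, int]:
        return torch.cuda.mem_get_info()
```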
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This pull request is a solid step towards abstracting platform-specific device calls by introducing a Platform API. The refactoring is applied consistently across CUDA, ROCm, XPU, and CPU platforms, which improves code maintainability and extensibility.
I've identified a critical issue in the CPU platform implementation where the memory-related stubs would cause a runtime crash. I've also pointed out a minor maintainability issue regarding a magic number in the XPU implementation. Once these points are addressed, the PR will be in great shape.
```python
@classmethod
def mem_get_info(cls):
    # FIXME: impl
    return None

@classmethod
def memory_stats(cls):
    # FIXME: impl
    return None

@classmethod
def memory_reserved(cls):
    # FIXME: impl
    return None
```
The stubs for mem_get_info, memory_stats, and memory_reserved return None, which will cause runtime errors in vllm.utils.memory_profiling when running on the CPU platform.
- mem_get_info() returning None will cause a TypeError on unpacking.
- memory_stats() returning None will cause an AttributeError when .get() is called.
- memory_reserved() returning None will cause a TypeError in arithmetic operations.

These methods should return sensible default values for the CPU platform to prevent crashes. For example:
- mem_get_info: (0, 0)
- memory_stats: {}
- memory_reserved: 0
Suggested change:
```python
@classmethod
def mem_get_info(cls):
    # FIXME: impl
    # Returning (0, 0) as a placeholder for (free, total) memory.
    return (0, 0)

@classmethod
def memory_stats(cls):
    # FIXME: impl
    # Returning an empty dict as a placeholder for memory stats.
    return {}

@classmethod
def memory_reserved(cls):
    # FIXME: impl
    # Returning 0 as a placeholder for reserved memory.
    return 0
```
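For context, a simplified illustration (not the actual vllm.utils.memory_profiling code; the stats key shown is an assumption) of why the None stubs fail while the placeholder defaults degrade gracefully:

```python
# With the placeholder defaults the CPU path runs without crashing:
free_bytes, total_bytes = (0, 0)                       # mem_get_info()
allocated = {}.get("allocated_bytes.all.current", 0)   # memory_stats()
reserved = 0                                           # memory_reserved()
non_torch = reserved - allocated                       # arithmetic still works

# With None, each step raises instead:
#   free_bytes, total_bytes = None   -> TypeError (cannot unpack None)
#   None.get(...)                    -> AttributeError
#   None - allocated                 -> TypeError
```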
```python
non_torch_allocations = 128 * 1024 * 1024
free_gpu_memory = total_gpu_memory - (used_memory +
                                       non_torch_allocations)
```
This hardcoded value 128 * 1024 * 1024 is a "magic number". To improve readability and maintainability, it's better to define it as a constant. Using an uppercase variable name is a common convention for constants in Python.
Suggested change:
```python
NON_TORCH_ALLOCATIONS_BYTES = 128 * 1024 * 1024
free_gpu_memory = total_gpu_memory - (used_memory +
                                       NON_TORCH_ALLOCATIONS_BYTES)
```
Summary of Changes
Hello @jikunshang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request is the first in a series ([1/N]) aimed at refactoring the codebase to reduce direct dependencies on torch.cuda calls. The primary goal is to enhance platform agnosticism by introducing a unified Platform interface for common device operations like memory management and synchronization. This change abstracts hardware-specific calls, making it easier to support new devices (CPU, CUDA, ROCm, and XPU are covered here) and improving code maintainability by centralizing device interaction logic.
Highlights
- Platform Abstraction: Introduced a set of new abstract methods (empty_cache, reset_peak_memory_stats, mem_get_info, memory_stats, memory_reserved, synchronize) to the Platform interface (vllm/platforms/interface.py).
- Multi-Platform Support: Implemented these new platform-agnostic methods for CpuPlatform, CudaPlatform, RocmPlatform, and XpuPlatform, enabling consistent device interaction across different hardware.
- Code Refactoring: Replaced direct torch.cuda (and torch.xpu) calls with calls to current_platform across various modules, including eplb_state.py, rebalance_execute.py, fused_moe/layer.py, fp8.py, bitsandbytes_loader.py, utils/__init__.py, gpu_model_runner.py, gpu_worker.py, xpu_model_runner.py, and xpu_worker.py.
- XPU Simplification: Significantly simplified xpu_model_runner.py and xpu_worker.py by removing redundant device-specific memory profiling and synchronization logic, now handled by the generalized Platform interface.
Code Review
This pull request refactors the platform API to reduce direct calls to torch.cuda and introduces a platform abstraction layer for CPU, CUDA, ROCm, and XPU. It adds several API methods to the Platform class for memory management and synchronization. The changes aim to provide a more generic and extensible interface for different hardware platforms. The code review identified some inconsistencies where the code was not fully adhering to the new platform abstraction; those have been noted in the comments.
```diff
-        self.device_properties = torch.cuda.get_device_properties(self.device)
-        self.num_sms = self.device_properties.multi_processor_count
+        if current_platform.is_cuda():
+            self.device_properties = torch.cuda.get_device_properties(
+                self.device)
+            self.num_sms = self.device_properties.multi_processor_count
+        else:
+            self.num_sms = None
```
The else block is executed when the platform is not CUDA, but torch.cuda.get_device_properties is CUDA-specific. This could lead to errors on other platforms. It's better to use current_platform to get device properties in a platform-agnostic way.

```python
if current_platform.is_cuda():
    self.device_properties = torch.cuda.get_device_properties(
        self.device)
    self.num_sms = self.device_properties.multi_processor_count
else:
    self.num_sms = None
```
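One way to push this further would be a platform-level hook so call sites never touch torch.cuda directly. A rough sketch of that idea; get_multi_processor_count is a hypothetical helper, not an existing vLLM API:

```python
from typing import Optional

import torch


class Platform:

    @classmethod
    def get_multi_processor_count(cls, device) -> Optional[int]:
        # Hypothetical default: backends without an SM-like concept return None.
        return None


class CudaPlatform(Platform):

    @classmethod
    def get_multi_processor_count(cls, device) -> Optional[int]:
        # CUDA exposes the streaming-multiprocessor count via device properties.
        return torch.cuda.get_device_properties(device).multi_processor_count
```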
What's the test plan? If you generalize gpu_model_runner.py and gpu_worker.py to support devices other than CUDA, such as XPU, then the tests available in the vLLM test suite:
- Should not regress for CUDA - this will be verified by the existing vLLM CI
- Should start passing for XPU in the scope which previously failed, and not regress for XPU in the scope which previously passed

Can details on the 2nd bullet above be included in the PR description, following the existing vLLM PR submission guidelines?
```diff
@@ -348,7 +349,7 @@ def rearrange(self,
         time_start = None
         is_main_rank = ep_rank == 0
         if is_main_rank:
-            torch.cuda.synchronize()
+            current_platform.synchronize()
```
Here and below: the same can be done with the standard PyTorch API available starting from torch 2.6:

Suggested change:
```python
torch.accelerator.synchronize()
```

Are there actual benefits to defining a similar device abstraction at the vLLM level? Using the standard PyTorch API would help keep the vLLM code base leaner. See https://docs.pytorch.org/docs/stable/generated/torch.accelerator.synchronize.html#torch.accelerator.synchronize
I feel we'd better implement the base class Platform::synchronize() method using torch.accelerator.synchronize() and leave it for platforms to implement their own in case there are any tricks, like pytorch/pytorch#155668.
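A minimal sketch of that compromise, assuming torch >= 2.6 for torch.accelerator; this is illustrative only, not the code in this PR, and the XPU override is just an example:

```python
import torch


class Platform:

    @classmethod
    def synchronize(cls) -> None:
        # Generic default: rely on the standard accelerator API (torch >= 2.6).
        if torch.accelerator.is_available():
            torch.accelerator.synchronize()


class XPUPlatform(Platform):

    @classmethod
    def synchronize(cls) -> None:
        # Platforms can still override when device-specific handling is needed,
        # e.g. the XPU quirk referenced in pytorch/pytorch#155668.
        torch.xpu.synchronize()
```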
This might be a good compromise. Which torch version does vLLM target across device backends? Note that torch.accelerator is available from 2.6. If vLLM needs to support a wider torch range, this can be a clear reason to abstract this API at the vLLM level. Also, if you see any missing APIs in torch.accelerator, please give feedback - we are willing to take care of that at the PyTorch level.
cc @gshtras @bigPYJ1151 Please take a look, in case this PR breaks ROCm/CPU.
```python
@classmethod
def synchronize(cls):
    torch.cpu.synchronize()
```
I think just using pass is enough.
This pull request has merge conflicts that must be resolved before it can be merged.
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, such as updating supported_models.md and examples for a new model.

Purpose
[1/N] of RFC #20708.
This PR adds several APIs to the Platform class, such as empty_cache, reset_peak_memory_stats, mem_get_info, memory_stats, memory_reserved, and synchronize, which were previously called directly via torch.cuda.xxx.
Meanwhile, I add these APIs for the CPU/CUDA/ROCm/XPU platforms; I can also add them to other platforms if necessary.
Note: I didn't change the V0 code path since it may be removed, and I didn't change calls in the tests/benchmarks/examples folders yet.
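As a rough illustration of the intended usage after this refactor (call sites vary per module; mem_get_info returning a (free, total) tuple follows the review discussion above):

```python
from vllm.platforms import current_platform

# Before: backend-specific calls scattered through the code base, e.g.
#   torch.cuda.empty_cache(); torch.cuda.synchronize()
# After: the same operations go through the platform abstraction.
current_platform.empty_cache()
current_platform.synchronize()
free_bytes, total_bytes = current_platform.mem_get_info()
```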
Test Plan
Test Result
(Optional) Documentation Update