### Motivation.

This RFC proposes changes to the vLLM plugin infrastructure to address compatibility issues, simplify out-of-tree code maintenance, and improve the overall user experience.
#### Problem Statement

- **Registering and Dispatching Semi-Custom Operators:**
  - Vendors need a mechanism to register and dispatch custom implementations for generic layers like `RotaryEmbedding` and `RMSNorm` without modifying standard model code.
  - Current limitations:
    - The existing `forward_oot` method in `CustomOp` requires monkey-patching, which makes plugin code vulnerable to upstream changes. For example, Ascend's RoPE implementation relies on layer properties that are not guaranteed, risking breakage with any refactor.
    - There is no straightforward way to dispatch new out-of-tree operators without altering model code, as seen with `AscendFusedMoE` and custom DeepSeek models.
- **Ensuring Out-of-Tree Compatibility:**
  - Upstream changes can break plugin compatibility, placing the burden on plugin maintainers to address these issues.
  - Platform workers and model runners consume vLLM APIs that are not guaranteed to be stable (e.g. `vllm.v1.worker.gpu_input_batch.InputBatch`, `vllm.v1.sample.metadata.SamplingMetadata`, `vllm.sampling_params.SamplingType`).
  - The lack of basic upstream functionality checks for plugin compatibility leads to uncertainty about whether the latest plugin will work with the latest vLLM, forcing users to mix and match versions.
  - There are no clearly defined, immutable interfaces for communication between vLLM and out-of-tree plugins. Both "vLLM-consumed" and "plugin-consumed" API definitions are crucial for stable integration.
- **No In-Tree Visibility:**
  - Plugins live entirely out-of-tree, requiring users to clone, build, and install specific plugin repositories or packages.
  - Installing "plugin-less" vLLM via `pip install vllm` provides no skeleton for out-of-tree platform support, and upstream development is likewise unaware of these backends.
  - If a user chooses an out-of-tree backend without first installing the proper plugin, vLLM won't recognize that this platform is in fact supported via an out-of-tree plugin.
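The fragility of the monkey-patching approach described above can be illustrated with a generic sketch. The `Layer` class and its attributes below are illustrative stand-ins, not actual vLLM code:

```python
# Generic illustration of why monkey-patching an upstream class is fragile.

class Layer:
    """Stand-in for an upstream op implementation."""

    def __init__(self):
        self.scale = 2.0  # internal attribute, not a stable API

    def forward(self, x):
        return x * self.scale


def patched_forward(self, x):
    # The patch reaches into `self.scale`. If upstream renames it to
    # `self._scale` in a refactor, this breaks with an AttributeError,
    # even though Layer's public behavior is unchanged.
    return x * self.scale + 1.0


Layer.forward = patched_forward  # monkey-patch applied by a plugin
print(Layer().forward(3.0))      # 7.0
```

Nothing in upstream's test suite guards the `scale` attribute, so the breakage only surfaces when the plugin runs against a newer vLLM.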
### Proposed Change.

- **Registering and Dispatching Semi-Custom Operators:**
  - Implement robust registration and automatic dispatch of out-of-tree operators. Registration and dispatch logic with support for overriding already exists for models (e.g. `register_model`) and can be leveraged by vendors as a template.
  - The integration could look as follows:

    ```python
    from vllm import OperatorRegistry


    def register_operators():
        OperatorRegistry.register_operator(
            "rotary_embedding",
            "vllm_plugin.custom_ops:CustomRotaryEmbedding")
        OperatorRegistry.register_operator(
            "rms_norm",
            "vllm_plugin.custom_ops:CustomRMSNorm")
    ```
  - The defined operators are expected to match the APIs of their corresponding ops, which can easily be achieved via inheritance, e.g.:

    ```python
    from vllm.logger import init_logger
    from vllm.model_executor.layers.layernorm import RMSNorm
    from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding

    logger = init_logger(__name__)


    class CustomRMSNorm(RMSNorm):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Additional initialization for the dummy platform can go here.
            logger.warning("CustomRMSNorm initialized.")

        def forward(self, *args, **kwargs):
            logger.warning("CustomRMSNorm fwd pass.")
            return super().forward(*args, **kwargs)


    class CustomRotaryEmbedding(RotaryEmbedding):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Additional initialization for the dummy platform can go here.
            logger.warning("CustomRotaryEmbedding initialized.")

        def forward(self, *args, **kwargs):
            logger.warning("CustomRotaryEmbedding fwd pass.")
            return super().forward(*args, **kwargs)
    ```
  - The difference between this approach and `forward_oot` is that vendors have the freedom to completely override the existing initialization logic, perform additional sanity checks, and explicitly define the class attributes used in forward passes.
  - Drawbacks:
    - This will require a refactor of the model code: models should fetch the operator class by name from the registry rather than using the class from `vllm.model_executor.layers` directly.
    - All in-tree operators must be guaranteed to be registered before any out-of-tree registration occurs.
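For illustration, a minimal sketch of what such a registry and the registry-based lookup in model code could look like. `OperatorRegistry` as proposed here does not exist in vLLM yet; all names and the `"module:Class"` path format are assumptions, and stdlib classes stand in for the real layer classes so the sketch stays runnable:

```python
from importlib import import_module


class OperatorRegistry:
    """Hypothetical registry mapping operator names to class paths."""

    _ops: dict[str, str] = {}

    @classmethod
    def register_operator(cls, name: str, path: str) -> None:
        # Last registration wins, so out-of-tree registrations that run
        # after the in-tree defaults transparently override them.
        cls._ops[name] = path

    @classmethod
    def resolve(cls, name: str):
        # Lazy import: the implementing module is only loaded when a
        # model actually requests the operator.
        module_name, _, class_name = cls._ops[name].partition(":")
        return getattr(import_module(module_name), class_name)


# In-tree default registered first (stdlib classes stand in for the
# real vLLM layer classes here):
OperatorRegistry.register_operator("rms_norm", "fractions:Fraction")
# A plugin later overrides the same name:
OperatorRegistry.register_operator("rms_norm", "decimal:Decimal")

# Model code fetches the class by name instead of importing it directly:
rms_norm_cls = OperatorRegistry.resolve("rms_norm")
print(rms_norm_cls.__name__)  # Decimal
```

The lazy, string-based registration keeps plugin import costs near zero, but it also makes the ordering drawback explicit: whichever registration runs last wins, so in-tree defaults must be registered before plugin entry points fire.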
- **Defined Platform Stubs and APIs:**
  - Provide clearly defined API stubs in upstream (e.g. `vllm.platforms.hpu`, `vllm.platforms.npu`) for plugin integration. The `Platform` interface is a very good starting point, but vLLM should also be aware of platforms that have established plugins and keep a lightweight "manifest" of them, extensible by out-of-tree code.
  - In-tree stub support with out-of-tree implementation.
  - Require specific, versioned, strongly-typed definitions of exposed APIs and dataclasses, covering both consumed APIs (e.g. `vllm.v1.worker.gpu_input_batch.InputBatch`, `vllm.v1.sample.metadata.SamplingMetadata`, `vllm.sampling_params.SamplingType`) and produced APIs (e.g. `vllm_plugin.v1.worker.init_device`, `determine_available_memory`).
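A minimal sketch of what such a lightweight platform manifest might look like. Every field name here is an illustrative assumption; the proposal only requires that some such manifest live in-tree and be extensible by plugins:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformManifest:
    """Hypothetical in-tree record of an out-of-tree platform."""

    name: str                       # e.g. "npu"
    plugin_package: str             # pip package providing the backend
    entry_point: str                # module registering the Platform class
    min_vllm_version: str = "0.0"   # oldest vLLM the plugin supports

    def install_hint(self) -> str:
        # Message vLLM could print when the platform is selected but the
        # plugin is not installed, addressing the "No In-Tree Visibility"
        # problem above.
        return (f"Platform '{self.name}' is supported via the out-of-tree "
                f"plugin '{self.plugin_package}'. Install it with: "
                f"pip install {self.plugin_package}")


npu = PlatformManifest(
    name="npu",
    plugin_package="vllm-ascend",
    entry_point="vllm_ascend.platform",
)
print(npu.install_hint())
```

Because the manifest carries no code, shipping it in `pip install vllm` costs nothing, while giving plugin-less installs enough information to point users at the right package.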
- **CI/Code Scanning Checks:**
  - Implement basic CI/code scanning checks in upstream for plugin compatibility.
  - Once platform stubs and produced/consumed APIs are defined, extensions can be covered by static code checks, and mypy should detect incompatibilities.
  - A more ideal, long-term solution: actual unit test execution on the pluggable platforms in vLLM PRs.
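One way the static check could work, sketched under the assumption that a produced worker API is published as a `typing.Protocol`. The names below are illustrative, not existing vLLM interfaces:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class WorkerAPI(Protocol):
    """Illustrative 'produced API' contract a plugin worker must satisfy."""

    def init_device(self) -> None: ...

    def determine_available_memory(self) -> int: ...


class PluginWorker:
    """An out-of-tree worker implementation. If a refactor removed or
    retyped one of the required methods, mypy would flag the annotated
    assignment below without running any plugin code."""

    def init_device(self) -> None:
        pass

    def determine_available_memory(self) -> int:
        return 8 * 1024 ** 3  # bytes


# Static checkers verify this structural assignment; runtime_checkable
# additionally allows a cheap isinstance() sanity check in CI:
worker: WorkerAPI = PluginWorker()
print(isinstance(worker, WorkerAPI))  # True
```

A structural `Protocol` keeps the plugin free of any import-time dependency on vLLM internals, which is exactly the decoupling the versioned API definitions are meant to provide.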
### Feedback Period.

24/7
### CC List.

@simon-mo @WoosukKwon @xuechendi
### Any Other Things.

The problem statements and proposed solutions are ordered by priority: it is most crucial to address the points at the beginning (the custom ops). The proposed solutions for the later points may evolve over time based on feedback.