### Motivation.

This RFC proposes changes to the vLLM plugin infrastructure to address compatibility issues, simplify out-of-tree code maintenance, and improve the overall user experience.
#### Problem Statement

- **Registering and Dispatching Semi-Custom Operators:**
  - Vendors need a mechanism to register and dispatch custom implementations for generic layers like `RotaryEmbedding` and `RMSNorm` without modifying standard model code.
  - Current limitations:
    - The existing `forward_oot` method in `CustomOp` requires monkey-patching, which makes plugin code vulnerable to upstream changes. For example, Ascend's RoPE implementation relies on layer properties that are not guaranteed, risking breakage with any refactor.
    - There is no straightforward way to dispatch new out-of-tree operators without altering model code, as seen with `AscendFusedMoE` and custom DeepSeek models.
- **Ensuring Out-of-Tree Compatibility:**
  - Upstream changes can break plugin compatibility, placing the burden on plugin maintainers to address these issues.
  - Platform workers and model runners consume vLLM APIs that are not guaranteed to be stable (e.g. `vllm.v1.worker.gpu_input_batch.InputBatch`, `vllm.v1.sample.metadata.SamplingMetadata`, `vllm.sampling_params.SamplingType`).
  - The lack of basic upstream functionality checks for plugin compatibility leads to uncertainty about whether the latest plugin will work with the latest vLLM, forcing users to mix and match versions.
  - There are no clearly defined, immutable interfaces for communication between vLLM and out-of-tree plugins. Both "vLLM-consumed" and "plugin-consumed" API definitions are crucial for stable integration.
- **No In-Tree Visibility:**
  - Plugins live entirely out-of-tree, requiring users to clone, build, and install specific plugin repositories or packages.
  - Installing "plugin-less" vLLM via `pip install vllm` provides no skeleton for out-of-tree platform support, and upstream development is likewise unaware of these backends.
  - If a user chooses an out-of-tree backend without first installing the proper plugin, vLLM won't recognize that this platform is in fact supported via an out-of-tree plugin.
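The fragility of the monkey-patching approach described above can be illustrated with a generic sketch. The `Layer` class and its attributes below are illustrative stand-ins, not actual vLLM code:

```python
# Generic illustration of why monkey-patching an upstream class is fragile.

class Layer:
    """Stand-in for an upstream op implementation."""

    def __init__(self):
        self.scale = 2.0  # internal attribute, not a stable API

    def forward(self, x):
        return x * self.scale


def patched_forward(self, x):
    # The patch reaches into `self.scale`. If upstream renames it to
    # `self._scale` in a refactor, this breaks with an AttributeError,
    # even though Layer's public behavior is unchanged.
    return x * self.scale + 1.0


Layer.forward = patched_forward  # monkey-patch applied by a plugin
print(Layer().forward(3.0))      # 7.0
```

Nothing in upstream's test suite guards the `scale` attribute, so the breakage only surfaces when the plugin runs against a newer vLLM.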
### Proposed Change.

- **Registering and Dispatching Semi-Custom Operators:**
  - Implement robust registration and automatic dispatch of out-of-tree operators. Registration and dispatch logic with support for overriding already exists for models (e.g. `register_model`) and can be leveraged by vendors as a template.
  - The integration could look as follows:

    ```python
    from vllm import OperatorRegistry


    def register_operators():
        OperatorRegistry.register_operator(
            "rotary_embedding",
            "vllm_plugin.custom_ops:CustomRotaryEmbedding")
        OperatorRegistry.register_operator(
            "rms_norm",
            "vllm_plugin.custom_ops:CustomRMSNorm")
    ```
  - The defined operators are expected to match the APIs of their corresponding ops, which can easily be achieved via inheritance, e.g.:

    ```python
    from vllm.logger import init_logger
    from vllm.model_executor.layers.layernorm import RMSNorm
    from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding

    logger = init_logger(__name__)


    class CustomRMSNorm(RMSNorm):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Additional initialization for the dummy platform can go here.
            logger.warning("CustomRMSNorm initialized.")

        def forward(self, *args, **kwargs):
            logger.warning("CustomRMSNorm fwd pass.")
            return super().forward(*args, **kwargs)


    class CustomRotaryEmbedding(RotaryEmbedding):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Additional initialization for the dummy platform can go here.
            logger.warning("CustomRotaryEmbedding initialized.")

        def forward(self, *args, **kwargs):
            logger.warning("CustomRotaryEmbedding fwd pass.")
            return super().forward(*args, **kwargs)
    ```
  - The difference between this approach and `forward_oot` is that vendors have the freedom to completely override the existing initialization logic, perform additional sanity checks, and explicitly define the class attributes used in forward passes.
  - Drawbacks:
    - This will require a refactor of the model code: models should fetch the operator class by name from the registry rather than using the class from `vllm.model_executor.layers` directly.
    - All in-tree operators must be guaranteed to be registered before any out-of-tree registration occurs.
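For illustration, a minimal sketch of what such a registry and the registry-based lookup in model code could look like. `OperatorRegistry` as proposed here does not exist in vLLM yet; all names and the `"module:Class"` path format are assumptions, and stdlib classes stand in for the real layer classes so the sketch stays runnable:

```python
from importlib import import_module


class OperatorRegistry:
    """Hypothetical registry mapping operator names to class paths."""

    _ops: dict[str, str] = {}

    @classmethod
    def register_operator(cls, name: str, path: str) -> None:
        # Last registration wins, so out-of-tree registrations that run
        # after the in-tree defaults transparently override them.
        cls._ops[name] = path

    @classmethod
    def resolve(cls, name: str):
        # Lazy import: the implementing module is only loaded when a
        # model actually requests the operator.
        module_name, _, class_name = cls._ops[name].partition(":")
        return getattr(import_module(module_name), class_name)


# In-tree default registered first (stdlib classes stand in for the
# real vLLM layer classes here):
OperatorRegistry.register_operator("rms_norm", "fractions:Fraction")
# A plugin later overrides the same name:
OperatorRegistry.register_operator("rms_norm", "decimal:Decimal")

# Model code fetches the class by name instead of importing it directly:
rms_norm_cls = OperatorRegistry.resolve("rms_norm")
print(rms_norm_cls.__name__)  # Decimal
```

The lazy, string-based registration keeps plugin import costs near zero, but it also makes the ordering drawback explicit: whichever registration runs last wins, so in-tree defaults must be registered before plugin entry points fire.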
- **Defined Platform Stubs and APIs:**
  - Provide clearly defined API stubs in upstream (e.g. `vllm.platforms.hpu`, `vllm.platforms.npu`) for plugin integration. The `Platform` interface is a very good starting point, but vLLM should also be aware of platforms that have established plugins and keep a lightweight "manifest" of them, extensible by out-of-tree code.
  - In-tree stub support with out-of-tree implementation.
  - Require specific, versioned, strongly-typed definitions of exposed APIs and dataclasses, covering both consumed APIs (e.g. `vllm.v1.worker.gpu_input_batch.InputBatch`, `vllm.v1.sample.metadata.SamplingMetadata`, `vllm.sampling_params.SamplingType`) and produced APIs (e.g. `vllm_plugin.v1.worker.init_device`, `determine_available_memory`).
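A minimal sketch of what such a lightweight platform manifest might look like. Every field name here is an illustrative assumption; the proposal only requires that some such manifest live in-tree and be extensible by plugins:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformManifest:
    """Hypothetical in-tree record of an out-of-tree platform."""

    name: str                       # e.g. "npu"
    plugin_package: str             # pip package providing the backend
    entry_point: str                # module registering the Platform class
    min_vllm_version: str = "0.0"   # oldest vLLM the plugin supports

    def install_hint(self) -> str:
        # Message vLLM could print when the platform is selected but the
        # plugin is not installed, addressing the "No In-Tree Visibility"
        # problem above.
        return (f"Platform '{self.name}' is supported via the out-of-tree "
                f"plugin '{self.plugin_package}'. Install it with: "
                f"pip install {self.plugin_package}")


npu = PlatformManifest(
    name="npu",
    plugin_package="vllm-ascend",
    entry_point="vllm_ascend.platform",
)
print(npu.install_hint())
```

Because the manifest carries no code, shipping it in `pip install vllm` costs nothing, while giving plugin-less installs enough information to point users at the right package.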
- **CI/Code Scanning Checks:**
  - Implement basic CI/code scanning checks in upstream for plugin compatibility.
  - Once platform stubs and produced/consumed APIs are defined, extensions can be covered by static code checks, and mypy should detect incompatibilities.
  - A more ideal, long-term solution: actual unit test execution on the pluggable platforms in vLLM PRs.
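One way the static check could work, sketched under the assumption that a produced worker API is published as a `typing.Protocol`. The names below are illustrative, not existing vLLM interfaces:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class WorkerAPI(Protocol):
    """Illustrative 'produced API' contract a plugin worker must satisfy."""

    def init_device(self) -> None: ...

    def determine_available_memory(self) -> int: ...


class PluginWorker:
    """An out-of-tree worker implementation. If a refactor removed or
    retyped one of the required methods, mypy would flag the annotated
    assignment below without running any plugin code."""

    def init_device(self) -> None:
        pass

    def determine_available_memory(self) -> int:
        return 8 * 1024 ** 3  # bytes


# Static checkers verify this structural assignment; runtime_checkable
# additionally allows a cheap isinstance() sanity check in CI:
worker: WorkerAPI = PluginWorker()
print(isinstance(worker, WorkerAPI))  # True
```

A structural `Protocol` keeps the plugin free of any import-time dependency on vLLM internals, which is exactly the decoupling the versioned API definitions are meant to provide.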
### Feedback Period.

24/7
### CC List.

@simon-mo @WoosukKwon @xuechendi
### Any Other Things.

The problem statements and proposed solutions are ordered by priority: it is most crucial to address the points at the beginning (the custom ops). The proposed solutions for the later points may evolve over time based on feedback.