Commit 68b4a26

[Doc] Update V1 User Guide for Hardware and Models (#19474)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
1 parent b8e809a commit 68b4a26

File tree: 2 files changed, +83 -68 lines changed

docs/usage/v1_guide.md

Lines changed: 81 additions & 67 deletions
@@ -1,6 +1,8 @@
 # vLLM V1
 
-**We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.**
+!!! important
+
+    We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
 
 V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
 
@@ -32,53 +34,92 @@ Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-
 
 This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine, therefore this guide will be updated constantly as more features get supported on vLLM V1.
 
-### Supports Overview
-#### Hardware
+## Current Status
+
+For each item, our progress towards V1 support falls into one of the following states:
+
+- **🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
+- **🟢 Functional**: Fully operational, with ongoing optimizations.
+- **🚧 WIP**: Under active development.
+- **🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
+- **🟠 Delayed**: Temporarily dropped in V1 but planned to be re-introduced later.
+- **🔴 Deprecated**: Not planned for V1 unless there is strong demand.
+
+### Hardware
+
+| Hardware | Status |
+|------------|------------------------------------|
+| **NVIDIA** | <nobr>🚀</nobr> |
+| **AMD** | <nobr>🟢</nobr> |
+| **TPU** | <nobr>🟢</nobr> |
+| **CPU** | <nobr>🟢 (x86) 🟡 (MacOS) </nobr> |
+
+!!! note
+
+    More hardware platforms may be supported via plugins, e.g.:
+
+    - [vllm-ascend](https://github.com/vllm-project/vllm-ascend)
+    - [vllm-spyre](https://github.com/vllm-project/vllm-spyre)
+    - [vllm-openvino](https://github.com/vllm-project/vllm-openvino)
+
+    Please check their corresponding repositories for more details.
+
+### Models
+
+| Model Type | Status |
+|-----------------|-----------------------------------------------------------------------------------|
+| **Decoder-only Models** | <nobr>🚀 Optimized</nobr> |
+| **Encoder-Decoder Models** | <nobr>🟠 Delayed</nobr> |
+| **Embedding Models** | <nobr>🚧 WIP ([PR #16188](https://github.com/vllm-project/vllm/pull/16188))</nobr> |
+| **Mamba Models** | <nobr>🚧 WIP ([PR #19327](https://github.com/vllm-project/vllm/pull/19327))</nobr> |
+| **Multimodal Models** | <nobr>🟢 Functional</nobr> |
 
-| Hardware | Status |
-|----------|------------------------------------------|
-| **NVIDIA** | <nobr>🚀 Natively Supported</nobr> |
-| **AMD** | <nobr>🚧 WIP</nobr> |
-| **TPU** | <nobr>🚧 WIP</nobr> |
-| **CPU** | <nobr>🚧 WIP</nobr> |
+vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol,
+and the majority fall into the following categories:
 
-#### Feature / Model
+**Embedding Models**
+The initial support will be provided by [PR #16188](https://github.com/vllm-project/vllm/pull/16188).
 
-| Feature / Model | Status |
+Later, we will consider using [hidden states processor](https://github.com/vllm-project/vllm/issues/12249),
+which is based on [global logits processor](https://github.com/vllm-project/vllm/pull/13360)
+to enable simultaneous generation and embedding using the same engine instance in V1.
+
+**Mamba Models**
+Models using selective state-space mechanisms instead of standard transformer attention (e.g., `MambaForCausalLM`, `JambaForCausalLM`)
+will be supported via [PR #19327](https://github.com/vllm-project/vllm/pull/19327).
+
+**Encoder-Decoder Models**
+vLLM V1 is currently optimized for decoder-only transformers.
+Models requiring cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).
+
+For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).
+
+### Features
+
+| Feature | Status |
 |-----------------|-----------------------------------------------------------------------------------|
-| **Prefix Caching** | <nobr>🚀 Optimized</nobr> |
-| **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
+| **Prefix Caching** | <nobr>🚀 Optimized</nobr> |
+| **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
 | **LoRA** | <nobr>🚀 Optimized</nobr> |
 | **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
-| **Multimodal Models** | <nobr>🟢 Functional</nobr> |
 | **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices ([PR #15191](https://github.com/vllm-project/vllm/pull/15191))</nobr>|
 | **Spec Decode** | <nobr>🚧 WIP ([PR #13933](https://github.com/vllm-project/vllm/pull/13933))</nobr>|
 | **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr>|
 | **Structured Output Alternative Backends** | <nobr>🟢 Functional</nobr> |
-| **Embedding Models** | <nobr>🚧 WIP ([PR #16188](https://github.com/vllm-project/vllm/pull/16188))</nobr> |
-| **Mamba Models** | <nobr>🟡 Planned</nobr> |
-| **Encoder-Decoder Models** | <nobr>🟠 Delayed</nobr> |
 | **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> |
 | **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
 | **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
 | **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |
 
-- **🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
-- **🟢 Functional**: Fully operational, with ongoing optimizations.
-- **🚧 WIP**: Under active development.
-- **🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
-- **🟠 Delayed**: Temporarily dropped in V1 but planned to be re-introduced later.
-- **🔴 Deprecated**: Not planned for V1 unless there is strong demand.
+!!! note
 
-**Note**: vLLM V1’s unified scheduler treats both prompt and output tokens the same
-way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
-allocate a fixed token budget per request, enabling features like chunked prefills,
-prefix caching, and speculative decoding without a strict separation between prefill
-and decode phases.
+    vLLM V1’s unified scheduler treats both prompt and output tokens the same
+    way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
+    allocate a fixed token budget per request, enabling features like chunked prefills,
+    prefix caching, and speculative decoding without a strict separation between prefill
+    and decode phases.
 
-### Semantic Changes and Deprecated Features
-
-#### Logprobs
+#### Semantic Changes to Logprobs
 
 vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
 differences compared to V0:
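The unified scheduler note added in this hunk describes a per-step token budget keyed by request ID. As a rough illustration of that idea only (this is not vLLM's actual scheduler; every name below is hypothetical), a greedy allocator over such a dictionary could look like this:

```python
# Illustrative only: a toy token-budget allocator in the spirit of the
# unified-scheduler note above. All names here are hypothetical, not vLLM internals.
from typing import Dict

def allocate_token_budget(
    num_scheduled_tokens_per_request: Dict[str, int],
    token_budget: int,
) -> Dict[str, int]:
    """Greedily cap each request's tokens so the total stays within budget.

    Prompt (prefill) and output (decode) tokens are treated identically:
    a request that still has thousands of prompt tokens to process and a
    request that needs one decode token both just ask for "N tokens" this step.
    """
    allocation: Dict[str, int] = {}
    remaining = token_budget
    for request_id, num_tokens in num_scheduled_tokens_per_request.items():
        if remaining <= 0:
            break
        granted = min(num_tokens, remaining)
        allocation[request_id] = granted
        remaining -= granted
    return allocation

# Example: two decoding requests and one chunked prefill share an 8192-token step budget.
print(allocate_token_budget({"b": 1, "c": 1, "a": 9000}, token_budget=8192))
# -> {'b': 1, 'c': 1, 'a': 8190}
```

In this toy policy a chunked prefill and ongoing decodes simply draw from the same budget, which is the point of the note: there is no separate prefill phase.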
@@ -96,6 +137,14 @@ Support for logprobs with post-sampling adjustments is in progress and will be a
 
 Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414).
 
+#### WIP Features
+
+These features are already supported in vLLM V1, but their optimization is still
+in progress.
+
+- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
+  will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize the support for Eagle, MTP compared to draft model based spec decode.
+
 #### Deprecated Features
 
 As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.
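The hunk above notes that prompt logprobs currently require prefix caching to be turned off. A minimal sketch using the offline `LLM` API (the model name and parameter values are placeholders; the CLI equivalent of the constructor flag is `--no-enable-prefix-caching`):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",        # placeholder model
    enable_prefix_caching=False,      # required today for prompt logprobs
)
params = SamplingParams(
    max_tokens=16,
    logprobs=5,         # top-5 logprobs for each sampled token
    prompt_logprobs=5,  # top-5 logprobs for each prompt token
)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].prompt_logprobs)       # logprobs over the prompt tokens
print(outputs[0].outputs[0].logprobs)   # logprobs over the generated tokens
```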
@@ -115,39 +164,4 @@ to handle request preemptions.
 
 **Structured Output features**
 
-- **Request-level Structured Output Backend**: Deprecated, alternative backends
-(outlines, guidance) with fallbacks is WIP.
-### Feature & Model Support in Progress
-
-Although we have re-implemented and partially optimized many features and models from V0 in vLLM V1, optimization work is still ongoing for some, and others remain unsupported.
-
-#### Features to Be Optimized
-
-These features are already supported in vLLM V1, but their optimization is still
-in progress.
-
-- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
-will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize the support for Eagle, MTP compared to draft model based spec decode.
-
-- **Multimodal Models**: V1 is almost fully compatible with V0 except that interleaved modality input is not supported yet.
-See [here](https://github.com/orgs/vllm-project/projects/8) for the status of upcoming features and optimizations.
-
-#### Models to Be Supported
-
-vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol,
-and the majority fall into the following categories. V1 support for these models will be added eventually.
-
-**Embedding Models**
-The initial support will be provided by [PR #16188](https://github.com/vllm-project/vllm/pull/16188).
-
-Later, we will consider using [hidden states processor](https://github.com/vllm-project/vllm/issues/12249), which is based on [global logits processor](https://github.com/vllm-project/vllm/pull/13360) to enable simultaneous generation and embedding using the same engine instance in V1.
-
-**Mamba Models**
-Models using selective state-space mechanisms (instead of standard transformer attention)
-are not yet supported (e.g., `MambaForCausalLM`, `JambaForCausalLM`).
-
-**Encoder-Decoder Models**
-vLLM V1 is currently optimized for decoder-only transformers. Models requiring
-cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).
-
-For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).
+- **Request-level Structured Output Backend**: Deprecated, alternative backends (outlines, guidance) with fallbacks is supported now.
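For the **Spec Decode** item in the new "WIP Features" section, ngram-based speculative decoding is the variant described as working today. A hedged sketch of enabling it offline; the `speculative_config` keys below follow the speculative decoding docs from around this release and should be treated as assumptions that may differ between versions:

```python
# Sketch only: ngram (prompt-lookup) speculative decoding in V1.
# The exact speculative_config keys are assumptions and may change across versions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    speculative_config={
        "method": "ngram",            # draft tokens come from prompt lookup, no draft model
        "num_speculative_tokens": 5,  # how many draft tokens to propose per step
        "prompt_lookup_max": 4,       # longest n-gram to match against the prompt
    },
)
outputs = llm.generate(["Summarize the V1 user guide in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```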

vllm/engine/arg_utils.py

Lines changed: 2 additions & 1 deletion
@@ -1440,7 +1440,8 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool:
             _raise_or_fallback(feature_name=name, recommend_to_remove=False)
             return False
 
-        # Non-[CUDA, TPU] may be supported on V1, but off by default for now.
+        # Non-[CUDA, TPU, x86 CPU] may be supported on V1,
+        # but off by default for now.
         v0_hardware = not any(
             (current_platform.is_cuda_alike(), current_platform.is_tpu(),
              (current_platform.is_cpu()
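The comment change above concerns which platforms get V1 by default. As an illustration of the check's shape only: the `Platform` protocol and `is_x86_cpu` helper below are hypothetical stand-ins; only `is_cuda_alike()`, `is_tpu()`, and `is_cpu()` appear in the diff context, and the actual x86 test is elided by the hunk.

```python
# Illustrative sketch of the pattern "default to V1 only on CUDA-alike, TPU,
# or x86 CPU platforms". Not the actual vLLM implementation.
from typing import Protocol

class Platform(Protocol):
    def is_cuda_alike(self) -> bool: ...
    def is_tpu(self) -> bool: ...
    def is_cpu(self) -> bool: ...
    def is_x86_cpu(self) -> bool: ...  # hypothetical: is_cpu() and the CPU arch is x86

def default_to_v1(platform: Platform) -> bool:
    # Mirrors the updated comment: non-[CUDA, TPU, x86 CPU] platforms may work
    # on V1 but stay off by default for now.
    v0_hardware = not any((
        platform.is_cuda_alike(),
        platform.is_tpu(),
        platform.is_x86_cpu(),
    ))
    return not v0_hardware
```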
