docs/features/spec_decode.md (3 additions, 3 deletions)

@@ -217,8 +217,8 @@ an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https
 A few important things to consider when using the EAGLE based draft models:

 1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
-   be able to be loaded and used directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
-   If you are using vllm version before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), please use the
+   be able to be loaded and used directly by vLLM after <gh-pr:12304>.
+   If you are using vllm version before <gh-pr:12304>, please use the
    [script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
    and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue.
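For readers of the hunk above: in recent vLLM releases the `speculative_config` mentioned there is a dict accepted by the `LLM` constructor (and the equivalent engine arguments). A minimal sketch of wiring an EAGLE draft model that way might look like the following; the target model name, the draft-model path, the `method` key, and the number of speculative tokens are illustrative assumptions, not values taken from this diff.

```python
from vllm import LLM, SamplingParams

# Minimal sketch (assumptions, not prescribed by the diff above): pair a target
# model with an EAGLE draft model via the dict-style `speculative_config`.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # placeholder target model
    speculative_config={
        "method": "eagle",                          # assumed key selecting EAGLE-style drafting
        "model": "path/to/modified/eagle/model",    # converted draft model, as described above
        "num_speculative_tokens": 5,                # draft tokens proposed per step
    },
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```

If your vLLM version predates the dict-style `speculative_config`, the older flat speculative arguments apply instead; check the documentation for the version you are running.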
@@ -228,7 +228,7 @@ A few important things to consider when using the EAGLE based draft models:

 3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
    reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
-   investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
+   investigation and tracked here: <gh-issue:9565>.

 A variety of EAGLE draft models are available on the Hugging Face hub:
docs/usage/troubleshooting.md (2 additions, 2 deletions)

@@ -212,7 +212,7 @@ if __name__ == '__main__':

 ## `torch.compile` Error

-vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
+vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](gh-pr:10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:

 ??? Code
@@ -231,7 +231,7 @@ vLLM heavily depends on `torch.compile` to optimize the model for better perform
     print(f(x))
     ```

-If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for example.
+If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See <gh-issue:12219> for example.
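The check script itself is only partially visible in this diff (its closing `print(f(x))` line appears above). As a stand-in, a minimal script along these lines exercises `torch.compile` with the Inductor/Triton backend; it is an illustrative sketch, not the exact script from the troubleshooting page.

```python
import torch

# Illustrative stand-in for the troubleshooting check (not the exact script
# from the docs page): compile a tiny function and run it once. A healthy
# `torch.compile` + Triton/Inductor setup prints a tensor; a broken one
# typically raises from `torch/_inductor`.
@torch.compile
def f(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x) + torch.cos(x)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(8, device=device)
    print(f(x))
```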
docs/usage/v1_guide.md (12 additions, 12 deletions)

@@ -2,7 +2,7 @@

 !!! announcement

-    We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
+    We have started the process of deprecating V0. Please read [RFC #18571](gh-issue:18571) for more details.

 V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

@@ -83,7 +83,7 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
 |**GPU <> CPU KV Cache Swapping**| <nobr>🔴 Deprecated</nobr> |

 !!! note

@@ -153,19 +153,19 @@ Support for logprobs with post-sampling adjustments is in progress and will be a

 **Prompt Logprobs with Prefix Caching**

-Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414).
+Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](gh-issue:13414).

 #### Deprecated Features

 As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.

 **Sampling features**

-- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
+- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361).

 - **Per-Request Logits Processors**: In V0, users could pass custom
   processing functions to adjust logits on a per-request basis. In vLLM V1, this
   feature has been deprecated. Instead, the design is moving toward supporting **global logits
-  processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360).
+  processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360).
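Relating to the prompt-logprobs note in the hunk above: `--no-enable-prefix-caching` is the server flag, and for offline inference the equivalent is the `enable_prefix_caching` engine argument. A hedged sketch follows; the model name and the logprob count are placeholders, not values from this diff.

```python
from vllm import LLM, SamplingParams

# Sketch only: request prompt logprobs with prefix caching disabled, mirroring
# the `--no-enable-prefix-caching` flag discussed above. The model name and the
# number of logprobs per prompt token are illustrative placeholders.
llm = LLM(
    model="facebook/opt-125m",       # placeholder model
    enable_prefix_caching=False,     # offline equivalent of --no-enable-prefix-caching
)

params = SamplingParams(max_tokens=16, prompt_logprobs=5)

for out in llm.generate(["vLLM is a fast inference engine."], params):
    print(out.prompt_logprobs)       # logprobs for each prompt token (None for the first)
```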