
Commit 6dd55af

[Doc] Update docs on handling OOM (vllm-project#15357)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
1 parent 3eb08ed commit 6dd55af

File tree

6 files changed: +24 -9 lines changed

docs/source/getting_started/installation/cpu.md

Lines changed: 1 addition & 1 deletion
@@ -193,7 +193,7 @@ vLLM CPU backend supports the following vLLM features:
 
 ## Related runtime environment variables
 
-- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
+- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
 - `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores.
 - `VLLM_CPU_MOE_PREPACK`: whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
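Editor's note: these variables are read when the engine starts, so they must be set beforehand. A minimal sketch, not part of the diff; the values and the model name are illustrative:

```python
import os

# Set before vLLM spins up its CPU workers; values here are examples only.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "40"       # 40 GiB reserved for the KV cache
os.environ["VLLM_CPU_OMP_THREADS_BIND"] = "0-31"  # pin OpenMP threads to cores 0-31

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # illustrative model name
```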

docs/source/getting_started/v1_user_guide.md

Lines changed: 5 additions & 2 deletions
@@ -156,6 +156,9 @@ vLLM V1 is currently optimized for decoder-only transformers. Models requiring
 For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).
 
-## FAQ
+## Frequently Asked Questions
 
-TODO
+**I'm using vLLM V1 and I'm getting CUDA OOM errors. What should I do?**
+The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1. If you encounter CUDA OOM only when using V1 engine, try setting a lower value of `max_num_seqs` or `gpu_memory_utilization`.
+
+On the other hand, if you get an error about insufficient memory for the cache blocks, you should increase `gpu_memory_utilization` as this indicates that your GPU has sufficient memory but you're not allocating enough to vLLM for KV cache blocks.
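Editor's note: a sketch of the two mitigations described in the new FAQ entry; the concrete values are illustrative, not recommendations from the commit:

```python
from vllm import LLM

# CUDA OOM on V1: dial the batch size back toward the V0 default and/or
# claim a smaller fraction of GPU memory up front.
llm = LLM(
    model="facebook/opt-125m",   # illustrative model name
    max_num_seqs=256,            # V1 default is 1024; 256 was the V0 default
    gpu_memory_utilization=0.8,  # lower than the 0.9 default to leave headroom
)

# Conversely, "insufficient memory for cache blocks" means too little of the
# GPU was given to vLLM: raise gpu_memory_utilization instead, e.g. 0.95.
```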

docs/source/serving/engine_args.md

Lines changed: 7 additions & 2 deletions
@@ -2,7 +2,12 @@
 
 # Engine Arguments
 
-Below, you can find an explanation of every engine argument for vLLM:
+Engine arguments control the behavior of the vLLM engine.
+
+- For [offline inference](#offline-inference), they are part of the arguments to `LLM` class.
+- For [online serving](#openai-compatible-server), they are part of the arguments to `vllm serve`.
+
+Below, you can find an explanation of every engine argument:
 
 <!--- pyml disable-num-lines 7 no-space-in-emphasis -->
 ```{eval-rst}

@@ -15,7 +20,7 @@ Below, you can find an explanation of every engine argument for vLLM:
 
 ## Async Engine Arguments
 
-Below are the additional arguments related to the asynchronous engine:
+Additional arguments are available to the asynchronous engine which is used for online serving:
 
 <!--- pyml disable-num-lines 7 no-space-in-emphasis -->
 ```{eval-rst}
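Editor's note: to make the offline/online split above concrete, a short sketch using `max_model_len` as one example engine argument (values illustrative):

```python
from vllm import LLM

# Offline inference: engine arguments are keyword arguments to the LLM class.
llm = LLM(model="facebook/opt-125m", max_model_len=2048)

# Online serving: the same engine argument becomes a CLI flag of `vllm serve`:
#
#   vllm serve facebook/opt-125m --max-model-len 2048
```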

docs/source/serving/offline_inference.md

Lines changed: 7 additions & 0 deletions
@@ -97,6 +97,13 @@ llm = LLM(model="adept/fuyu-8b",
           max_num_seqs=2)
 ```
 
+#### Adjust cache size
+
+If you run out of CPU RAM, try the following options:
+
+- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
+- (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
+
 ### Performance optimization and tuning
 
 You can potentially improve the performance of vLLM by finetuning various options.
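Editor's note: a sketch of applying both options before constructing the engine (sizes illustrative; `adept/fuyu-8b` is the model used earlier in this doc):

```python
import os

# Shrink the multi-modal input cache below its 4 GiB default.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"
# (CPU backend only) the KV cache budget, also 4 GiB by default.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "4"

from vllm import LLM

llm = LLM(model="adept/fuyu-8b", max_num_seqs=2)
```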

vllm/envs.py

Lines changed: 3 additions & 3 deletions
@@ -340,7 +340,7 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
     lambda: os.getenv("VLLM_PP_LAYER_PARTITION", None),
 
     # (CPU backend only) CPU key-value cache space.
-    # default is 4GB
+    # default is 4 GiB
     "VLLM_CPU_KVCACHE_SPACE":
     lambda: int(os.getenv("VLLM_CPU_KVCACHE_SPACE", "0")),

@@ -412,9 +412,9 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
     lambda: int(os.getenv("VLLM_AUDIO_FETCH_TIMEOUT", "10")),
 
     # Cache size (in GiB) for multimodal input cache
-    # Default is 8GiB
+    # Default is 4 GiB
     "VLLM_MM_INPUT_CACHE_GIB":
-    lambda: int(os.getenv("VLLM_MM_INPUT_CACHE_GIB", "8")),
+    lambda: int(os.getenv("VLLM_MM_INPUT_CACHE_GIB", "4")),
 
     # Path to the XLA persistent cache directory.
     # Only used for XLA devices such as TPUs.
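Editor's note: the entries above use the stdlib getenv-with-fallback idiom, so the new default only applies when the variable is unset. A standalone sketch:

```python
import os

# Unset -> fall back to "4", mirroring the updated VLLM_MM_INPUT_CACHE_GIB entry.
print(int(os.getenv("VLLM_MM_INPUT_CACHE_GIB", "4")))  # 4

os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "8"
print(int(os.getenv("VLLM_MM_INPUT_CACHE_GIB", "4")))  # 8, the env var wins
```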

vllm/platforms/cpu.py

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
         if kv_cache_space == 0:
             cache_config.cpu_kvcache_space_bytes = 4 * GiB_bytes  # type: ignore
             logger.warning(
-                "Environment variable VLLM_CPU_KVCACHE_SPACE (GB) "
+                "Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) "
                 "for CPU backend is not set, using 4 by default.")
         else:
             cache_config.cpu_kvcache_space_bytes = kv_cache_space * GiB_bytes  # type: ignore # noqa
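Editor's note: the GB to GiB wording matters because the code multiplies by a binary gibibyte. A quick check, assuming `GiB_bytes = 1 << 30` as in `vllm.utils`:

```python
GiB_bytes = 1 << 30  # binary gibibyte, assumed to match vllm.utils

kv_cache_space = 40  # e.g. VLLM_CPU_KVCACHE_SPACE=40
print(kv_cache_space * GiB_bytes)  # 42949672960 bytes
print(kv_cache_space * 10**9)      # 40000000000 bytes if the unit were decimal GB
# The ~7% difference is why the docs and warning now say GiB rather than GB.
```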
