
Commit 15df8be

[Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?

Add sleep-related doc and example.

Signed-off-by: wangli <wangli858794774@gmail.com>
1 parent e4e0b7a commit 15df8be

File tree

4 files changed: +173 -0 lines changed


.github/workflows/vllm_ascend_test.yaml

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ on:
 - '**/*.py'
 - '.github/workflows/vllm_ascend_test.yaml'
 - '!docs/**'
+- '!examples/**'
 - 'pytest.ini'
 - '!benchmarks/**'
 - 'tools/mypy.sh'

docs/source/index.md

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ user_guide/suppoted_features
 user_guide/supported_models
 user_guide/env_vars
 user_guide/additional_config
+user_guide/sleep_mode
 user_guide/graph_mode.md
 user_guide/quantization.md
 user_guide/release_notes

docs/source/user_guide/sleep_mode.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
# Sleep Mode

## Overview

Sleep Mode is an API designed to offload model weights and discard the KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly for online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.

Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free the KV cache and even offload the model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.

## Getting started

With `enable_sleep_mode=True`, vLLM routes its memory management (malloc/free) through a dedicated memory pool. While the model is loaded and the KV caches are initialized, the allocated memory is tagged, effectively forming a map: `{"weight": data, "kv_cache": data}`.
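
For illustration, here is a minimal sketch of what this looks like from the user side (the model name is only an example; a complete, runnable script is shown in the Usage section below):

```python
from vllm import LLM

# Route allocations through the sleep-mode memory pool so they can be tagged.
llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
# From here on, the NPU memory backing the weights and the KV cache carries the
# "weight" / "kv_cache" tags, so it can be offloaded or discarded selectively.
```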

The engine (v0/v1) supports two sleep levels to manage memory during idle periods (see the sketch after this list):

- Level 1 Sleep
  - Action: Offloads model weights and discards the KV cache.
  - Memory: Model weights are moved to CPU memory; the KV cache is forgotten.
  - Use Case: Suitable when reusing the same model later.
  - Note: Ensure sufficient CPU memory is available to hold the model weights.

- Level 2 Sleep
  - Action: Discards both the model weights and the KV cache.
  - Memory: The contents of both the model weights and the KV cache are forgotten.
  - Use Case: Ideal when switching to a different model or updating the current one.
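
The difference between the two levels can be seen directly on the offline API. Below is a minimal sketch, reusing the `llm` object created above; the comments paraphrase the level descriptions and do not add extra guarantees:

```python
# Level 1: weights are offloaded to CPU memory, the KV cache is discarded.
llm.sleep(level=1)
# ... run training or free the NPU for other work ...
llm.wake_up()  # weights come back from CPU; the KV cache is rebuilt

# Level 2: both weights and KV cache are discarded outright.
llm.sleep(level=2)
# ... switch to a different model or prepare updated weights ...
llm.wake_up()  # memory is re-allocated, but the old weight contents are gone
```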

Since this feature relies on the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, you need to follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build vllm-ascend from source in order to use sleep mode. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.

## Usage

The following is a simple example of how to use sleep mode.

- offline inference:

```python
import os

import torch
from vllm import LLM, SamplingParams
from vllm.utils import GiB_bytes


os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

if __name__ == "__main__":
    prompt = "How are you?"

    free, total = torch.npu.mem_get_info()
    print(f"Free memory before sleep: {free / 1024 ** 3:.2f} GiB")
    # record the NPU memory usage baseline in case other processes are running
    used_bytes_baseline = total - free
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
    sampling_params = SamplingParams(temperature=0, max_tokens=10)
    output = llm.generate(prompt, sampling_params)

    llm.sleep(level=1)

    free_npu_bytes_after_sleep, total = torch.npu.mem_get_info()
    print(f"Free memory after sleep: {free_npu_bytes_after_sleep / 1024 ** 3:.2f} GiB")
    used_bytes = total - free_npu_bytes_after_sleep - used_bytes_baseline
    # now the memory usage should be less than the model weights
    # (0.5B model, 1 GiB of weights)
    assert used_bytes < 1 * GiB_bytes

    llm.wake_up()
    output2 = llm.generate(prompt, sampling_params)
    # the output after wake-up should match the original output
    assert output[0].outputs[0].text == output2[0].outputs[0].text
```

- online serving:
:::{note}
Since there is a risk of malicious access, make sure you are running in a development environment and explicitly set the environment variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
:::

```bash
export VLLM_SERVER_DEV_MODE="1"
export VLLM_USE_V1="1"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export VLLM_USE_MODELSCOPE="True"

vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode

# after the server is up, POST to these endpoints

# sleep level 1
curl -X POST http://127.0.0.1:8000/sleep \
    -H "Content-Type: application/json" \
    -d '{"level": "1"}'

curl -X GET http://127.0.0.1:8000/is_sleeping

# sleep level 2
curl -X POST http://127.0.0.1:8000/sleep \
    -H "Content-Type: application/json" \
    -d '{"level": "2"}'

# wake up
curl -X POST http://127.0.0.1:8000/wake_up

# wake up with a tag; tags must be in ["weights", "kv_cache"]
curl -X POST "http://127.0.0.1:8000/wake_up?tags=weights"

curl -X GET http://127.0.0.1:8000/is_sleeping

# after sleep and wake up, the server is still available
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
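
For programmatic control, for example from an RL training loop, the same endpoints can be driven from Python. The following is a minimal sketch using the third-party `requests` package, mirroring the curl calls above; the base URL and payloads are taken from that example and should be adjusted to your deployment:

```python
import requests

BASE = "http://127.0.0.1:8000"  # same address as the curl examples above

# Put the server to sleep at level 1, then check its state.
requests.post(f"{BASE}/sleep", json={"level": "1"}).raise_for_status()
print(requests.get(f"{BASE}/is_sleeping").text)

# Wake only the weights first (mirroring the tag example), then finish waking up.
requests.post(f"{BASE}/wake_up", params={"tags": "weights"}).raise_for_status()
requests.post(f"{BASE}/wake_up").raise_for_status()
print(requests.get(f"{BASE}/is_sleeping").text)
```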
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#

import os

import torch
from vllm import LLM, SamplingParams
from vllm.utils import GiB_bytes

os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

if __name__ == "__main__":
    prompt = "How are you?"

    free, total = torch.npu.mem_get_info()
    print(f"Free memory before sleep: {free / 1024 ** 3:.2f} GiB")
    # record the NPU memory usage baseline in case other processes are running
    used_bytes_baseline = total - free
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
    sampling_params = SamplingParams(temperature=0, max_tokens=10)
    output = llm.generate(prompt, sampling_params)

    llm.sleep(level=1)

    free_npu_bytes_after_sleep, total = torch.npu.mem_get_info()
    print(
        f"Free memory after sleep: {free_npu_bytes_after_sleep / 1024 ** 3:.2f} GiB"
    )
    used_bytes = total - free_npu_bytes_after_sleep - used_bytes_baseline
    # now the memory usage should be less than the model weights
    # (0.5B model, 1 GiB of weights)
    assert used_bytes < 1 * GiB_bytes

    llm.wake_up()
    output2 = llm.generate(prompt, sampling_params)
    # the output after wake-up should match the original output
    assert output[0].outputs[0].text == output2[0].outputs[0].text
