
Commit 15df8be

[Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?

Add sleep-related doc and example.

Signed-off-by: wangli <wangli858794774@gmail.com>
1 parent e4e0b7a commit 15df8be

File tree

4 files changed: +173 -0 lines changed


.github/workflows/vllm_ascend_test.yaml

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ on:
 - '**/*.py'
 - '.github/workflows/vllm_ascend_test.yaml'
 - '!docs/**'
+- '!examples/**'
 - 'pytest.ini'
 - '!benchmarks/**'
 - 'tools/mypy.sh'

docs/source/index.md

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ user_guide/suppoted_features
 user_guide/supported_models
 user_guide/env_vars
 user_guide/additional_config
+user_guide/sleep_mode
 user_guide/graph_mode.md
 user_guide/quantization.md
 user_guide/release_notes

docs/source/user_guide/sleep_mode.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
# Sleep Mode

## Overview

Sleep Mode is an API designed to offload model weights and discard the KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly for online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.

Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free the KV cache and even offload the model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.

## Getting started

With `enable_sleep_mode=True`, vLLM routes its memory management (malloc/free) through a dedicated memory pool. While the model is loaded and the KV caches are initialized, the allocated memory is tagged, effectively forming a map: `{"weight": data, "kv_cache": data}`.
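
For illustration, here is a minimal sketch of what this looks like from the user side (the model name is only an example; a complete, runnable script is shown in the Usage section below):

```python
from vllm import LLM

# Route allocations through the sleep-mode memory pool so they can be tagged.
llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
# From here on, the NPU memory backing the weights and the KV cache carries the
# "weight" / "kv_cache" tags, so it can be offloaded or discarded selectively.
```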

The engine (v0/v1) supports two sleep levels to manage memory during idle periods (see the sketch after this list):

- Level 1 Sleep
  - Action: Offloads model weights and discards the KV cache.
  - Memory: Model weights are moved to CPU memory; the KV cache is forgotten.
  - Use Case: Suitable when reusing the same model later.
  - Note: Ensure sufficient CPU memory is available to hold the model weights.

- Level 2 Sleep
  - Action: Discards both the model weights and the KV cache.
  - Memory: The contents of both the model weights and the KV cache are forgotten.
  - Use Case: Ideal when switching to a different model or updating the current one.
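
The difference between the two levels can be seen directly on the offline API. Below is a minimal sketch, reusing the `llm` object created above; the comments paraphrase the level descriptions and do not add extra guarantees:

```python
# Level 1: weights are offloaded to CPU memory, the KV cache is discarded.
llm.sleep(level=1)
# ... run training or free the NPU for other work ...
llm.wake_up()  # weights come back from CPU; the KV cache is rebuilt

# Level 2: both weights and KV cache are discarded outright.
llm.sleep(level=2)
# ... switch to a different model or prepare updated weights ...
llm.wake_up()  # memory is re-allocated, but the old weight contents are gone
```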

Since this feature relies on the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, you need to follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build vllm-ascend from source in order to use sleep mode. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.

## Usage

The following is a simple example of how to use sleep mode.

- offline inference:

```python
import os

import torch
from vllm import LLM, SamplingParams
from vllm.utils import GiB_bytes


os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

if __name__ == "__main__":
    prompt = "How are you?"

    free, total = torch.npu.mem_get_info()
    print(f"Free memory before sleep: {free / 1024 ** 3:.2f} GiB")
    # record the NPU memory usage baseline in case other processes are running
    used_bytes_baseline = total - free
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
    sampling_params = SamplingParams(temperature=0, max_tokens=10)
    output = llm.generate(prompt, sampling_params)

    llm.sleep(level=1)

    free_npu_bytes_after_sleep, total = torch.npu.mem_get_info()
    print(f"Free memory after sleep: {free_npu_bytes_after_sleep / 1024 ** 3:.2f} GiB")
    used_bytes = total - free_npu_bytes_after_sleep - used_bytes_baseline
    # now the memory usage should be less than the model weights
    # (0.5B model, 1 GiB of weights)
    assert used_bytes < 1 * GiB_bytes

    llm.wake_up()
    output2 = llm.generate(prompt, sampling_params)
    # the output after wake-up should match the original output
    assert output[0].outputs[0].text == output2[0].outputs[0].text
```

- online serving:
:::{note}
Since there is a risk of malicious access, make sure you are running in a development environment and explicitly set the environment variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
:::

```bash
export VLLM_SERVER_DEV_MODE="1"
export VLLM_USE_V1="1"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export VLLM_USE_MODELSCOPE="True"

vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode

# after the server is up, POST to these endpoints

# sleep level 1
curl -X POST http://127.0.0.1:8000/sleep \
    -H "Content-Type: application/json" \
    -d '{"level": "1"}'

curl -X GET http://127.0.0.1:8000/is_sleeping

# sleep level 2
curl -X POST http://127.0.0.1:8000/sleep \
    -H "Content-Type: application/json" \
    -d '{"level": "2"}'

# wake up
curl -X POST http://127.0.0.1:8000/wake_up

# wake up with a tag; tags must be in ["weights", "kv_cache"]
curl -X POST "http://127.0.0.1:8000/wake_up?tags=weights"

curl -X GET http://127.0.0.1:8000/is_sleeping

# after sleep and wake up, the server is still available
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
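
For programmatic control, for example from an RL training loop, the same endpoints can be driven from Python. The following is a minimal sketch using the third-party `requests` package, mirroring the curl calls above; the base URL and payloads are taken from that example and should be adjusted to your deployment:

```python
import requests

BASE = "http://127.0.0.1:8000"  # same address as the curl examples above

# Put the server to sleep at level 1, then check its state.
requests.post(f"{BASE}/sleep", json={"level": "1"}).raise_for_status()
print(requests.get(f"{BASE}/is_sleeping").text)

# Wake only the weights first (mirroring the tag example), then finish waking up.
requests.post(f"{BASE}/wake_up", params={"tags": "weights"}).raise_for_status()
requests.post(f"{BASE}/wake_up").raise_for_status()
print(requests.get(f"{BASE}/is_sleeping").text)
```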
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#

import os

import torch
from vllm import LLM, SamplingParams
from vllm.utils import GiB_bytes

os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

if __name__ == "__main__":
    prompt = "How are you?"

    free, total = torch.npu.mem_get_info()
    print(f"Free memory before sleep: {free / 1024 ** 3:.2f} GiB")
    # record the NPU memory usage baseline in case other processes are running
    used_bytes_baseline = total - free
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
    sampling_params = SamplingParams(temperature=0, max_tokens=10)
    output = llm.generate(prompt, sampling_params)

    llm.sleep(level=1)

    free_npu_bytes_after_sleep, total = torch.npu.mem_get_info()
    print(
        f"Free memory after sleep: {free_npu_bytes_after_sleep / 1024 ** 3:.2f} GiB"
    )
    used_bytes = total - free_npu_bytes_after_sleep - used_bytes_baseline
    # now the memory usage should be less than the model weights
    # (0.5B model, 1 GiB of weights)
    assert used_bytes < 1 * GiB_bytes

    llm.wake_up()
    output2 = llm.generate(prompt, sampling_params)
    # the output after wake-up should match the original output
    assert output[0].outputs[0].text == output2[0].outputs[0].text
