# Sleep Mode

## Overview

Sleep Mode is an API designed to offload model weights and discard the KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.

Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free the KV cache and even offload the model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.

## Getting started
With `enable_sleep_mode=True`, all memory that vLLM allocates and frees (malloc, free) is managed through a dedicated memory pool. While loading the model and initializing the KV caches, we tag the memory, effectively building a map: `{"weights": data, "kv_cache": data}`.
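
Conceptually, the pool is a tag-keyed allocator: sleep and wake-up operate on one tag at a time, so the KV cache can be dropped without touching the weights. The toy sketch below is illustrative only (plain Python standing in for real AscendCL allocations, with hypothetical names):

```python
# Toy model of a tag-keyed memory pool (illustrative; the real pool
# allocates NPU memory through AscendCL, not Python bytearrays).
class TaggedPool:
    def __init__(self):
        # tag -> list of live allocations
        self.buffers = {"weights": [], "kv_cache": []}

    def allocate(self, tag: str, nbytes: int) -> bytearray:
        buf = bytearray(nbytes)  # stands in for an NPU buffer
        self.buffers[tag].append(buf)
        return buf

    def sleep(self, level: int) -> None:
        # Level 1: discard the KV cache, keep (offload) the weights.
        # Level 2: discard both.
        self.buffers["kv_cache"].clear()
        if level == 2:
            self.buffers["weights"].clear()

pool = TaggedPool()
pool.allocate("weights", 1024)
pool.allocate("kv_cache", 4096)
pool.sleep(level=1)  # KV cache freed; weights survive
```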

The engine (v0/v1) supports two sleep levels to manage memory during idle periods (a short usage sketch follows the list):

- Level 1 Sleep
  - Action: Offloads model weights and discards the KV cache.
  - Memory: Model weights are moved to CPU memory; the KV cache is forgotten.
  - Use Case: Suitable when reusing the same model later.
  - Note: Ensure sufficient CPU memory is available to hold the model weights.

- Level 2 Sleep
  - Action: Discards both model weights and KV cache.
  - Memory: The contents of both the model weights and the KV cache are forgotten.
  - Use Case: Ideal when switching to a different model or updating the current one.

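A minimal sketch of the two levels, assuming the same `LLM` API as the offline example below (after a level 2 sleep, the weight contents are gone, so they must be restored, e.g. by reloading or by a trainer pushing updated weights, before generating again):

```python
# Level 1: weights offloaded to CPU, KV cache dropped; wake_up restores both.
llm.sleep(level=1)
llm.wake_up()   # ready to generate again immediately

# Level 2: weights and KV cache are both discarded.
llm.sleep(level=2)
llm.wake_up()   # buffers are re-allocated, but the weight contents are gone
# restore or update the weights here before calling llm.generate()
```
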
Since this feature uses the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, you need to follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build from source in order to use sleep mode. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.
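
For reference, a from-source build looks roughly like the sketch below; treat the installation guide linked above as the authoritative, CANN-specific procedure:

```bash
# Setting this explicitly is only needed on v0.7.3; it defaults to 1
# on v0.9.x+ when building from source.
export COMPILE_CUSTOM_KERNELS=1

git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
```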

## Usage

The following is a simple example of how to use sleep mode.

- offline inference:

  ```python
  import os

  import torch
  from vllm import LLM, SamplingParams
  from vllm.utils import GiB_bytes


  os.environ["VLLM_USE_V1"] = "1"
  os.environ["VLLM_USE_MODELSCOPE"] = "True"
  os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

  if __name__ == "__main__":
      prompt = "How are you?"

      free, total = torch.npu.mem_get_info()
      print(f"Free memory before sleep: {free / 1024 ** 3:.2f} GiB")
      # record the NPU memory usage baseline in case other processes are running
      used_bytes_baseline = total - free
      llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
      sampling_params = SamplingParams(temperature=0, max_tokens=10)
      output = llm.generate(prompt, sampling_params)

      llm.sleep(level=1)

      free_npu_bytes_after_sleep, total = torch.npu.mem_get_info()
      print(f"Free memory after sleep: {free_npu_bytes_after_sleep / 1024 ** 3:.2f} GiB")
      used_bytes = total - free_npu_bytes_after_sleep - used_bytes_baseline
      # now the memory usage should be less than the model weights
      # (0.5B model, ~1 GiB of weights)
      assert used_bytes < 1 * GiB_bytes

      llm.wake_up()
      output2 = llm.generate(prompt, sampling_params)
      # after waking up, generation should reproduce the pre-sleep output
      assert output[0].outputs[0].text == output2[0].outputs[0].text
  ```
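
  `wake_up` also accepts a `tags` argument, which lets an RL trainer restore memory in stages: bring back the weights first, push the updated parameters, and only then re-allocate the KV cache. A hedged sketch, assuming the tag names shown in the serving example below:

  ```python
  llm.sleep(level=1)
  # ... training step runs while the NPU memory is free ...

  llm.wake_up(tags=["weights"])   # restore the weight buffers first
  # push updated weights to the inference engine here
  llm.wake_up(tags=["kv_cache"])  # then re-allocate the KV cache and resume generation
  ```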

- online serving:
  :::{note}
  Since these endpoints could be abused if exposed publicly, make sure you are running in a development environment and explicitly set the environment variable `VLLM_SERVER_DEV_MODE` to expose the sleep/wake up endpoints.
  :::

  ```bash
  export VLLM_SERVER_DEV_MODE="1"
  export VLLM_USE_V1="1"
  export VLLM_WORKER_MULTIPROC_METHOD="spawn"
  export VLLM_USE_MODELSCOPE="True"

  vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode

  # after the server is up, POST to these endpoints

  # sleep level 1
  curl -X POST http://127.0.0.1:8000/sleep \
      -H "Content-Type: application/json" \
      -d '{"level": "1"}'

  curl -X GET http://127.0.0.1:8000/is_sleeping

  # sleep level 2
  curl -X POST http://127.0.0.1:8000/sleep \
      -H "Content-Type: application/json" \
      -d '{"level": "2"}'

  # wake up
  curl -X POST http://127.0.0.1:8000/wake_up

  # wake up with a tag; tags must be in ["weights", "kv_cache"]
  curl -X POST "http://127.0.0.1:8000/wake_up?tags=weights"

  curl -X GET http://127.0.0.1:8000/is_sleeping

  # after sleep and wake up, the server is still available
  curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "Qwen/Qwen2.5-0.5B-Instruct",
          "prompt": "The future of AI is",
          "max_tokens": 7,
          "temperature": 0
      }'
  ```
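
  If you drive these endpoints from a training script rather than the shell, the same calls map directly onto an HTTP client. A minimal sketch using `requests`, assuming the server started above is listening on port 8000:

  ```python
  import requests

  BASE = "http://127.0.0.1:8000"

  # sleep (level 1), check the reported state, then wake up
  requests.post(f"{BASE}/sleep", json={"level": "1"}).raise_for_status()
  print(requests.get(f"{BASE}/is_sleeping").json())  # server-reported sleep state

  requests.post(f"{BASE}/wake_up").raise_for_status()
  print(requests.get(f"{BASE}/is_sleeping").json())
  ```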