
Commit 1025344

Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode (#1374)
### What this PR does / why we need it?
Doc Enhancement: Single NPU (Qwen3-8B) aclgraph mode + eager mode. Related RFC: #1248

### Does this PR introduce _any_ user-facing change?
No changes.

### How was this patch tested?
Preview

Signed-off-by: leo-pony <nengjunma@outlook.com>
1 parent 53c2d58 commit 1025344

File tree: 1 file changed (+73, −4)

docs/source/tutorials/single_npu.md

Lines changed: 73 additions & 4 deletions
@@ -42,22 +42,62 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

 Run the following script to execute offline inference on a single NPU:

-```python
+:::::{tab-set}
+::::{tab-item} Graph Mode
+
+```{code-block} python
+:substitutions:
+import os
 from vllm import LLM, SamplingParams

+os.environ["VLLM_USE_V1"] = "1"
+
 prompts = [
     "Hello, my name is",
     "The future of AI is",
 ]
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM(model="Qwen/Qwen3-8B", max_model_len=26240)
+llm = LLM(
+    model="Qwen/Qwen3-8B",
+    max_model_len=26240
+)

 outputs = llm.generate(prompts, sampling_params)
 for output in outputs:
     prompt = output.prompt
     generated_text = output.outputs[0].text
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
+::::
+
+::::{tab-item} Eager Mode
+
+```{code-block} python
+:substitutions:
+import os
+from vllm import LLM, SamplingParams
+
+os.environ["VLLM_USE_V1"] = "1"
+
+prompts = [
+    "Hello, my name is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(
+    model="Qwen/Qwen3-8B",
+    max_model_len=26240,
+    enforce_eager=True
+)
+
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+::::
+:::::

 If you run this script successfully, you can see the info shown below:

@@ -70,9 +110,11 @@ Prompt: 'The future of AI is', Generated text: ' following you. As the technolog

 Run docker container to start the vLLM server on a single NPU:

+:::::{tab-set}
+::::{tab-item} Graph Mode
+
 ```{code-block} bash
 :substitutions:
-
 # Update the vllm-ascend image
 export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
 docker run --rm \
@@ -91,8 +133,35 @@ docker run --rm \
 -e VLLM_USE_MODELSCOPE=True \
 -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
 -it $IMAGE \
-vllm serve Qwen/Qwen3-8B --max_model_len 26240
+VLLM_USE_V1=1 vllm serve Qwen/Qwen3-8B --max_model_len 26240
+```
+::::
+
+::::{tab-item} Eager Mode
+
+```{code-block} bash
+:substitutions:
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device /dev/davinci0 \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+VLLM_USE_V1=1 vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
 ```
+::::
+:::::

 :::{note}
 Add `--max_model_len` option to avoid ValueError that the Qwen2.5-7B model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (26240). This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
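
Editor's note (not part of this commit): once either serving container above is running, the endpoint can be checked with an OpenAI-compatible completions request. This is a minimal sketch that assumes the default `-p 8000:8000` port mapping shown in the docker commands and that the model has finished loading.

```bash
# Query the vLLM server started by either the Graph Mode or Eager Mode container.
# Assumes port 8000 is published to the host as in the docker run commands above.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-8B",
        "prompt": "The future of AI is",
        "max_tokens": 64,
        "temperature": 0.8
      }'
```

The same request works against both containers: graph mode versus eager mode only changes how the model executes on the NPU, not the serving API.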
