
Commit b5b7e0e

[Doc] Add qwen3 embedding 8b guide (#1734)
1. Add a tutorial for Qwen3-Embedding-8B.
2. Remove VLLM_USE_V1=1 from the docs; it is no longer needed as of 0.9.2.

- vLLM version: v0.9.2
- vLLM main: vllm-project/vllm@5923ab9

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
1 parent 9c560b0 commit b5b7e0e

File tree

12 files changed: +106 -31 lines changed

docs/source/developer_guide/contribution/testing.md

Lines changed: 1 addition & 1 deletion
@@ -156,7 +156,7 @@ There are several principles to follow when writing unit tests:
 ```bash
 # Run unit tests
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/$(uname -m)-linux/devlib
-VLLM_USE_V1=1 TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
+TORCH_DEVICE_BACKEND_AUTOLOAD=0 pytest -sv tests/ut
 ```

 ::::

docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@
 single_npu
 single_npu_multimodal
 single_npu_audio
+single_npu_qwen3_embedding
 multi_npu
 multi_npu_moge
 multi_npu_qwen3_moe

docs/source/tutorials/multi_node.md

Lines changed: 0 additions & 1 deletion
@@ -108,7 +108,6 @@ export TP_SOCKET_IFNAME=$nic_name
 export HCCL_SOCKET_IFNAME=$nic_name
 export OMP_PROC_BIND=false
 export OMP_NUM_THREADS=100
-export VLLM_USE_V1=1
 export HCCL_BUFFSIZE=1024

 # The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8

docs/source/tutorials/multi_npu_quantization.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Multi-NPU (QwQ 32B W8A8)

-## Run docker container:
+## Run docker container
 :::{note}
 w8a8 quantization feature is supported by v0.8.4rc2 or higher
 :::

docs/source/tutorials/multi_npu_qwen3_moe.md

Lines changed: 0 additions & 3 deletions
@@ -35,9 +35,6 @@ export VLLM_USE_MODELSCOPE=True

 # Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
 export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
-
-# For vllm-ascend 0.9.2+, the V1 engine is enabled by default and no longer needs to be explicitly specified.
-export VLLM_USE_V1=1
 ```

 ### Online Inference on Multi-NPU

docs/source/tutorials/single_node_300i.md

Lines changed: 2 additions & 4 deletions
@@ -60,7 +60,6 @@ Run the following command to start the vLLM server:

 ```{code-block} bash
 :substitutions:
-export VLLM_USE_V1=1
 vllm serve Qwen/Qwen3-0.6B \
 --tensor-parallel-size 1 \
 --enforce-eager \
@@ -90,7 +89,6 @@ Run the following command to start the vLLM server:

 ```{code-block} bash
 :substitutions:
-export VLLM_USE_V1=1
 vllm serve Qwen/Qwen2.5-7B-Instruct \
 --tensor-parallel-size 2 \
 --enforce-eager \
@@ -129,7 +127,7 @@ Run the following command to start the vLLM server:
 ```{code-block} bash
 :substitutions:

-VLLM_USE_V1=1 vllm serve /home/pangu-pro-moe-mode/ \
+vllm serve /home/pangu-pro-moe-mode/ \
 --tensor-parallel-size 4 \
 --enable-expert-parallel \
 --dtype "float16" \
@@ -321,7 +319,7 @@ if __name__ == "__main__":

 Run script:
 ```bash
-VLLM_USE_V1=1 python example.py
+python example.py
 ```

 If you run this script successfully, you can see the info shown below:

docs/source/tutorials/single_npu.md

Lines changed: 2 additions & 6 deletions
@@ -50,8 +50,6 @@ Run the following script to execute offline inference on a single NPU:
 import os
 from vllm import LLM, SamplingParams

-os.environ["VLLM_USE_V1"] = "1"
-
 prompts = [
     "Hello, my name is",
     "The future of AI is",
@@ -77,8 +75,6 @@ for output in outputs:
 import os
 from vllm import LLM, SamplingParams

-os.environ["VLLM_USE_V1"] = "1"
-
 prompts = [
     "Hello, my name is",
     "The future of AI is",
@@ -133,7 +129,7 @@ docker run --rm \
 -e VLLM_USE_MODELSCOPE=True \
 -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
 -it $IMAGE \
-VLLM_USE_V1=1 vllm serve Qwen/Qwen3-8B --max_model_len 26240
+vllm serve Qwen/Qwen3-8B --max_model_len 26240
 ```
 ::::

@@ -158,7 +154,7 @@ docker run --rm \
 -e VLLM_USE_MODELSCOPE=True \
 -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
 -it $IMAGE \
-VLLM_USE_V1=1 vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
+vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
 ```
 ::::
 :::::

docs/source/tutorials/single_npu_audio.md

Lines changed: 0 additions & 3 deletions
@@ -29,9 +29,6 @@ docker run --rm \
 Setup environment variables:

 ```bash
-# Use vllm v1 engine
-export VLLM_USE_V1=1
-
 # Load model from ModelScope to speed up download
 export VLLM_USE_MODELSCOPE=True

docs/source/tutorials/single_npu_multimodal.md

Lines changed: 0 additions & 4 deletions
@@ -29,9 +29,6 @@ docker run --rm \
 Setup environment variables:

 ```bash
-# Use vllm v1 engine
-export VLLM_USE_V1=1
-
 # Load model from ModelScope to speed up download
 export VLLM_USE_MODELSCOPE=True

@@ -143,7 +140,6 @@ docker run --rm \
 -v /etc/ascend_install.info:/etc/ascend_install.info \
 -v /root/.cache:/root/.cache \
 -p 8000:8000 \
--e VLLM_USE_V1=1 \
 -e VLLM_USE_MODELSCOPE=True \
 -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
 -it $IMAGE \
docs/source/tutorials/single_npu_qwen3_embedding.md (new file)

Lines changed: 99 additions & 0 deletions
# Single NPU (Qwen3-Embedding-8B)

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only vLLM Ascend 0.9.2rc1 and higher supports the model.

## Run docker container

Take the Qwen3-Embedding-8B model as an example. First run the docker container with the following command:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```

Setup environment variables:

```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

### Online Inference

```bash
vllm serve Qwen/Qwen3-Embedding-8B --task embed
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-Embedding-8B",
  "messages": [
    {"role": "user", "content": "Hello"}
  ]
}'
```
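
The same request can also be issued from Python. Below is a minimal client-side sketch (not part of the committed file), assuming the `openai` package is installed and the server started above is reachable at `http://localhost:8000`; it uses the standard `input` field of the embeddings API, and the API key is a placeholder because the server only enforces one if it was configured with a key.

```python
# Minimal client-side sketch (assumption: `openai` package installed and the
# server from the tutorial above running on localhost:8000).
from openai import OpenAI

# Placeholder key; vLLM's OpenAI-compatible server accepts any value unless
# an API key was explicitly configured at startup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input=["Hello"],
)

# Each entry in response.data holds the embedding vector for one input string.
print(len(response.data[0].embedding))
```
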

### Offline Inference

```python
import torch
import vllm
from vllm import LLM


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'


if __name__ == "__main__":
    # Each query must come with a one-sentence instruction that describes the task
    task = 'Given a web search query, retrieve relevant passages that answer the query'

    queries = [
        get_detailed_instruct(task, 'What is the capital of China?'),
        get_detailed_instruct(task, 'Explain gravity')
    ]
    # No need to add instruction for retrieval documents
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
    input_texts = queries + documents

    model = LLM(model="Qwen/Qwen3-Embedding-8B",
                task="embed",
                distributed_executor_backend="mp")

    outputs = model.embed(input_texts)
    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
    scores = (embeddings[:2] @ embeddings[2:].T)
    print(scores.tolist())
```

If you run this script successfully, you can see the info shown below:

```bash
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 282.22it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](VllmWorker rank=0 pid=4074750) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 31.95it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.7477798461914062, 0.07548339664936066], [0.0886271521449089, 0.6311039924621582]]
```
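
As a follow-up (not part of the committed file), here is a short post-processing sketch that pairs each query with its highest-scoring document; the texts and score matrix are copied from the script and sample output above.

```python
# Post-processing sketch: rank documents per query from the printed score matrix.
# The queries, documents, and score values below are taken from the example above.
queries = ["What is the capital of China?", "Explain gravity"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
scores = [
    [0.7477798461914062, 0.07548339664936066],
    [0.0886271521449089, 0.6311039924621582],
]

for query, row in zip(queries, scores):
    # Pick the document with the highest similarity score for this query.
    best = max(range(len(documents)), key=lambda i: row[i])
    print(f"{query} -> {documents[best]} (score: {row[best]:.3f})")
```
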
