
Commit e4df0a4

Add Pangu MoE Pro for 300I series docs (#1516)
### What this PR does / why we need it?
Add Pangu MoE Pro for 300I series docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
1 parent cad4c69 commit e4df0a4


docs/source/tutorials/single_node_300i.md

Lines changed: 146 additions & 17 deletions
@@ -1,5 +1,9 @@
 # Single Node (Atlas 300I series)

+```{note}
+Support for the Atlas 300I series is currently experimental. In future versions, there may be behavioral changes around model coverage and performance improvements.
+```
+
 ## Run vLLM on Atlas 300I series

 Run docker container:
@@ -43,10 +47,16 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

 ### Online Inference on NPU

-Run the following script to start the vLLM server on NPU(Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards):
+Run the following script to start the vLLM server on NPU (Qwen3-0.6B: 1 card, Qwen2.5-7B-Instruct: 2 cards, Pangu-Pro-MoE-72B: 8 cards):

 :::::{tab-set}
+:sync-group: inference
+
 ::::{tab-item} Qwen3-0.6B
+:selected:
+:sync: qwen0.6
+
+Run the following command to start the vLLM server:

 ```{code-block} bash
 :substitutions:
@@ -66,9 +76,24 @@ python -m vllm.entrypoints.api_server \
 --port 8000 \
 --compilation-config '{"custom_ops":["+rms_norm", "+rotary_embedding"]}'
 ```
+
+Once your server is started, you can query the model with input prompts
+
+```bash
+curl http://localhost:8000/generate \
+-H "Content-Type: application/json" \
+-d '{
+"prompt": "Hello, my name is ?",
+"max_tokens": 20,
+"temperature": 0
+}'
+```
 ::::

 ::::{tab-item} Qwen/Qwen2.5-7B-Instruct
+:sync: qwen7b
+
+Run the following command to start the vLLM server:

 ```{code-block} bash
 :substitutions:
@@ -88,9 +113,6 @@ python -m vllm.entrypoints.api_server \
 --port 8000 \
 --compilation-config '{"custom_ops":["+rms_norm", "+rotary_embedding"]}'
 ```
-::::
-
-:::::

 Once your server is started, you can query the model with input prompts

@@ -104,38 +126,79 @@ curl http://localhost:8000/generate \
 }'
 ```

-If you run this script successfully, you can see the info shown below:
+::::
+
+::::{tab-item} Pangu-Pro-MoE-72B
+:sync: pangu
+
+Download the model:

 ```bash
-{"text":["The future of AI is ? \nA. 充满希望的 \nB. 不确定的 \nC. 危险的 \nD. 无法预测的 \n答案:A \n解析:"]}
+git lfs install
+git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
+```
+
+Run the following command to start the vLLM server:
+
+```{code-block} bash
+:substitutions:
+
+VLLM_USE_V1=1 vllm serve /home/pangu-pro-moe-mode/ \
+--tensor-parallel-size 4 \
+--enable-expert-parallel \
+--dtype "float16" \
+--trust-remote-code \
+--enforce-eager
+
+```
+
+Once your server is started, you can query the model with input prompts
+
+```bash
+export question="你是谁?"  # "Who are you?"
+curl http://localhost:8000/v1/completions \
+-H "Content-Type: application/json" \
+-d '{
+"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
+"max_tokens": 64,
+"top_p": 0.95,
+"top_k": 50,
+"temperature": 0.6
+}'
 ```

+::::
+:::::
+
+If the request succeeds, you will see the generated results.
+
 ### Offline Inference

-Run the following script to execute offline inference on NPU:
+Run the following script (`example.py`) to execute offline inference on NPU:

 :::::{tab-set}
+:sync-group: inference
+
 ::::{tab-item} Qwen3-0.6B
+:selected:
+:sync: qwen0.6

 ```{code-block} python
 :substitutions:
 from vllm import LLM, SamplingParams
 import gc
-import os
 import torch
 from vllm import LLM, SamplingParams
 from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                              destroy_model_parallel)
-os.environ["VLLM_USE_V1"] = "1"
+
 def clean_up():
     destroy_model_parallel()
     destroy_distributed_environment()
     gc.collect()
     torch.npu.empty_cache()
 prompts = [
     "Hello, my name is",
-    "The president of the United States is",
-    "The capital of France is",
     "The future of AI is",
 ]
 # Create a sampling params object.
@@ -165,26 +228,24 @@ clean_up()
 ::::

 ::::{tab-item} Qwen2.5-7B-Instruct
+:sync: qwen7b

 ```{code-block} python
 :substitutions:
 from vllm import LLM, SamplingParams
 import gc
-import os
 import torch
 from vllm import LLM, SamplingParams
 from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                              destroy_model_parallel)
-os.environ["VLLM_USE_V1"] = "1"
+
 def clean_up():
     destroy_model_parallel()
     destroy_distributed_environment()
     gc.collect()
     torch.npu.empty_cache()
 prompts = [
     "Hello, my name is",
-    "The president of the United States is",
-    "The capital of France is",
     "The future of AI is",
 ]
 # Create a sampling params object.
@@ -213,13 +274,81 @@ clean_up()

 ::::

+::::{tab-item} Pangu-Pro-MoE-72B
+:sync: pangu
+
+Download the model:
+
+```bash
+git lfs install
+git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
+```
+
+```{code-block} python
+:substitutions:
+
+import gc
+from transformers import AutoTokenizer
+import torch
+
+from vllm import LLM, SamplingParams
+from vllm.distributed.parallel_state import (destroy_distributed_environment,
+                                             destroy_model_parallel)
+
+def clean_up():
+    destroy_model_parallel()
+    destroy_distributed_environment()
+    gc.collect()
+    torch.npu.empty_cache()
+
+
+if __name__ == "__main__":
+
+    tokenizer = AutoTokenizer.from_pretrained("/home/pangu-pro-moe-mode/", trust_remote_code=True)
+    tests = [
+        "Hello, my name is",
+        "The future of AI is",
+    ]
+    prompts = []
+    for text in tests:
+        messages = [
+            {"role": "system", "content": ""},  # Optionally customize system content
+            {"role": "user", "content": text}
+        ]
+        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)  # The official chat template is recommended
+        prompts.append(prompt)
+    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
+
+    llm = LLM(model="/home/pangu-pro-moe-mode/",
+              tensor_parallel_size=8,
+              distributed_executor_backend="mp",
+              enable_expert_parallel=True,
+              dtype="float16",
+              max_model_len=1024,
+              trust_remote_code=True,
+              enforce_eager=True)
+
+    outputs = llm.generate(prompts, sampling_params)
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+    del llm
+    clean_up()
+```
+
+::::
 :::::

+Run the script:
+```bash
+VLLM_USE_V1=1 python example.py
+```
+
 If you run this script successfully, you can see the info shown below:

 ```bash
 Prompt: 'Hello, my name is', Generated text: " Lina. I'm a 22-year-old student from China. I'm interested in studying in the US. I'm looking for a job in the US. I want to know if there are any opportunities in the US for me to work. I'm also interested in the culture and lifestyle in the US. I want to know if there are any opportunities for me to work in the US. I'm also interested in the culture and lifestyle in the US. I'm interested in the culture"
-Prompt: 'The president of the United States is', Generated text: ' the same as the president of the United Nations. This is because the president of the United States is the same as the president of the United Nations. The president of the United States is the same as the president of the United Nations. The president of the United States is the same as the president of the United Nations. The president of the United States is the same as the president of the United Nations. The president of the United States is the same as the president of the United Nations. The president'
-Prompt: 'The capital of France is', Generated text: ' Paris. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of China is Beijing. The capital of Japan is Tokyo. The capital of India is New Delhi. The capital of Brazil is Brasilia. The capital of Egypt is Cairo. The capital of South Africa is Cape Town. The capital of Nigeria is Abuja. The capital of Lebanon is Beirut. The capital of Morocco is Rabat. The capital of Indonesia is Jakarta. The capital of Peru is Lima. The'
 Prompt: 'The future of AI is', Generated text: " not just about the technology itself, but about how we use it to solve real-world problems. As AI continues to evolve, it's important to consider the ethical implications of its use. AI has the potential to bring about significant changes in society, but it also has the power to create new challenges. Therefore, it's crucial to develop a comprehensive approach to AI that takes into account both the benefits and the risks associated with its use. This includes addressing issues such as bias, privacy, and accountability."
 ```
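
For readers following the tutorial content in this diff, the `curl` call against the legacy `/generate` endpoint can also be issued from Python. The sketch below is illustrative and not part of the commit: it assumes the Qwen server from the online-inference tab is already running on `localhost:8000` and that the `requests` package is installed; the JSON body simply mirrors the one used in the docs.

```python
# Minimal sketch: Python equivalent of the docs' curl example for the
# /generate endpoint started via `python -m vllm.entrypoints.api_server`.
import requests

payload = {
    "prompt": "Hello, my name is ?",
    "max_tokens": 20,
    "temperature": 0,
}
resp = requests.post("http://localhost:8000/generate", json=payload, timeout=60)
resp.raise_for_status()
# The legacy api_server replies with a JSON object of the form {"text": [...]},
# as shown in the example output removed by this commit.
print(resp.json()["text"])
```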

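Likewise, the Pangu `/v1/completions` request can be sent from Python, reusing the tokenizer's chat template from the offline example instead of hand-writing the `[unused9]`/`[unused10]` markers. This is an illustrative sketch under the same assumptions (server started locally with `vllm serve`, model directory as used in the diff, `requests` and `transformers` available); the request fields mirror the curl body above, with `top_k` passed through as an extra sampling field.

```python
# Sketch: query the OpenAI-compatible completions endpoint started via
# `vllm serve`, building the prompt with the model's own chat template.
import requests
from transformers import AutoTokenizer

MODEL_DIR = "/home/pangu-pro-moe-mode/"  # same path as in the tutorial

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "你是谁?"},  # "Who are you?"
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

payload = {
    "prompt": prompt,
    "max_tokens": 64,
    "top_p": 0.95,
    "top_k": 50,
    "temperature": 0.6,
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```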