Commit 2e5f312

Cleanup unused doc (#1352)

### What this PR does / why we need it?
Clean up the unused doc for the MoGE model; we will add it back when the MoGE model is ready.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
1 parent c30ddb8 commit 2e5f312

File tree

- docs/source/tutorials/index.md
- docs/source/tutorials/multi_npu_moge.md (deleted)
- docs/source/tutorials/single_node_300i.md
- docs/source/user_guide/release_notes.md

4 files changed: +1 −201 lines

docs/source/tutorials/index.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -6,7 +6,6 @@
 single_npu
 single_npu_multimodal
 multi_npu
-multi_npu_moge
 multi_npu_quantization
 single_node_300i
 multi_node
```

docs/source/tutorials/multi_npu_moge.md

Lines changed: 0 additions & 117 deletions
This file was deleted.

docs/source/tutorials/single_node_300i.md

Lines changed: 1 addition & 80 deletions
```diff
@@ -43,7 +43,7 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
 
 ### Online Inference on NPU
 
-Run the following script to start the vLLM server on NPU(Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards, Pangu-Pro-MoE-72B: 8 cards):
+Run the following script to start the vLLM server on NPU(Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards):
 
 :::::{tab-set}
 ::::{tab-item} Qwen3-0.6B
```
````diff
@@ -90,30 +90,6 @@ python -m vllm.entrypoints.api_server \
 ```
 ::::
 
-::::{tab-item} Pangu-Pro-MoE-72B
-
-```{code-block} bash
-:substitutions:
-# Update the MODEL
-export MODEL="/path/to/pangu-pro-moe-model"
-export VLLM_USE_V1=1
-python -m vllm.entrypoints.api_server \
-    --model $MODEL \
-    --tensor-parallel-size 8 \
-    --max-num-batched-tokens 2048 \
-    --gpu-memory-utilization 0.5 \
-    --max-num-seqs 4 \
-    --enforce-eager \
-    --trust-remote-code \
-    --max-model-len 1024 \
-    --disable-custom-all-reduce \
-    --enable-expert-parallel \
-    --dtype float16 \
-    --port 8000 \
-    --compilation-config '{"custom_ops":["+rms_norm", "+rotary_embedding"]}' \
-    --additional-config '{"ascend_scheduler_config": {"enabled": true, "enable_chunked_prefill": false, "chunked_prefill_enabled": false}}'
-```
-::::
 :::::
 
 Once your server is started, you can query the model with input prompts
````
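The kept context line above notes that, once the server is started, you can query the model with input prompts. For reference, a minimal Python sketch of such a query against the demo `vllm.entrypoints.api_server` used in this tutorial; the port, the legacy `/generate` endpoint, and the request fields are assumptions that may differ across vLLM versions and deployments:

```python
# Hedged sketch: query the demo api_server started above.
# Assumptions: the server listens on localhost:8000 and exposes the legacy
# /generate endpoint; adjust the URL and fields to your vLLM version.
import requests

payload = {
    "prompt": "The future of AI is",
    "max_tokens": 64,
    "temperature": 0.0,
}
resp = requests.post("http://localhost:8000/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```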
````diff
@@ -237,61 +213,6 @@ clean_up()
 
 ::::
 
-::::{tab-item} Pangu-72B-MoE
-```{code-block} python
-:substitutions:
-import gc
-import os
-import torch
-from vllm import LLM, SamplingParams
-from vllm.distributed.parallel_state import (destroy_distributed_environment,
-                                             destroy_model_parallel)
-def clean_up():
-    destroy_model_parallel()
-    destroy_distributed_environment()
-    gc.collect()
-    torch.npu.empty_cache()
-os.environ["VLLM_USE_V1"] = "1"
-os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
-if __name__ == "__main__":
-    # Update the model_path
-    model_path="/path/to/pangu-pro-moe-model"
-    prompts = [
-        "Hello, my name is",
-        "The future of AI is",
-    ]
-    sampling_params = SamplingParams(min_tokens=8, max_tokens=8, temperature=0.0)
-    llm = LLM(model=model_path,
-              tensor_parallel_size=8,
-              max_num_batched_tokens=2048,
-              gpu_memory_utilization=0.5,
-              max_num_seqs=4,
-              enforce_eager=True, # For 300I series, only eager mode is supported.
-              trust_remote_code=True,
-              max_model_len=1024,
-              disable_custom_all_reduce=True, # IMPORTANT cause 300I series needed custom ops
-              enable_expert_parallel=True,
-              dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
-              compilation_config={"custom_ops":["+rms_norm", "+rotary_embedding"]}, # IMPORTANT cause 300I series needed custom ops
-              additional_config = {
-                  'ascend_scheduler_config': {
-                      'enabled': True,
-                      'enable_chunked_prefill' : False,
-                      'chunked_prefill_enabled': False
-                  }
-              }
-              )
-    # Generate texts from the prompts.
-    outputs = llm.generate(prompts, sampling_params)
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-    del llm
-    clean_up()
-```
-
-::::
 :::::
 
 If you run this script successfully, you can see the info shown below:
````
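For the models the tutorial still covers, the offline-inference pattern deleted above remains applicable. Below is a minimal sketch assuming Qwen2.5-7B-Instruct on 2 cards, reusing the 300I-specific flags from the removed example; the model name, card count, and flag selection are illustrative assumptions, not part of this commit:

```python
# Hedged sketch: offline inference on the 300I series, mirroring the pattern
# of the tab removed above. Model path and tensor_parallel_size are assumptions.
import gc
import os

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

if __name__ == "__main__":
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",  # assumption: adjust to your local path
              tensor_parallel_size=2,
              max_model_len=1024,
              enforce_eager=True,   # 300I series supports eager mode only
              dtype="float16",      # some ATB ops cannot run bf16 on the 300I series
              disable_custom_all_reduce=True,
              compilation_config={"custom_ops": ["+rms_norm", "+rotary_embedding"]})
    prompts = ["Hello, my name is", "The future of AI is"]
    outputs = llm.generate(prompts, SamplingParams(max_tokens=8, temperature=0.0))
    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
    # Tear down distributed state, mirroring the tutorial's clean_up() helper.
    del llm
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()
```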

docs/source/user_guide/release_notes.md

Lines changed: 0 additions & 3 deletions
```diff
@@ -9,9 +9,6 @@ This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the [
 - Atlas 300I series is experimental supported in this release. [#1333](https://github.com/vllm-project/vllm-ascend/pull/1333) After careful consideration, this feature **will NOT be included in v0.9.1-dev branch** taking into account the v0.9.1 release quality and the feature rapid iteration to improve performance on Atlas 300I series. We will improve this from 0.9.2rc1 and later.
 - Support EAGLE-3 for speculative decoding. [#1032](https://github.com/vllm-project/vllm-ascend/pull/1032)
 
-### Model
-- MoGE model is now supported. You can try with Pangu Pro Moe-72B on Atlas A2 series and Atlas 300I series. Please follow the official [tutorials](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_moge.html) and [300I series tutorials](https://vllm-ascend.readthedocs.io/en/latest/tutorials/single_node_300i.html). [#1204](https://github.com/vllm-project/vllm-ascend/pull/1204)
-
 ### Core
 - Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250528`. Don’t forget to update it in your environment. [#1235](https://github.com/vllm-project/vllm-ascend/pull/1235)
 - Support Atlas 300I series container image. You can get it from [quay.io](https://quay.io/repository/vllm/vllm-ascend)
```
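As a side note on the torch_npu upgrade called out in the kept context lines, a hedged sketch for checking the version installed in your environment; the distribution name may be registered as `torch_npu` or `torch-npu` depending on how it was installed, so both are tried:

```python
# Hedged sketch: report the installed torch_npu version so it can be compared
# against the release note (e.g. 2.5.1.post1.dev20250528).
from importlib import metadata

for name in ("torch_npu", "torch-npu"):
    try:
        print(f"{name}: {metadata.version(name)}")
        break
    except metadata.PackageNotFoundError:
        continue
else:
    print("torch_npu not found in this environment")
```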
