# Multi-NPU (Pangu Pro MoE)

## Run vllm-ascend on Multi-NPU

Run the container:

```{code-block} bash
   :substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
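
Before moving on, you can optionally confirm that the four NPUs mapped in by the `--device` flags are visible inside the container. `npu-smi` is mounted by the command above; the exact output depends on your hardware:

```bash
# Inside the container: list the visible NPUs and their health status
npu-smi info
```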

Set up environment variables:

```bash
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

Download the model:

```bash
git lfs install
git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
```
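
The weights are distributed through Git LFS, so it is worth checking that the large files were actually pulled rather than left as pointer stubs. A quick sanity check (assuming the default clone directory name) is:

```bash
cd pangu-pro-moe-model
# An asterisk after the OID means the file content was downloaded; a dash means only a pointer is present
git lfs ls-files
# Weight shards should show multi-GB sizes here
ls -lh
```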

### Online Inference on Multi-NPU

Run the following command to start the vLLM server on Multi-NPU:

```bash
vllm serve /path/to/pangu-pro-moe-model \
--tensor-parallel-size 4 \
--trust-remote-code \
--enforce-eager
```
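
Loading the weights across four NPUs can take a few minutes. Before sending prompts, you can check that the server is ready via the OpenAI-compatible model listing endpoint that vLLM exposes alongside the completions API:

```bash
# Once the server is ready, this returns a JSON model list containing /path/to/pangu-pro-moe-model
curl http://localhost:8000/v1/models
```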

Once your server is started, you can query the model with input prompts:

```bash
export question="你是谁？"
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "[unused9]系统：[unused10][unused9]用户：'${question}'[unused10][unused9]助手：",
    "max_tokens": 64,
    "top_p": 0.95,
    "top_k": 50,
    "temperature": 0.6
  }'
```

If the request succeeds, you will see a response similar to the following:

```json
{"id":"cmpl-2cd4223228ab4be9a91f65b882e65b32","object":"text_completion","created":1751255067,"model":"/root/.cache/pangu-pro-moe-model","choices":[{"index":0,"text":" [unused16] 好的，用户问我是谁，我需要根据之前的设定来回答。用户提到我是华为开发的“盘古Reasoner”，属于盘古大模型系列，作为智能助手帮助解答问题和提供信息支持。现在用户再次询问，可能是在确认我的身份或者测试我的回答是否一致。\n\n首先，我要确保","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":15,"total_tokens":79,"completion_tokens":64,"prompt_tokens_details":null},"kv_transfer_params":null}
```
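
Assuming the model's tokenizer provides a chat template (the offline example below relies on one via `apply_chat_template`), you can also let the server build the prompt for you through the OpenAI-compatible chat endpoint instead of writing the `[unused9]`/`[unused10]` markers by hand. This is a sketch; the `model` value must match the path passed to `vllm serve`:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/pangu-pro-moe-model",
    "messages": [
      {"role": "system", "content": ""},
      {"role": "user", "content": "你是谁？"}
    ],
    "max_tokens": 64,
    "temperature": 0.6,
    "top_p": 0.95
  }'
```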

### Offline Inference on Multi-NPU

Run the following script to execute offline inference on multi-NPU:

```python
import gc
from transformers import AutoTokenizer
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


if __name__ == "__main__":

    tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
    tests = [
        "Hello, my name is",
        "The future of AI is",
    ]
    prompts = []
    for text in tests:
        messages = [
            {"role": "system", "content": ""},  # Optionally customize the system content
            {"role": "user", "content": text}
        ]
        # Using the model's official chat template is recommended
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        prompts.append(prompt)

    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

    llm = LLM(model="/path/to/pangu-pro-moe-model",
              tensor_parallel_size=4,
              distributed_executor_backend="mp",
              max_model_len=1024,
              trust_remote_code=True,
              enforce_eager=True)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```
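
Save the script inside the container (the filename below is just an example) and run it with Python; the allocator setting from the environment section applies to offline inference as well:

```bash
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
# Example filename; save the script above under this name
python offline_pangu_infer.py
```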

If the script runs successfully, you will see output similar to the following:

```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```