# Multi-NPU (QwQ 32B W8A8)

## Run docker container:
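
The exact `docker run` invocation depends on your environment; the sketch below is an assumption-laden example rather than a verified command. It assumes the `quay.io/ascend/vllm-ascend` image, NPU cards 0-3 (matching the `--tensor-parallel-size 4` used later), the standard Ascend driver mounts, and `/home/models` as the host directory holding the weights. Adjust the image tag, device IDs, and paths to your setup.

```bash
# Representative container launch; the image tag, device list and mount paths are assumptions
export IMAGE=quay.io/ascend/vllm-ascend:latest
docker run --rm \
    --name vllm-ascend \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /home/models:/home/models \
    -p 8000:8000 \
    -it $IMAGE bash
```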

## Install modelslim and convert model

:::{note}
You can convert the model yourself or use the quantized model we have uploaded; see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
:::
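
If you would rather skip the conversion, you can download the pre-quantized weights from the ModelScope repo referenced in the note above. A minimal sketch, assuming the `modelscope` command-line tool is installed and `/home/models` is your local model directory:

```bash
pip install modelscope
# Fetch the ready-made W8A8 weights into the path used by the serving command below
modelscope download --model vllm-ascend/QwQ-32B-W8A8 --local_dir /home/models/QwQ-32B-w8a8
```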

To convert the model yourself, clone modelslim and run the quantization script:

```bash
# (Optional) This tag is recommended and has been verified
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020

cd msit/msmodelslim
# Install by running this script
bash install.sh
pip install accelerate

cd example/Qwen
# Original weight path; replace with your local model path
MODEL_PATH=/home/models/QwQ-32B
# Path to save the converted weight; replace with your local path
SAVE_PATH=/home/models/QwQ-32B-w8a8

# An NPU device is not required for this step; you can also set --device_type cpu to run the conversion on CPU
python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --calib_file ../common/boolq.jsonl --w_bit 8 --a_bit 8 --device_type npu --anti_method m1 --trust_remote_code True
```

## Verify the quantized model
The converted model directory looks like this:
```bash
.
|-- config.json
|-- configuration.json
|-- generation_config.json
|-- quant_model_description.json
|-- quant_model_weight_w8a8.safetensors
|-- README.md
|-- tokenizer.json
`-- tokenizer_config.json
```
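
Before serving, you can spot-check the export by listing the directory and skimming the quantization description file. The path below assumes the `SAVE_PATH` used during conversion:

```bash
ls -1 /home/models/QwQ-32B-w8a8
# Pretty-print the per-layer quantization description (first 20 lines)
python3 -m json.tool /home/models/QwQ-32B-w8a8/quant_model_description.json | head -n 20
```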

Run the following command to start the vLLM server with the quantized model:
```bash
vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
```
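
To confirm the server is up and the model name is registered, query the OpenAI-compatible models endpoint (the default port 8000 is assumed):

```bash
curl http://localhost:8000/v1/models
```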

Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwq-32b-w8a8",
        "prompt": "what is a large language model?",
        "max_tokens": "128",
        "top_p": "0.95",
        "top_k": "40"
    }'
```
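
Since QwQ-32B is a chat/reasoning model, you may prefer the OpenAI-compatible chat endpoint, which applies the chat template for you. A sketch using the same served model name:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwq-32b-w8a8",
        "messages": [{"role": "user", "content": "What is a large language model?"}],
        "max_tokens": 128
    }'
```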