Commit 761bd3d

Add user guide for quantization (vllm-project#1206)
### What this PR does / why we need it?

Add user guide for quantization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Preview

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
1 parent 2c7dd85 commit 761bd3d

File tree

2 files changed: +107 −0 lines changed


docs/source/index.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -48,6 +48,7 @@ user_guide/supported_models
 user_guide/env_vars
 user_guide/additional_config
 user_guide/graph_mode.md
+user_guide/quantization.md
 user_guide/release_notes
 :::
```

docs/source/user_guide/quantization.md

Lines changed: 106 additions & 0 deletions

@@ -0,0 +1,106 @@
# Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of its weights and activation values, thereby saving memory and improving inference speed.

Since version 0.9.0rc2, quantization is experimentally supported in vLLM Ascend. Users can enable it by specifying `--quantization ascend`. Currently, only the Qwen and DeepSeek series models are well tested. We'll support more quantization algorithms and models in the future.
## Install modelslim

To quantize a model, users should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), the Ascend compression and acceleration tool. It is an affinity-based compression tool built on the Ascend platform, with compression as its core technology.

Currently, only the specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim works with vLLM Ascend. Please do not install any other version until the modelslim master branch becomes available for vLLM Ascend.
Install modelslim:

```bash
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
bash install.sh
pip install accelerate
```
## Quantize model

Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example: just download the model and then execute the conversion command shown below. More info can be found in the modelslim [DeepSeek W8A8 dynamic quantization doc](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).

```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
```
:::{note}
You can also download the quantized model that we uploaded, for example: https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8. Please note that these weights should be used for testing only.
:::
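
If you prefer the pre-quantized test weights, here is a minimal sketch for fetching them with the `modelscope` Python package; using this package is an assumption, and any download method works just as well:

```python
# A minimal sketch, assuming the `modelscope` package is installed
# (`pip install modelscope`). The model id is the test weight repo
# linked in the note above.
from modelscope import snapshot_download

# Downloads the quantized weights and returns the local directory path.
local_path = snapshot_download("vllm-ascend/DeepSeek-V2-Lite-W8A8")
print(local_path)
```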
34+
35+
Once convert action is done, there are two important files generated.
36+
37+
1. [confg.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1). Please make sure that there is no `quantization_config` field in it.
38+
39+
2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). All the converted weights info are recorded in this file.
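
As a quick sanity check on both files, here is a minimal sketch; the placeholder path and the exact keys inside `quant_model_description.json` are assumptions that depend on your conversion run and modelslim version:

```python
# A minimal sanity check, assuming {quantized_model_save_path} is the
# directory produced by the conversion step above.
import json
from pathlib import Path

model_dir = Path("{quantized_model_save_path}")  # placeholder path

# config.json must not carry a `quantization_config` field.
config = json.loads((model_dir / "config.json").read_text())
assert "quantization_config" not in config, "remove quantization_config"

# Peek at the first few recorded weight entries.
desc = json.loads((model_dir / "quant_model_description.json").read_text())
for name, info in list(desc.items())[:5]:
    print(name, info)
```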
Here is the full list of converted model files:

```bash
.
├── config.json
├── configuration_deepseek.py
├── configuration.json
├── generation_config.json
├── quant_model_description.json
├── quant_model_weight_w8a8_dynamic-00001-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00002-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00003-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00004-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic.safetensors.index.json
├── README.md
├── tokenization_deepseek_fast.py
├── tokenizer_config.json
└── tokenizer.json
```
## Run the model

Now you can run the quantized model with vLLM Ascend. Here are examples for offline and online inference.

### Offline inference

```python
import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Online inference

```bash
# Enable quantization by specifying `--quantization ascend`
vllm serve {quantized_model_save_path} --served-model-name "deepseek-v2-lite-w8a8" --max-model-len 2048 --quantization ascend --trust-remote-code
```
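
Once the server is up, you can query it through vLLM's OpenAI-compatible API. A minimal sketch, assuming the default port 8000 and the `openai` Python package:

```python
# A minimal sketch, assuming the server above runs on the default
# port 8000 and the `openai` package is installed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="deepseek-v2-lite-w8a8",  # matches --served-model-name
    prompt="The future of AI is",
    max_tokens=64,
    temperature=0.6,
)
print(completion.choices[0].text)
```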
## FAQs

### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?

First, make sure you specify the `ascend` quantization method. Second, check that your model was converted with the `modelslim-VLLM-8.1.RC1.b020_001` modelslim version. Finally, if it still doesn't work, please submit an issue; some new models may need to be adapted.

### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

Please convert DeepSeek series models using the `modelslim-VLLM-8.1.RC1.b020_001` modelslim version; it fixes the missing configuration_deepseek.py error.
