
Commit c464c32

22dimensions authored
add doc for offline quantization inference (#1009)
add example for offline inference with quantized model

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
1 parent 05a4710 commit c464c32

File tree: 1 file changed (+44 -1 lines)


docs/source/tutorials/multi_npu_quantization.md

Lines changed: 44 additions & 1 deletion
@@ -67,7 +67,7 @@ The converted model files looks like:
`-- tokenizer_config.json
```

-Run the following script to start the vLLM server with quantize model:
+Run the following script to start the vLLM server with quantized model:

:::{note}
The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released, you can cherry-pick this commit for now.
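The serve script itself falls outside this hunk. As a minimal sketch, assuming the same model path, tensor-parallel size, and max model length as the offline-inference example added below (the tutorial's actual script may differ), a launch command could look like:

```bash
# Sketch only: the model path and parallel settings here are assumptions
# borrowed from the offline example below, not taken from the tutorial's script.
vllm serve /home/models/QwQ-32B-w8a8 \
    --tensor-parallel-size 4 \
    --max-model-len 4096 \
    --quantization ascend
```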
@@ -89,3 +89,46 @@ curl http://localhost:8000/v1/completions \
    "temperature": "0.0"
}'
```

Run the following script to execute offline inference on multiple NPUs with the quantized model:

:::{note}
To enable quantization on Ascend, the quantization method must be set to "ascend".
:::

```python
import gc

import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    # Tear down model-parallel groups and the distributed environment,
    # then release any NPU memory still held by the process.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

# tensor_parallel_size=4 shards the quantized weights across 4 NPUs;
# quantization="ascend" enables the Ascend quantization path for this w8a8 checkpoint.
llm = LLM(model="/home/models/QwQ-32B-w8a8",
          tensor_parallel_size=4,
          distributed_executor_backend="mp",
          max_model_len=4096,
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
```
