v0.7.3
🎉 Hello, World!
We are excited to announce the release of 0.7.3 for vllm-ascend. This is the first official release. The functionality, performance, and stability of this release have been fully tested and verified. We encourage you to try it out and provide feedback. We'll post bug-fix versions in the future if needed. Please follow the official doc to start the journey.
Highlights
- This release includes all features landed in the previous release candidates (v0.7.1rc1, v0.7.3rc1, v0.7.3rc2). All of these features are fully tested and verified. Visit the official doc to get the detailed feature and model support matrix.
- Upgrade CANN to 8.1.RC1 to enable the chunked prefill and automatic prefix caching features. You can now turn them on; see the sketch after this list.
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu, so users no longer need to install torch-npu by hand; the 2.5.1 release of torch-npu is installed automatically. #662
- Integrate MindIE Turbo into vLLM Ascend to improve the performance of DeepSeek V3/R1 and the Qwen 2 series. #708
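To illustrate the newly enabled features, here is a minimal sketch using the standard vLLM Python API. The model name is only a placeholder, and the sketch assumes vLLM Ascend and its dependencies are already installed:

```python
# Minimal sketch: enabling chunked prefill and automatic prefix caching
# through the standard vLLM Python API. The model below is a placeholder;
# substitute any model supported by vLLM Ascend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder model
    enable_chunked_prefill=True,        # chunked prefill (requires CANN 8.1.RC1)
    enable_prefix_caching=True,         # automatic prefix caching
)

outputs = llm.generate(
    ["Explain what automatic prefix caching does."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```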
Core
- LoRA, Multi-LoRA, and Dynamic Serving are supported now; a minimal usage sketch follows below. Performance will be improved in the next release. Please follow the official doc for more usage information. Thanks to China Merchants Bank for the contribution. #700
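The following is a minimal multi-LoRA sketch using the vLLM Python API; the base model and adapter path are placeholders rather than values from this release:

```python
# Minimal sketch of multi-LoRA serving with the vLLM Python API.
# The base model name and adapter path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder base model
    enable_lora=True,                  # turn on LoRA support
    max_loras=2,                       # serve up to two adapters concurrently
)

# Each request can carry its own adapter; the engine swaps adapters dynamically.
outputs = llm.generate(
    ["Summarize the latest quarterly report."],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("finance-adapter", 1, "/path/to/finance_lora"),
)
print(outputs[0].outputs[0].text)
```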
Model
- The performance of Qwen2-VL and Qwen2.5-VL is improved. #702
- The performance of the `apply_penalties` and `topKtopP` ops is improved. #525
Other
- Fixed an issue that may lead to a CPU memory leak. #691 #712
- A new environment variable `SOC_VERSION` is added. If you hit any SoC detection error when building with custom ops enabled, please set `SOC_VERSION` to a suitable value. #606
- openEuler container image is supported with the v0.7.3-openeuler tag. #665
- The prefix cache feature works on the V1 engine now; see the sketch below. #559
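Here is a minimal sketch for trying prefix caching on the V1 engine. Selecting the V1 engine through the VLLM_USE_V1 environment variable follows upstream vLLM conventions and is an assumption here; the model name is a placeholder:

```python
# Minimal sketch: prefix caching on the V1 engine. Setting VLLM_USE_V1 is an
# assumption based on upstream vLLM conventions; the model is a placeholder.
import os
os.environ["VLLM_USE_V1"] = "1"  # opt in to the V1 engine before importing vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    enable_prefix_caching=True,        # reuse KV cache for shared prompt prefixes
)

# Prompts sharing a common prefix can reuse its KV cache across requests.
shared_prefix = "You are a helpful assistant. Answer concisely.\n\n"
prompts = [
    shared_prefix + "What is chunked prefill?",
    shared_prefix + "What is prefix caching?",
]

outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=48))
for out in outputs:
    print(out.outputs[0].text)
```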