Releases: vllm-project/vllm-ascend
v0.9.2rc1
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the official doc to get started. From this release, the V1 engine is enabled by default, so there is no need to set `VLLM_USE_V1=1` any more. This is also the last version to support the V0 engine; the V0 code will be cleaned up in a future release.
Highlights
- Pooling models work with the V1 engine now. You can give it a try with the Qwen3 embedding model. #1359
- The performance on Atlas 300I series has been improved. #1591
- aclgraph mode works with MoE models now. Currently, only Qwen3 MoE is well tested. #1381
Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250619`. Don't forget to update it in your environment. #1347
- The GatherV3 error has been fixed with aclgraph mode. #1416
- W8A8 quantization works on Atlas 300I series now. #1560
- Fixed an accuracy problem when deploying models with parallel parameters. #1678
- The pre-built wheel package now requires a lower glibc version. Users can install it directly with `pip install vllm-ascend`. #1582
Other
- The official doc has been updated for a better reading experience. For example, more deployment tutorials have been added and the user/developer docs have been refreshed. More guides are coming soon.
- Fixed an accuracy problem for DeepSeek V3/R1 models with the torchair graph in long-sequence predictions. #1331
- A new environment variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for DeepSeek V3/R1 models. The default value is `0`. #1335
- A new environment variable `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` has been added to improve the performance of top-k/top-p sampling. The default value is `0`; we'll consider enabling it by default in the future (see the sketch after this list). #1732
- A batch of bugs has been fixed for the Data Parallelism case. #1273 #1322 #1275 #1478
- The DeepSeek performance has been improved. #1194 #1395 #1380
- Ascend scheduler works with prefix cache now. #1446
- DeepSeek works with prefix cache now. #1498
- Support prompt logprobs in V1 to recover CEval accuracy. #1483
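A minimal sketch of how the two new environment variables can be switched on, assuming the offline `LLM` API; the model name and sampling values are only placeholders:

```python
import os

# Set the switches before vLLM is imported; both default to "0" (disabled).
os.environ["VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION"] = "1"  # faster top-k/top-p sampling
os.environ["VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP"] = "1"     # fused allgather-experts kernel (DeepSeek V3/R1)

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V3")  # placeholder model path
params = SamplingParams(temperature=0.8, top_p=0.95, top_k=50)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```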
Known Issues
New Contributors
- @xleoken made their first contribution in #1357
- @lyj-jjj made their first contribution in #1335
- @sharonyunyun made their first contribution in #1194
- @Pr0Wh1teGivee made their first contribution in #1308
- @leo-pony made their first contribution in #1374
- @zeshengzong made their first contribution in #1452
- @GDzhu01 made their first contribution in #1477
- @Agonixiaoxiao made their first contribution in #1531
- @zhanghw0354 made their first contribution in #1476
- @farawayboat made their first contribution in #1591
- @ZhengWG made their first contribution in #1196
- @wm901115nwpu made their first contribution in #1654
Full Changelog: v0.9.1rc1...v0.9.2rc1
v0.9.1rc1
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.
Experimental
- Atlas 300I series is experimentally supported in this release (functional test passed with Qwen2.5-7b-instruct/Qwen2.5-0.5b/Qwen3-0.6B/Qwen3-4B/Qwen3-8B). #1333
- Support EAGLE-3 for speculative decoding. #1032
After careful consideration, the above features will NOT be included in the v0.9.1-dev branch (v0.9.1 final release), taking into account the v0.9.1 release quality and the rapid iteration of these features. We will improve this from v0.9.2rc1 onward.
Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250528`. Don't forget to update it in your environment. #1235
- Support Atlas 300I series container image. You can get it from quay.io
- Fix token-wise padding mechanism to make multi-card graph mode work. #1300
- Upgrade vLLM to 0.9.1 #1165
Other Improvements
- Initial support for Chunked Prefill with MLA. #1172
- An example of best practices to run DeepSeek with ETP has been added. #1101
- Performance improvements for DeepSeek using the TorchAir graph. #1098, #1131
- Supports the speculative decoding feature with AscendScheduler. #943
- Improve `VocabParallelEmbedding` custom op performance. It will be enabled in the next release. #796
- Fixed a device discovery and setup bug when running vLLM Ascend on Ray. #884
- DeepSeek with MC2 (Merged Compute and Communication) now works properly. #1268
- Fixed log2phy NoneType bug with static EPLB feature. #1186
- Improved performance for DeepSeek with DBO enabled. #997, #1135
- Refactored AscendFusedMoE. #1229
- Add initial user stories page (including LLaMA-Factory/TRL/verl/MindIE Turbo/GPUStack). #1224
- Add unit test framework #1201
Known Issues
- In some cases, the vLLM process may crash with a GatherV3 error when aclgraph is enabled. We are working on this issue and will fix it in the next release. #1038
- The prefix cache feature does not work when the Ascend Scheduler is enabled but chunked prefill is not. This will be fixed in the next release. #1350
New Contributors
- @farawayboat made their first contribution in #1333
- @yzim made their first contribution in #1159
- @chenwaner made their first contribution in #1098
- @wangyanhui-cmss made their first contribution in #1184
- @songshanhu07 made their first contribution in #1186
- @yuancaoyaoHW made their first contribution in #1032
Full Changelog: v0.9.0rc2...v0.9.1rc1
v0.9.0rc2
This is the 2nd official release candidate of v0.9.0 for vllm-ascend. Please follow the official doc to start the journey. From this release, the V1 Engine is recommended. The V0 Engine code is frozen and will no longer be maintained. Please set the environment variable `VLLM_USE_V1=1` to enable the V1 Engine.
Highlights
- DeepSeek works with graph mode now. Follow the official doc to give it a try. #789
- Qwen series models work with graph mode now. Graph mode is enabled by default with the V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and more general in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model (see the example below).
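A minimal sketch of the eager-mode fallback, assuming the offline `LLM` API; the model name is only a placeholder:

```python
from vllm import LLM

# enforce_eager=True disables graph (aclgraph) mode and runs the model eagerly.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)  # placeholder model
print(llm.generate(["Why is the sky blue?"])[0].outputs[0].text)
```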
Core
- The performance of multi-step scheduler has been improved. Thanks for the contribution from China Merchants Bank. #814
- LoRA, Multi-LoRA and Dynamic Serving are supported for the V1 Engine now. Thanks for the contribution from China Merchants Bank. #893
- Prefix cache and chunked prefill features work now. #782 #844
- Spec decode and MTP features work with V1 Engine now. #874 #890
- DP feature works with DeepSeek now. #1012
- Input embedding feature works with V0 Engine now. #916
- Sleep mode feature works with V1 Engine now. #1084
Model
- Qwen2.5 VL works with V1 Engine now. #736
- Llama 4 works now. #740
- A new dual-batch overlap (DBO) mode for DeepSeek has been added. Please set `VLLM_ASCEND_ENABLE_DBO=1` to use it (see the example below). #941
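A minimal sketch of enabling DBO, assuming the offline `LLM` API; the model name and parallel size are only placeholders:

```python
import os

# Enable dual-batch overlap for DeepSeek before vLLM is imported; "0" (off) is the default.
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"

from vllm import LLM

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", tensor_parallel_size=2)  # placeholder model/parallelism
```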
Other
- Online serving with Ascend quantization works now. #877
- A batch of bugs for graph mode and MoE models have been fixed. #773 #771 #774 #816 #817 #819 #912 #897 #961 #958 #913 #905
- A batch of performance improvement PRs have been merged. #784 #803 #966 #839 #970 #947 #987 #1085
- From this release, a binary wheel package is released as well. #775
- The contributor doc site has been added.
Known Issues
- In some cases, the vLLM process may crash when aclgraph is enabled. We're working on this issue and it'll be fixed in the next release. #1038
- Multi-node data parallel doesn't work with this release. This is a known issue in vLLM and has been fixed on the main branch. #18981
New Contributors
- @chris668899 made their first contribution in #771
- @NeverRaR made their first contribution in #789
- @cxcxflying made their first contribution in #740
- @22dimensions made their first contribution in #835
- @wonderful199082 made their first contribution in #814
- @yangpuPKU made their first contribution in #937
- @ttanzhiqiang made their first contribution in #909
- @ponix-j made their first contribution in #874
- @XWFAlone made their first contribution in #890
- @NINGBENZHE made their first contribution in #896
- @momo609 made their first contribution in #970
- @David9857 made their first contribution in #947
- @depeng1994 made their first contribution in #1013
- @hahazhky made their first contribution in #987
- @weijinqian0 made their first contribution in #1067
- @sdmyzlp made their first contribution in #1091
- @zxdukki made their first contribution in #941
- @ChenTaoyu-SJTU made their first contribution in #736
- @Yuxiao-Xu made their first contribution in #1116
v0.9.0rc1
Just a pre-release for 0.9.0. There are still some known bugs in this release.
v0.7.3.post1
This is the first post release of 0.7.3. Please follow the official doc to start the journey. It includes the following changes:
Highlights
- Qwen3 and Qwen3MoE are supported now. The performance and accuracy of Qwen3 are well tested. You can try it now. MindIE Turbo is recommended to improve the performance of Qwen3. #903 #915
- Added a new performance guide. The guide aims to help users improve vllm-ascend performance at the system level. It includes OS configuration, library optimization, a deployment guide, and so on. #878 Doc Link
Bug Fix
- Qwen2.5-VL works for RLHF scenarios now. #928
- Users can now launch models from online weights, e.g. directly from Hugging Face or ModelScope. #858 #918
- The meaningless log info `UserWorkspaceSize0` has been cleaned up. #911
- The log level for `Failed to import vllm_ascend_C` has been changed to `warning` instead of `error`. #956
- DeepSeek MLA now works with chunked prefill in the V1 Engine. Please note that the V1 engine in 0.7.3 is just experimental and only for test usage. #849 #936
Docs
v0.7.3
🎉 Hello, World!
We are excited to announce the release of 0.7.3 for vllm-ascend. This is the first official release. The functionality, performance, and stability of this release are fully tested and verified. We encourage you to try it out and provide feedback. We'll post bug fix versions in the future if needed. Please follow the official doc to start the journey.
Highlights
- This release includes all features landed in the previous release candidates (v0.7.1rc1, v0.7.3rc1, v0.7.3rc2), and all the features are fully tested and verified. Visit the official doc to get the detailed feature and model support matrix.
- Upgrade CANN to 8.1.RC1 to enable the chunked prefill and automatic prefix caching features. You can enable them now.
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu. Users don't need to install torch-npu by hand any more; the 2.5.1 version of torch-npu will be installed automatically. #662
- Integrate MindIE Turbo into vLLM Ascend to improve DeepSeek V3/R1, Qwen 2 series performance. #708
Core
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #700
Model
- The performance of Qwen2-VL and Qwen2.5-VL is improved. #702
- The performance of the `apply_penalties` and `topKtopP` ops is improved. #525
Other
- Fixed an issue that may lead to a CPU memory leak. #691 #712
- A new environment variable `SOC_VERSION` is added. If you hit any SoC detection error when building with custom ops enabled, please set `SOC_VERSION` to a suitable value. #606
- openEuler container image is supported with the v0.7.3-openeuler tag. #665
- Prefix cache feature works on V1 engine now. #559
v0.8.5rc1
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the official doc to start the journey.
Experimental: You can now enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend here.
Highlights
- Upgrade CANN version to 8.1.RC1 to support chunked prefill and automatic prefix caching (`--enable_prefix_caching`) when V1 is enabled. #747
- Optimize Qwen2 VL and Qwen 2.5 VL. #701
- Improve DeepSeek V3 eager mode and graph mode performance; now you can use `--additional_config={'enable_graph_mode': True}` to enable graph mode (see the sketch below). #598 #731
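A minimal sketch of turning on graph mode from the offline `LLM` API, assuming it exposes the same `additional_config` knob as the `--additional_config` CLI flag above; the model name is a placeholder:

```python
from vllm import LLM

# additional_config is assumed to mirror the --additional_config CLI flag.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # placeholder model path
    additional_config={"enable_graph_mode": True},
)
```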
Core
- Upgrade vLLM to 0.8.5.post1 #715
- Fix early return in CustomDeepseekV2MoE.forward during profile_run #682
- Adapt to the new quantized models generated by modelslim. #719
- Initial support for P2P Disaggregated Prefill based on llm_datadist. #694
- Use `/vllm-workspace` as the code path and include `.git` in the container image to fix an issue when starting vllm under `/workspace`. #726
- Optimize NPU memory usage to make the DeepSeek R1 W8A8 32K model length work. #728
- Fix `PYTHON_INCLUDE_PATH` typo in setup.py. #762
Other
Known Issues
- If you run DeepSeek with `VLLM_USE_V1=1` enabled, you will encounter `call aclnnInplaceCopy failed`. Please refer to #778 for the fix.
v0.8.4rc2
This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We'll make them stable enough in the next release.
Highlights
- Qwen3 and Qwen3MoE are supported now. Please follow the official doc to run the quick demo. #709
- The Ascend W8A8 quantization method is supported now. Please see the official doc for an example. Any feedback is welcome. #580
- DeepSeek V3/R1 works with DP, TP and MTP now. Please note that it's still in experimental status. Let us know if you hit any problem. #429 #585 #626 #636 #671
Core
- The torch.compile feature is supported with the V1 engine now. It's disabled by default because this feature relies on the CANN 8.1 release. We'll make it available by default in the next release. #426
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu. Users don't need to install torch-npu by hand any more; the 2.5.1 version of torch-npu will be installed automatically. #661
Other
- MiniCPM model works now. #645
- openEuler container image is supported with the `v0.8.4-openeuler` tag, and custom ops build is enabled by default for openEuler OS. #689
- Fix a ModuleNotFoundError bug to make LoRA work. #600
- Add "Using EvalScope evaluation" doc. #611
- Add a `VLLM_VERSION` environment variable to make the vLLM version configurable, which helps developers set the correct vLLM version if the vLLM code has been changed by hand locally. #651
v0.8.4rc1
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. From this version, vllm-ascend will follow the newest version of vLLM and release every two weeks. For example, if vLLM releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the details in the official documentation.
Highlights
- vLLM V1 engine experimental support is included in this version. You can visit the official guide to get more details. By default, vLLM will fall back to V0 if V1 doesn't work; please set the `VLLM_USE_V1=1` environment variable if you want to force V1 (see the example after this list).
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #521
- The Sleep Mode feature is supported. Currently it only works on the V0 engine. V1 engine support will come soon. #513
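A minimal sketch of forcing the V1 engine, assuming the offline `LLM` API; the model name is a placeholder:

```python
import os

# Force the V1 engine before vLLM is imported; without this, vLLM may fall back to V0.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
```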
Core
- The Ascend scheduler is added for the V1 engine. This scheduler has better affinity with Ascend hardware. More scheduler policies will be added in the future. #543
- The Disaggregated Prefill feature is supported. Currently only 1P1D works; NPND is under design by the vLLM team, and vllm-ascend will support it once it's ready in vLLM. Follow the official guide to use it. #432
- The spec decode feature works now. Currently it only works on the V0 engine. V1 engine support will come soon. #500
- The structured output feature works now on the V1 Engine. Currently it only supports the xgrammar backend, while the guidance backend may produce some errors. #555
Other
- A new communicator `pyhccl` is added. It is used to call the CANN HCCL library directly instead of using `torch.distributed`. More usage of it will be added in the next release. #503
- The custom ops build is enabled by default. You should install packages like `gcc` and `cmake` first to build `vllm-ascend` from source. Set the `COMPILE_CUSTOM_KERNELS=0` environment variable to disable the compilation if you don't need it. #466
- The custom op `rotary_embedding` is enabled by default now to improve performance. #555
v0.7.3rc2
This is the 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.
- Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html
- Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html
Highlights
- Add the Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, `rotary_embedding`, is added. More tutorials will come soon. Custom ops compilation is disabled by default when installing vllm-ascend; set `COMPILE_CUSTOM_KERNELS=1` to enable it. #371
- The V1 engine is basically supported in this release. Full support will be done in the 0.8.X releases. If you hit any issue or have any requirement for the V1 engine, please tell us here. #376
- The prefix cache feature works now. You can set `enable_prefix_caching=True` to enable it (see the example after this list). #282
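A minimal sketch of turning on prefix caching, assuming the offline `LLM` API; the model name is a placeholder:

```python
from vllm import LLM

# Opt in to automatic prefix caching for this engine instance.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)  # placeholder model
```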
Core
- Bump the torch_npu version to dev20250320.3 to improve accuracy and fix the `!!!` output problem. #406
Model
- The performance of Qwen2-vl is improved by optimizing patch embedding (Conv3D). #398