Releases: vllm-project/vllm-ascend
v0.9.2rc1
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the official doc to get started. From this release, the V1 engine is enabled by default, so there is no need to set `VLLM_USE_V1=1` any more. This is also the last version to support the V0 engine; the V0 code will be cleaned up in a future release.
Highlights
- Pooling models work with the V1 engine now. You can give it a try with the Qwen3 embedding model. #1359
- The performance on Atlas 300I series has been improved. #1591
- aclgraph mode works with MoE models now. Currently, only Qwen3 MoE is well tested. #1381
Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250619`. Don't forget to update it in your environment. #1347
- The GatherV3 error has been fixed with aclgraph mode. #1416
- W8A8 quantization works on Atlas 300I series now. #1560
- Fixed an accuracy problem when deploying models with parallel parameters. #1678
- The pre-built wheel package now requires a lower glibc version. Users can install it directly with `pip install vllm-ascend`. #1582
Other
- The official doc has been updated for a better reading experience. For example, more deployment tutorials have been added and the user/developer docs have been refreshed. More guides are coming soon.
- Fixed an accuracy problem for DeepSeek V3/R1 models with the torchair graph in long-sequence predictions. #1331
- A new environment variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for DeepSeek V3/R1 models. The default value is `0`. #1335
- A new environment variable `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` has been added to improve the performance of top-k/top-p sampling. The default value is `0`; we'll consider enabling it by default in the future (see the sketch after this list). #1732
- A batch of bugs has been fixed for the Data Parallelism case. #1273 #1322 #1275 #1478
- The DeepSeek performance has been improved. #1194 #1395 #1380
- Ascend scheduler works with prefix cache now. #1446
- DeepSeek works with prefix cache now. #1498
- Support prompt logprobs in V1 to recover CEval accuracy. #1483
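A minimal sketch of how the two new environment variables can be switched on, assuming the offline `LLM` API; the model name and sampling values are only placeholders:

```python
import os

# Set the switches before vLLM is imported; both default to "0" (disabled).
os.environ["VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION"] = "1"  # faster top-k/top-p sampling
os.environ["VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP"] = "1"     # fused allgather-experts kernel (DeepSeek V3/R1)

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V3")  # placeholder model path
params = SamplingParams(temperature=0.8, top_p=0.95, top_k=50)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```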
Known Issues
New Contributors
- @xleoken made their first contribution in #1357
- @lyj-jjj made their first contribution in #1335
- @sharonyunyun made their first contribution in #1194
- @Pr0Wh1teGivee made their first contribution in #1308
- @leo-pony made their first contribution in #1374
- @zeshengzong made their first contribution in #1452
- @GDzhu01 made their first contribution in #1477
- @Agonixiaoxiao made their first contribution in #1531
- @zhanghw0354 made their first contribution in #1476
- @farawayboat made their first contribution in #1591
- @ZhengWG made their first contribution in #1196
- @wm901115nwpu made their first contribution in #1654
Full Changelog: v0.9.1rc1...v0.9.2rc1
v0.9.1rc1
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.
Experimental
- Atlas 300I series is experimentally supported in this release (functional test passed with Qwen2.5-7b-instruct/Qwen2.5-0.5b/Qwen3-0.6B/Qwen3-4B/Qwen3-8B). #1333
- Support EAGLE-3 for speculative decoding. #1032
After careful consideration, the above features will NOT be included in the v0.9.1-dev branch (v0.9.1 final release), taking into account the v0.9.1 release quality and the rapid iteration of these features. We will improve this from v0.9.2rc1 onward.
Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.5.1.post1.dev20250528`. Don't forget to update it in your environment. #1235
- Support Atlas 300I series container image. You can get it from quay.io
- Fix token-wise padding mechanism to make multi-card graph mode work. #1300
- Upgrade vLLM to 0.9.1 #1165
Other Improvements
- Initial support for Chunked Prefill with MLA. #1172
- An example of best practices to run DeepSeek with ETP has been added. #1101
- Performance improvements for DeepSeek using the TorchAir graph. #1098, #1131
- Supports the speculative decoding feature with AscendScheduler. #943
- Improve `VocabParallelEmbedding` custom op performance. It will be enabled in the next release. #796
- Fixed a device discovery and setup bug when running vLLM Ascend on Ray. #884
- DeepSeek with MC2 (Merged Compute and Communication) now works properly. #1268
- Fixed log2phy NoneType bug with static EPLB feature. #1186
- Improved performance for DeepSeek with DBO enabled. #997, #1135
- Refactored AscendFusedMoE. #1229
- Add initial user stories page (including LLaMA-Factory/TRL/verl/MindIE Turbo/GPUStack). #1224
- Add unit test framework #1201
Known Issues
- In some cases, the vLLM process may crash with a GatherV3 error when aclgraph is enabled. We are working on this issue and will fix it in the next release. #1038
- The prefix cache feature does not work when the Ascend Scheduler is enabled but chunked prefill is not. This will be fixed in the next release. #1350
New Contributors
- @farawayboat made their first contribution in #1333
- @yzim made their first contribution in #1159
- @chenwaner made their first contribution in #1098
- @wangyanhui-cmss made their first contribution in #1184
- @songshanhu07 made their first contribution in #1186
- @yuancaoyaoHW made their first contribution in #1032
Full Changelog: v0.9.0rc2...v0.9.1rc1
v0.9.0rc2
This is the 2nd official release candidate of v0.9.0 for vllm-ascend. Please follow the official doc to start the journey. From this release, the V1 Engine is recommended. The V0 Engine code is frozen and will no longer be maintained. Please set the environment variable `VLLM_USE_V1=1` to enable the V1 Engine.
Highlights
- DeepSeek works with graph mode now. Follow the official doc to give it a try. #789
- Qwen series models work with graph mode now. Graph mode is enabled by default with the V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and more general in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model (see the example below).
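A minimal sketch of the eager-mode fallback, assuming the offline `LLM` API; the model name is only a placeholder:

```python
from vllm import LLM

# enforce_eager=True disables graph (aclgraph) mode and runs the model eagerly.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)  # placeholder model
print(llm.generate(["Why is the sky blue?"])[0].outputs[0].text)
```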
Core
- The performance of multi-step scheduler has been improved. Thanks for the contribution from China Merchants Bank. #814
- LoRA, Multi-LoRA and Dynamic Serving are supported for the V1 Engine now. Thanks for the contribution from China Merchants Bank. #893
- Prefix cache and chunked prefill features work now. #782 #844
- Spec decode and MTP features work with V1 Engine now. #874 #890
- DP feature works with DeepSeek now. #1012
- Input embedding feature works with V0 Engine now. #916
- Sleep mode feature works with V1 Engine now. #1084
Model
- Qwen2.5 VL works with V1 Engine now. #736
- Llama 4 works now. #740
- A new dual-batch overlap (DBO) mode for DeepSeek has been added. Please set `VLLM_ASCEND_ENABLE_DBO=1` to use it (see the example below). #941
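A minimal sketch of enabling DBO, assuming the offline `LLM` API; the model name and parallel size are only placeholders:

```python
import os

# Enable dual-batch overlap for DeepSeek before vLLM is imported; "0" (off) is the default.
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"

from vllm import LLM

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", tensor_parallel_size=2)  # placeholder model/parallelism
```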
Other
- Online serving with Ascend quantization works now. #877
- A batch of bugs for graph mode and MoE models have been fixed. #773 #771 #774 #816 #817 #819 #912 #897 #961 #958 #913 #905
- A batch of performance improvement PRs have been merged. #784 #803 #966 #839 #970 #947 #987 #1085
- From this release, a binary wheel package is released as well. #775
- The contributor doc site has been added.
Known Issues
- In some cases, the vLLM process may crash when aclgraph is enabled. We're working on this issue and it'll be fixed in the next release. #1038
- Multi-node data parallel doesn't work with this release. This is a known issue in vLLM and has been fixed on the main branch. #18981
New Contributors
- @chris668899 made their first contribution in #771
- @NeverRaR made their first contribution in #789
- @cxcxflying made their first contribution in #740
- @22dimensions made their first contribution in #835
- @wonderful199082 made their first contribution in #814
- @yangpuPKU made their first contribution in #937
- @ttanzhiqiang made their first contribution in #909
- @ponix-j made their first contribution in #874
- @XWFAlone made their first contribution in #890
- @NINGBENZHE made their first contribution in #896
- @momo609 made their first contribution in #970
- @David9857 made their first contribution in #947
- @depeng1994 made their first contribution in #1013
- @hahazhky made their first contribution in #987
- @weijinqian0 made their first contribution in #1067
- @sdmyzlp made their first contribution in #1091
- @zxdukki made their first contribution in #941
- @ChenTaoyu-SJTU made their first contribution in #736
- @Yuxiao-Xu made their first contribution in #1116
v0.9.0rc1
Just a pre-release for 0.9.0. There are still some known bugs in this release.
v0.7.3.post1
This is the first post release of 0.7.3. Please follow the official doc to start the journey. It includes the following changes:
Highlights
- Qwen3 and Qwen3MoE are supported now. The performance and accuracy of Qwen3 are well tested. You can try it now. MindIE Turbo is recommended to improve the performance of Qwen3. #903 #915
- Added a new performance guide. The guide aims to help users improve vllm-ascend performance at the system level. It includes OS configuration, library optimization, a deployment guide, and so on. #878 Doc Link
Bug Fix
- Qwen2.5-VL works for RLHF scenarios now. #928
- Users can now launch models from online weights, e.g. directly from Hugging Face or ModelScope. #858 #918
- The meaningless log info `UserWorkspaceSize0` has been cleaned up. #911
- The log level for `Failed to import vllm_ascend_C` has been changed to `warning` instead of `error`. #956
- DeepSeek MLA now works with chunked prefill in the V1 Engine. Please note that the V1 engine in 0.7.3 is just experimental and only for test usage. #849 #936
Docs
v0.7.3
🎉 Hello, World!
We are excited to announce the release of 0.7.3 for vllm-ascend. This is the first official release. The functionality, performance, and stability of this release are fully tested and verified. We encourage you to try it out and provide feedback. We'll post bug fix versions in the future if needed. Please follow the official doc to start the journey.
Highlights
- This release includes all features landed in the previous release candidates (v0.7.1rc1, v0.7.3rc1, v0.7.3rc2), and all the features are fully tested and verified. Visit the official doc to get the detailed feature and model support matrix.
- Upgrade CANN to 8.1.RC1 to enable the chunked prefill and automatic prefix caching features. You can enable them now.
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu. Users don't need to install torch-npu by hand any more; the 2.5.1 version of torch-npu will be installed automatically. #662
- Integrate MindIE Turbo into vLLM Ascend to improve DeepSeek V3/R1, Qwen 2 series performance. #708
Core
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #700
Model
- The performance of Qwen2-VL and Qwen2.5-VL is improved. #702
- The performance of the `apply_penalties` and `topKtopP` ops is improved. #525
Other
- Fixed an issue that may lead to a CPU memory leak. #691 #712
- A new environment variable `SOC_VERSION` is added. If you hit any SoC detection error when building with custom ops enabled, please set `SOC_VERSION` to a suitable value. #606
- openEuler container image is supported with the v0.7.3-openeuler tag. #665
- Prefix cache feature works on V1 engine now. #559
v0.8.5rc1
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the official doc to start the journey.
Experimental: You can now enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend here.
Highlights
- Upgrade CANN version to 8.1.RC1 to support chunked prefill and automatic prefix caching (`--enable_prefix_caching`) when V1 is enabled. #747
- Optimize Qwen2 VL and Qwen 2.5 VL. #701
- Improve DeepSeek V3 eager mode and graph mode performance; now you can use `--additional_config={'enable_graph_mode': True}` to enable graph mode (see the sketch below). #598 #731
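A minimal sketch of turning on graph mode from the offline `LLM` API, assuming it exposes the same `additional_config` knob as the `--additional_config` CLI flag above; the model name is a placeholder:

```python
from vllm import LLM

# additional_config is assumed to mirror the --additional_config CLI flag.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # placeholder model path
    additional_config={"enable_graph_mode": True},
)
```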
Core
- Upgrade vLLM to 0.8.5.post1 #715
- Fix early return in CustomDeepseekV2MoE.forward during profile_run #682
- Adapt to the new quantized models generated by modelslim. #719
- Initial support for P2P Disaggregated Prefill based on llm_datadist. #694
- Use `/vllm-workspace` as the code path and include `.git` in the container image to fix an issue when starting vllm under `/workspace`. #726
- Optimize NPU memory usage to make the DeepSeek R1 W8A8 32K model length work. #728
- Fix `PYTHON_INCLUDE_PATH` typo in setup.py. #762
Other
Known Issues
- If you run DeepSeek with `VLLM_USE_V1=1` enabled, you will encounter `call aclnnInplaceCopy failed`. Please refer to #778 for the fix.
v0.8.4rc2
This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We'll make them stable enough in the next release.
Highlights
- Qwen3 and Qwen3MoE are supported now. Please follow the official doc to run the quick demo. #709
- The Ascend W8A8 quantization method is supported now. Please see the official doc for an example. Any feedback is welcome. #580
- DeepSeek V3/R1 works with DP, TP and MTP now. Please note that it's still in experimental status. Let us know if you hit any problem. #429 #585 #626 #636 #671
Core
- The torch.compile feature is supported with the V1 engine now. It's disabled by default because this feature relies on the CANN 8.1 release. We'll make it available by default in the next release. #426
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu. Users don't need to install torch-npu by hand any more; the 2.5.1 version of torch-npu will be installed automatically. #661
Other
- MiniCPM model works now. #645
- openEuler container image is supported with the `v0.8.4-openeuler` tag, and custom ops build is enabled by default for openEuler OS. #689
- Fix a ModuleNotFoundError bug to make LoRA work. #600
- Add "Using EvalScope evaluation" doc. #611
- Add a `VLLM_VERSION` environment variable to make the vLLM version configurable, which helps developers set the correct vLLM version if the vLLM code has been changed by hand locally. #651
v0.8.4rc1
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. From this version, vllm-ascend will follow the newest version of vLLM and release every two weeks. For example, if vLLM releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the details in the official documentation.
Highlights
- vLLM V1 engine experimental support is included in this version. You can visit the official guide to get more details. By default, vLLM will fall back to V0 if V1 doesn't work; please set the `VLLM_USE_V1=1` environment variable if you want to force V1 (see the example after this list).
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #521
- The Sleep Mode feature is supported. Currently it only works on the V0 engine. V1 engine support will come soon. #513
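A minimal sketch of forcing the V1 engine, assuming the offline `LLM` API; the model name is a placeholder:

```python
import os

# Force the V1 engine before vLLM is imported; without this, vLLM may fall back to V0.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
```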
Core
- The Ascend scheduler is added for the V1 engine. This scheduler has better affinity with Ascend hardware. More scheduler policies will be added in the future. #543
- The Disaggregated Prefill feature is supported. Currently only 1P1D works; NPND is under design by the vLLM team, and vllm-ascend will support it once it's ready in vLLM. Follow the official guide to use it. #432
- The spec decode feature works now. Currently it only works on the V0 engine. V1 engine support will come soon. #500
- The structured output feature works now on the V1 Engine. Currently it only supports the xgrammar backend, while the guidance backend may produce some errors. #555
Other
- A new communicator `pyhccl` is added. It is used to call the CANN HCCL library directly instead of using `torch.distributed`. More usage of it will be added in the next release. #503
- The custom ops build is enabled by default. You should install packages like `gcc` and `cmake` first to build `vllm-ascend` from source. Set the `COMPILE_CUSTOM_KERNELS=0` environment variable to disable the compilation if you don't need it. #466
- The custom op `rotary_embedding` is enabled by default now to improve performance. #555
v0.7.3rc2
This is the 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.
- Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html
- Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html
Highlights
- Add the Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, `rotary_embedding`, is added. More tutorials will come soon. Custom ops compilation is disabled by default when installing vllm-ascend; set `COMPILE_CUSTOM_KERNELS=1` to enable it. #371
- The V1 engine is basically supported in this release. Full support will be done in the 0.8.X releases. If you hit any issue or have any requirement for the V1 engine, please tell us here. #376
- The prefix cache feature works now. You can set `enable_prefix_caching=True` to enable it (see the example after this list). #282
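A minimal sketch of turning on prefix caching, assuming the offline `LLM` API; the model name is a placeholder:

```python
from vllm import LLM

# Opt in to automatic prefix caching for this engine instance.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)  # placeholder model
```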
Core
- Bump the torch_npu version to dev20250320.3 to improve accuracy and fix the `!!!` output problem. #406
Model
- The performance of Qwen2-vl is improved by optimizing patch embedding (Conv3D). #398