vLLM Ascend Roadmap Q2 2025 #448

Open · 7 of 38 tasks
Yikun opened this issue Mar 31, 2025 · 5 comments
Comments

Yikun (Collaborator) commented Mar 31, 2025

This is a living document!


Our vision is to enable vLLM to run seamlessly on Ascend NPU, and we are fully committed to making vLLM one of the best engines for Ascend NPU. In Q1 2025, we provided initial support for vLLM on Ascend NPU.

In Q2 2025, we will focus on four themes: vLLM Ascend for Production, Performance Optimization, Key Features, and Ecosystem Connect.

1. Performance Optimization

We will focus on performance optimization for dense models (Qwen/Llama/Qwen-VL) and MoE models (DeepSeek V3/R1), so that users of vLLM Ascend can trust its performance to be competitive on Ascend NPU.

2. vLLM Ascend for Production

In line with vLLM, vLLM Ascend is designed for production. The first official release based on vLLM 0.7.3 will be published, and we will also actively drive the key features of vLLM v0.8.x/v1 toward production availability.

3. Key Features

We will focus on the integration and support of key lifecycle workflows for model training (SFT / RL) and inference (single node / distributed).

3.1 Workflows

Cluster Scale Serving

Core feature support

RLHF

3.2 Models support

  • (P0) Quantization support: w8a8 (DeepSeek R1 with 2 nodes): [quantization] Support w8a8 quantization #580
  • (P1) Support for upcoming new models: Qwen3 / DeepSeek-R2 / Llama 4 / DeepSeek DRM series
  • (P1) Qwen-Omni thinker
  • (P1) Model format support: gguf
  • (Help wanted) Quantization support: w4a16/w4a8 (DeepSeek R1 with 1 node)
  • (Help wanted) Whisper
  • (Help wanted) enc-dec
  • (Help wanted) Gemma

3.3 User / Developer Experience

Distributions

Docs

Dashboard

  • Perf Dashboard
  • Accuracy Dashboard

3.4 Hardware support

4. Ecosystem Connect

Seamless integration of key lifecycle components with vLLM Ascend is essential, so we are also actively connecting with the broader ecosystem.


If an item you want is not on the roadmap, your suggestions and contributions are strongly welcomed! Please feel free to comment in this thread, open a feature request, or create an RFC.

Historical Roadmap: #71

AlphaINF commented:

Hello, will Huawei's Ascend Atlas 300I Duo devices be supported in Q2? And can the following features be adapted at the same time?

  1. LoRA
  2. tensor-parallel

Yikun (Collaborator, Author) commented Mar 31, 2025

@AlphaINF We don't have a concrete plan for it yet, but we welcome contributions; feel free to open a PR or an issue/RFC.

Based on the experience of supporting it in another engine [1], getting initial support working is not difficult, but meeting the performance requirements will take more effort and bandwidth.

[1] ggml-org/llama.cpp#10216

AlphaINF commented Apr 1, 2025

@Yikun Thanks!

shen-shanshan (Collaborator) commented:

Complement to Key Features: [P1] Structured Output on V1: #177

Yikun mentioned this issue Apr 9, 2025
wangxiyuan pushed a commit that referenced this issue Apr 17, 2025
### What this PR does / why we need it?
According to the RFC [[RFC]: Join the MultiLora and MultiLora Dynamic
Serving feature develop #396](#396) and the
[vLLM Ascend Roadmap Q2 2025 #448](#448), this PR adds the relevant
code to support (1) Multi-LoRA and (2) Multi-LoRA Dynamic Serving.

LoRA reference: [LoRA reference](https://docs.vllm.ai/en/latest/features/lora.html)

### Does this PR introduce _any_ user-facing change?

The following OpenAI-compatible HTTP APIs will be supported:
/v1/load_lora_adapter
/v1/unload_lora_adapter

### How was this patch tested?
git clone https://github.com/vllm-project/vllm.git
cd vllm/examples/offline_inference/ && python3 multilora_inference.py

---------

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
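For anyone trying this locally, here is a minimal Python sketch of exercising the two endpoints above, assuming the vLLM OpenAI-compatible server is running on localhost:8000 with `--enable-lora` and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` (see the linked LoRA reference). The `lora_name`/`lora_path` payload fields follow that doc, and the adapter name and path below are placeholders.

```python
# Sketch: dynamically load/unload a LoRA adapter over vLLM's OpenAI-compatible server.
# Assumes the server was launched with --enable-lora and
# VLLM_ALLOW_RUNTIME_LORA_UPDATING=True; adapter name/path are placeholders.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default vLLM OpenAI server address


def _post(path: str, payload: dict) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()


def load_lora_adapter(lora_name: str, lora_path: str) -> str:
    # Registers a LoRA adapter at runtime so it can be targeted by model name.
    return _post("/v1/load_lora_adapter", {"lora_name": lora_name, "lora_path": lora_path})


def unload_lora_adapter(lora_name: str) -> str:
    # Removes a previously loaded adapter.
    return _post("/v1/unload_lora_adapter", {"lora_name": lora_name})


if __name__ == "__main__":
    # Placeholder adapter; replace with a real LoRA checkpoint path.
    print(load_lora_adapter("my-sql-lora", "/path/to/sql-lora-adapter"))
    print(unload_lora_adapter("my-sql-lora"))
```

Once loaded, the adapter name can be passed as the `model` field in a normal `/v1/completions` or `/v1/chat/completions` request.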
jianzs (Collaborator) commented Apr 17, 2025

✋🏻 I am working on implementing xPyD and will submit a wip PR later.

jesse996 added a commit to jesse996/vllm-ascend that referenced this issue May 12, 2025
What this PR does / why we need it?
According to RFC vllm-project#396 and roadmap vllm-project#448, this PR adds the relevant code to support LoRA in the v1 Engine.

Does this PR introduce any user-facing change?
The following OpenAI-compatible HTTP APIs will be supported:
/v1/load_lora_adapter
/v1/unload_lora_adapter

How was this patch tested?
git clone https://github.com/vllm-project/vllm.git
cd vllm/examples/offline_inference/ && python3 multilora_inference.py

Signed-off-by: jesse <szxfml@gmail.com>