vLLM Ascend Roadmap Q2 2025 #448

Open · 7 of 38 tasks
Yikun opened this issue Mar 31, 2025 · 5 comments
Comments

Yikun (Collaborator) commented Mar 31, 2025

This is a living document!


Our vision is to enable vLLM to run seamlessly on Ascend NPU, and we are fully committed to making vLLM one of the best engines for Ascend NPU. In Q1 2025, we provided initial support for vLLM on Ascend NPU.

In Q2 2025, we will focus on four themes: vLLM Ascend for Production, Performance Optimization, Key Features, and Ecosystem Connect.

1. Performance Optimization

We will focus on performance optimization for dense models (Qwen/Llama/Qwen-VL) and MoE models (DeepSeek V3/R1), so that users of vLLM Ascend can trust its performance to be competitive on Ascend NPU.

2. vLLM Ascend for Production

In line with vLLM, vLLM Ascend is designed for production. The first official release based on vLLM 0.7.3 will be published, and we will also actively drive the key features of vLLM v0.8.x/v1 toward production availability.

3. Key Features

We will focus on the integration and support of key lifecycle workflows for model training (SFT / RL) and inference (single node / distributed).

3.1 Workflows

Cluster Scale Serving

Core feature support

RLHF

3.2 Models support

  • (P0) Quantization support: w8a8 (DeepSeek R1 with 2 nodes): [quantization] Support w8a8 quantization #580
  • (P1) Support for upcoming new models: Qwen3 / DeepSeek-R2 / Llama 4 / DeepSeek DRM series
  • (P1) Qwen-Omni thinker
  • (P1) Model format support: gguf
  • (Help wanted) Quantization support: w4a16/w4a8 (DeepSeek R1 with 1 node)
  • (Help wanted) Whisper
  • (Help wanted) enc-dec
  • (Help wanted) Gemma

3.3 User / Developer Experience

Distributions

Docs

Dashboard

  • Perf Dashboard
  • Accuracy Dashboard

3.4 Hardware support

4. Ecosystem Connect

Seamless integration of key lifecycle components with vLLM Ascend is essential, so we are also actively connecting with the broader ecosystem.


If an item you want is not on the roadmap, your suggestions and contributions are strongly welcomed! Please feel free to comment in this thread, open a feature request, or create an RFC.

Historical Roadmap: #71

AlphaINF commented:

Hello, will Huawei's Ascend Atlas 300I Duo devices be supported in Q2? And can the following features be adapted at the same time?

  1. LoRA
  2. tensor-parallel

Yikun (Collaborator, Author) commented Mar 31, 2025

@AlphaINF We don't have a concrete plan for it yet, but we welcome contributions; feel free to open a PR or an issue/RFC.

Based on the experience of supporting it in another engine [1], getting initial support working is not difficult, but meeting the performance requirements will take more effort and bandwidth.

[1] ggml-org/llama.cpp#10216

AlphaINF commented Apr 1, 2025

@Yikun Thanks!

shen-shanshan (Collaborator) commented:

Complement to Key Features: [P1] Structured Output on V1: #177

Yikun mentioned this issue Apr 9, 2025
wangxiyuan pushed a commit that referenced this issue Apr 17, 2025
### What this PR does / why we need it?
According to the RFC [[RFC]: Join the MultiLora and MultiLora Dynamic
Serving feature develop #396](#396) and the
[vLLM Ascend Roadmap Q2 2025 #448](#448), this PR adds the relevant
code to support (1) Multi-LoRA and (2) Multi-LoRA Dynamic Serving.

LoRA reference: [LoRA reference](https://docs.vllm.ai/en/latest/features/lora.html)

### Does this PR introduce _any_ user-facing change?

The following OpenAI-compatible HTTP APIs will be supported:
/v1/load_lora_adapter
/v1/unload_lora_adapter

### How was this patch tested?
git clone https://github.com/vllm-project/vllm.git
cd vllm/examples/offline_inference/ && python3 multilora_inference.py

---------

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
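For anyone trying this locally, here is a minimal Python sketch of exercising the two endpoints above, assuming the vLLM OpenAI-compatible server is running on localhost:8000 with `--enable-lora` and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` (see the linked LoRA reference). The `lora_name`/`lora_path` payload fields follow that doc, and the adapter name and path below are placeholders.

```python
# Sketch: dynamically load/unload a LoRA adapter over vLLM's OpenAI-compatible server.
# Assumes the server was launched with --enable-lora and
# VLLM_ALLOW_RUNTIME_LORA_UPDATING=True; adapter name/path are placeholders.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default vLLM OpenAI server address


def _post(path: str, payload: dict) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()


def load_lora_adapter(lora_name: str, lora_path: str) -> str:
    # Registers a LoRA adapter at runtime so it can be targeted by model name.
    return _post("/v1/load_lora_adapter", {"lora_name": lora_name, "lora_path": lora_path})


def unload_lora_adapter(lora_name: str) -> str:
    # Removes a previously loaded adapter.
    return _post("/v1/unload_lora_adapter", {"lora_name": lora_name})


if __name__ == "__main__":
    # Placeholder adapter; replace with a real LoRA checkpoint path.
    print(load_lora_adapter("my-sql-lora", "/path/to/sql-lora-adapter"))
    print(unload_lora_adapter("my-sql-lora"))
```

Once loaded, the adapter name can be passed as the `model` field in a normal `/v1/completions` or `/v1/chat/completions` request.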
jianzs (Collaborator) commented Apr 17, 2025

✋🏻 I am working on implementing xPyD and will submit a wip PR later.

jesse996 added a commit to jesse996/vllm-ascend that referenced this issue May 12, 2025
What this PR does / why we need it?
According to RFC vllm-project#396 and roadmap vllm-project#448, this PR adds the relevant code to support LoRA in the v1 Engine.

Does this PR introduce any user-facing change?
The following OpenAI-compatible HTTP APIs will be supported:
/v1/load_lora_adapter
/v1/unload_lora_adapter

How was this patch tested?
git clone https://github.com/vllm-project/vllm.git
cd vllm/examples/offline_inference/ && python3 multilora_inference.py

Signed-off-by: jesse <szxfml@gmail.com>