[Feature][1/2] Impl the connector based on the llmdatadist for v1 #684
Conversation
Measure the time it takes for KV transfers at different sequence lengths. Environment:
The stacked charts show higher times than the overall chart because each measured stage performs NPU synchronization before and after it; extract kv, scatter update, and inject kv in particular synchronize on every layer, which introduces significant host overhead.
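For reference, a minimal sketch of the per-stage measurement described above (illustrative, not the actual benchmark code; assumes a torch_npu environment where torch.npu.synchronize() is available):

import time
import torch
import torch_npu  # noqa: F401  # makes torch.npu available

def timed_stage(fn, *args, **kwargs):
    # Synchronize the NPU before and after the stage so the wall-clock
    # window covers all device work; this is also the source of the extra
    # host overhead mentioned above.
    torch.npu.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    torch.npu.synchronize()
    return out, time.perf_counter() - start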
Hi, I tried this PR, but there seems to be a precision issue. With the prompt "What is the largest animal in the world?", temperature == 0, and Qwen2.5 0.5B, the PD-disaggregated output differs from the normal (aggregated) output.
Thank you for reporting this issue. I've tested with DeepSeek V2 Lite and Llama2 7B, and observed that:
Could you confirm whether you're seeing incorrect responses consistently in your tests? And are your configurations, including parallelism, identical in the disaggregated and standalone environments?
Yes, I see this consistently. I used the same shell script, disaggregated_prefill_multi_prefill.sh, but changed its TP to 1 and the model to Qwen2.5 0.5B, and did the same for the aggregated run. I used the default settings below:
python -m vllm.entrypoints.openai.api_server --model Qwen2.5-0.5B-Instruct
As for the rank table:
{
"server_group_list":[
{
"group_id": "0",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"container_ip": "10.172.116.166"
}
],
"status": "completed"
},
{
"group_id": "1",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"server_id": "server-0",
"device": [
{
"device_id": "0",
"device_ip": "172.22.17.1",
"rank_id": "0"
}
],
"container_ip": "10.172.116.166"
}
],
"status": "completed"
},
{
"group_id": "2",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"server_id": "server-1",
"device": [
{
"device_id": "4",
"device_ip": "172.22.17.5",
"rank_id": "0"
}
],
"container_ip": "10.172.116.166"
}
],
"status": "completed"
}
]
}
I fixed an accuracy issue. Please try again.
Hi, thanks for your work. Unfortunately, it still produces inconsistent results with the 0.5B model, but when I switch to the 1.5B model, the disaggregated version produces the correct output. I hope this helps.
Thanks. Fixed a bug, please try again.
Great! It works for me now.
self.num_layers, kv_cache_shape, kv_hidden_dtype)
self._attach_kv_buffer(kv_buffer)
...
target_tp_rank = self.tp_rank % min(
Why take the modulo by the minimum of the prefill and decode TP sizes? Can't it just use the TP rank directly?
This design originally aimed to support heterogeneous parallelism between prefill and decode phases. For scenarios where prefill TP size < decode TP size, each rank could determine its connection count using the modulo method.
However, due to current LLMDataDist constraints, decode TP size must be ≤ prefill TP size. Consequently, using either modulo operation or direct TP rank assignment achieves identical results.
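A minimal sketch of the mapping described above (function and parameter names are illustrative, not the PR's code):

def target_tp_rank(tp_rank: int, prefill_tp_size: int, decode_tp_size: int) -> int:
    # Each decode rank maps onto a prefill rank. When decode TP size
    # <= prefill TP size (the current LLMDataDist constraint), the modulo
    # is a no-op and this equals tp_rank.
    return tp_rank % min(prefill_tp_size, decode_tp_size)

# Homogeneous case: identical to using the TP rank directly.
assert target_tp_rank(3, prefill_tp_size=4, decode_tp_size=4) == 3
# Heterogeneous case the design anticipated (prefill TP < decode TP).
assert target_tp_rank(3, prefill_tp_size=2, decode_tp_size=4) == 1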
"kv_buffer_device": "npu", | ||
"kv_role": "kv_producer", | ||
"kv_rank": 0, | ||
"kv_parallel_size": 2, |
What does this kv_parallel_size do?
The v0 implementation needed this, but I'm unsure if it's still necessary.
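For context, the producer-side config from the diff above assembled as a complete dict (the connector name is a placeholder, not necessarily the one this PR registers):

kv_transfer_config = {
    "kv_connector": "LLMDataDistConnector",  # placeholder name, an assumption
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_rank": 0,
    # Carried over from the v0 implementation; per the discussion above,
    # it may no longer be required in v1.
    "kv_parallel_size": 2,
}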
The code looks good to me in general, but I'm not very familiar with llmdatadist. Can @whx-sjtu review this PR for some of its details?
device_ip: str
dp_rank: int
tp_rank: int
cluster_id: int
You may need to add a new member super_device_id if you want to run disaggregated prefill on an A3 super node.
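A hypothetical sketch of that extension (the class name and default value are assumptions; the fields match the diff above):

from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentMetadata:
    device_ip: str
    dp_rank: int
    tp_rank: int
    cluster_id: int
    # Suggested addition: only needed when running disaggregated prefill
    # on an A3 super node.
    super_device_id: Optional[int] = None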
@jianzs IMO, if we can't do an e2e test, I'd prefer to merge after the issue is fixed, unless this blocks something.
@Yikun I saw the connector @ganyi1996ppo submitted. I'm wondering if there's still a need for both of them, as their implementations are very similar.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Ensure correct input for npu_reshape_and_cache function

The slot_indices parameter of npu_reshape_and_cache must be:
- a torch.int32 tensor
- located on the NPU device

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
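A minimal sketch of the constraint this commit enforces (the helper name is illustrative; the kernel's exact entry point in torch_npu is not shown here):

import torch

def prepare_slot_indices(slot_mapping: torch.Tensor) -> torch.Tensor:
    # npu_reshape_and_cache requires slot indices as a torch.int32 tensor
    # that already resides on the NPU device.
    return slot_mapping.to(device="npu", dtype=torch.int32)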
Eliminates the need to launch the meta server in the 1p1d environment. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
This PR implements the connector functionality for NPU based on LLMDataDist, building upon the connector API merged in vLLM v1. (vllm-project/vllm#15960) We've successfully tested various scenarios in offline environments:
Key implementation aspects include:
Cross-machine PD: LLMDataDist requires the NPU device IP to establish connections. Our approach uses a global rank table (JSON) on each machine (see the parsing sketch after this list) containing:
nPmD: Given that the community's nPmD design, particularly the router component API, is still evolving, we've implemented a solution using a meta server component (to be provided separately) that:
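A minimal parsing sketch based on the rank table JSON posted earlier in this thread (the group layout is taken from that example; the helper name and file path are illustrative):

import json

def load_device_ips(path: str, group_id: str) -> list[str]:
    # Resolve the NPU device IPs of one server group from the global rank table.
    with open(path) as f:
        rank_table = json.load(f)
    for group in rank_table["server_group_list"]:
        if group["group_id"] == group_id:
            return [
                dev["device_ip"]
                for server in group["server_list"]
                for dev in server.get("device", [])
            ]
    raise KeyError(f"group {group_id} not found in rank table")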
We propose initially merging the 1P1D implementation, where the global rank table contains information for two nodes, allowing direct prefill node identification. The nPmD implementation can be refined and merged following community discussion.
Todo:
re #448
Note:
A minor modification to vLLM's codebase is required to run this example successfully. The patch enables the scheduler process to locate the appropriate connector class by importing the necessary module. The change should be made in vllm/v1/core/sched/scheduler.py, adding an import statement for vllm_ascend.distributed. This is a temporary solution; we need to implement a more elegant module discovery mechanism.
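The temporary patch amounts to a single import, along these lines (a sketch of the change described above, not the exact diff):

# vllm/v1/core/sched/scheduler.py
import vllm_ascend.distributed  # noqa: F401  # lets the scheduler locate the NPU connector class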
Limits:
We use a hash function (string_to_int64_hash) to convert request IDs to datadist request IDs. This conversion is lossy, potentially creating duplicate IDs, leading to duplicate CacheKeys and allocate_cache failures.
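To illustrate the lossiness, a sketch of a string_to_int64_hash-style conversion (an assumed implementation, not the actual helper in this PR):

import hashlib

def string_to_int64_hash(s: str) -> int:
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    # Truncating the digest to 63 bits means two distinct request IDs can
    # map to the same datadist request ID, causing duplicate CacheKeys and
    # allocate_cache failures.
    return int.from_bytes(digest[:8], "big") & ((1 << 63) - 1)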