
[Feature][1/2] Impl the connector based on the llmdatadist for v1 #684


Open · jianzs wants to merge 17 commits into main from zhengsj/datadist-conn-v1

Conversation

jianzs
Collaborator

@jianzs jianzs commented Apr 27, 2025

This PR implements the connector functionality for NPU based on LLMDataDist, building upon the connector API merged into vLLM v1 (vllm-project/vllm#15960). We've successfully tested various scenarios in offline environments:

  • Single-machine: Verified 2P2D testing with dense models (Llama) and MoE models (DeepSeek v2 Lite)
  • Two-machine: Completed 1P1D testing with DeepSeek R1 W8A8

Key implementation aspects include:

Cross-machine PD: LLMDataDist requires the NPU device IP to establish connections. Our approach uses a global rank table (JSON) on each machine containing:

  • Unique server IDs
  • The IP address and device ID of each card

At startup, each instance specifies its server ID in the connector's extra config so it can look up its own entry in the table (see the example config sketched below).
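
For illustration, a producer-side launch could pass a KV-transfer config like the sketch below. The kv_role, kv_rank, kv_buffer_device, and kv_parallel_size keys appear in this PR's review snippets; the connector name and the server_id key inside the extra config are placeholders and may not match the actual implementation.

import json

# Illustrative sketch only; entries marked "hypothetical" are assumptions,
# not confirmed by this PR.
kv_transfer_config = {
    "kv_connector": "LLMDataDistConnectorV1",  # hypothetical connector name
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",                  # "kv_consumer" on the decode node
    "kv_rank": 0,
    "kv_parallel_size": 2,
    "kv_connector_extra_config": {
        # Hypothetical key: selects this instance's entry in the global rank table.
        "server_id": "server-0",
    },
}

# Typically serialized and handed to vLLM at startup, e.g. via a
# --kv-transfer-config argument.
print(json.dumps(kv_transfer_config, indent=2))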

nPmD: Given that the community's nPmD design, particularly the router component API, is still evolving, we've implemented a solution using a meta server component (to be provided separately; a minimal sketch follows the list below) that:

  • Records prefill completion details (device and dp rank information)
  • Responds to decode node queries with prefill node locations
  • Enables decode nodes to retrieve data from appropriate prefill nodes
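
The actual meta server will be provided separately; purely for illustration, a minimal in-memory sketch of the bookkeeping described above could look like the following (all names are hypothetical):

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class PrefillLocation:
    server_id: str   # which prefill node produced the KV cache
    dp_rank: int     # data-parallel rank holding it
    device_id: int   # NPU card on that node

class MetaServer:
    """Toy stand-in for the meta server: records where prefill finished and
    answers decode-side queries."""

    def __init__(self) -> None:
        self._locations: Dict[str, PrefillLocation] = {}

    def record_prefill_done(self, request_id: str, loc: PrefillLocation) -> None:
        # Called by a prefill node once the KV cache for a request is ready.
        self._locations[request_id] = loc

    def query_prefill_location(self, request_id: str) -> Optional[PrefillLocation]:
        # Called by a decode node to find which prefill node to pull from.
        return self._locations.get(request_id)

# Usage: prefill registers, decode queries, then pulls KV via LLMDataDist.
meta = MetaServer()
meta.record_prefill_done("req-42", PrefillLocation("server-0", dp_rank=0, device_id=0))
print(meta.query_prefill_location("req-42"))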

We propose initially merging the 1P1D implementation, where the global rank table contains information for two nodes, allowing direct prefill node identification. The nPmD implementation can be refined and merged following community discussion.

Todo:

  • Implement 1P1D (one prefill, one decode) configuration support
  • Add sample script for automatic global rank table generation
  • Document global rank table format specifications
  • Provide user guide for PD (Prefill-Decode) functionality
  • Add test cases

re #448


Note:
A minor modification to vLLM's codebase is required to run this example successfully. The patch enables the scheduler process to locate the appropriate connector class by importing the necessary module.

The change should be made in vllm/v1/core/sched/scheduler.py, adding an import statement for vllm_ascend.distributed.

This is a temporary solution, and we need to implement a more elegant module discovery mechanism.

diff --git a/vllm/v1/core/sched/scheduler.py b/vllm/v1/core/sched/scheduler.py
index 69e7cc8ee..b15525971 100644
--- a/vllm/v1/core/sched/scheduler.py
+++ b/vllm/v1/core/sched/scheduler.py
@@ -31,6 +31,8 @@ from vllm.v1.structured_output import StructuredOutputManager
 
 logger = init_logger(__name__)
 
+# TODO(jianzs): Find the suitable place to put this.
+import vllm_ascend.distributed
 
 class Scheduler(SchedulerInterface):
 

Limits:

  1. We use a hash function (string_to_int64_hash) to convert request IDs to datadist request IDs. This conversion is lossy and may produce duplicate IDs, leading to duplicate CacheKeys and allocate_cache failures (see the sketch below).
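
For illustration, one plausible form of such a hash helper is sketched below; the actual implementation in this PR may differ, but any 64-bit digest carries the same collision risk.

import hashlib

def string_to_int64_hash(s: str) -> int:
    # Plausible sketch (not necessarily the PR's code): take the first 8 bytes
    # of a SHA-256 digest as a signed 64-bit integer.
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="little", signed=True)

# Distinct request IDs can collide in a 64-bit space, which would produce
# duplicate CacheKeys and allocate_cache failures on the datadist side.
print(string_to_int64_hash("cmpl-request2005"))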

@jianzs jianzs force-pushed the zhengsj/datadist-conn-v1 branch 3 times, most recently from 2208ee4 to eb2591c on April 28, 2025 03:20
@jianzs jianzs changed the title from [WIP][Feature] Impl the connector based on the llmdatadist for v1 to [Feature] Impl the connector based on the llmdatadist for v1 on Apr 28, 2025
@jianzs jianzs force-pushed the zhengsj/datadist-conn-v1 branch from 01e6bd3 to 70e4719 on April 28, 2025 07:05
@jianzs jianzs force-pushed the zhengsj/datadist-conn-v1 branch 4 times, most recently from f984e3b to c2318fe on May 6, 2025 00:59
@jianzs
Collaborator Author

jianzs commented May 6, 2025

Measured the time taken by KV transfers at different sequence lengths.

Environment:

  1. Two machines
  2. 1P1D
  3. DeepSeek R1 W8A8
  4. TP=16, DP=1

The stacked charts show higher times than the overall charts because each measured stage has NPU synchronization before and after it. In particular, extract kv, scatter update, and inject kv synchronize on every layer, resulting in significant host overhead.

  • load kv = pull cache + inject kv
  • save kv = extract kv + scatter update

[chart: KV transfer time vs. sequence length]

@jianzs jianzs force-pushed the zhengsj/datadist-conn-v1 branch 3 times, most recently from 0c21251 to 247e501 on May 7, 2025 14:19
@Kevin-XiongC

Hi, I tried this PR, but there seems to be a precision issue. The prompt is "What is the largest animal in the world?" with temperature = 0, using Qwen2.5 0.5B.

using PD disaggregated

{"id":"cmpl-request2005","object":"text_completion","created":1746675979,"model":"/deepseek/Qwen2.5-0.5B-Instruct","choices":[{"index":0,"text":" T of the\nTheorem's Conclusion.\nIn this case, we have to find the value that will make the equation true. We can do so by substituting the given values into the original equation","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":10,"total_tokens":50,"completion_tokens":40,"prompt_tokens_details":null}}

normal

{"id":"cmpl-request16718","object":"text_completion","created":1746676100,"model":"/deepseek/Qwen2.5-0.5B-Instruct","choices":[{"index":0,"text":" The largest animal in the world, as of 2021, was the blue whale. Blue whales can grow up to weigh over 150 tons and be up to 30","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":9,"total_tokens":49,"completion_tokens":40,"prompt_tokens_details":null}}

@jianzs jianzs force-pushed the zhengsj/datadist-conn-v1 branch from 247e501 to 974b99e on May 8, 2025 06:22
@jianzs
Collaborator Author

jianzs commented May 8, 2025

(Quoting @Kevin-XiongC's report above.)

Thank you for reporting this issue. I've tested with DeepSeek v2 Lite and Llama2 7B, and observed that:

  • The first response is incorrect
  • Second and subsequent responses are as expected

Could you confirm whether you're seeing incorrect responses consistently in your tests? And are your configurations, including parallelism, identical in both the disaggregated and non-disaggregated environments?

@Kevin-XiongC

Kevin-XiongC commented May 8, 2025

Yes, I see this consistently.

I used the same shell script 'disaggregated_prefill_multi_prefill.sh' but changed its TP to 1 and the model to Qwen2.5 0.5B.

For the aggregated run, I used the default settings below:

 python -m vllm.entrypoints.openai.api_server --model Qwen2.5-0.5B-Instruct

As for the rank table:

{
    "server_group_list":[
        {
            "group_id": "0",
            "version": "1.0",
            "server_count": "1",
            "server_list": [
                {
                    "server_ip": "10.172.116.166",
                    "container_ip": "10.172.116.166"
                }
            ],
            
            "status": "completed"
        },
        {
            "group_id": "1",
            "version": "1.0",
            "server_count": "1",
            "server_list": [
                {
                    "server_ip": "10.172.116.166",
                    "server_id": "server-0",
                    "device": [
                        {
                            "device_id": "0",
                            "device_ip": "172.22.17.1",
                            "rank_id": "0"
                        }
                    ],
                    "container_ip": "10.172.116.166"
                }
            ],
            
            "status": "completed"
        },
        {
            "group_id": "2",
            "version": "1.0",
            "server_count": "1",
            "server_list": [
                {
                    "server_ip": "10.172.116.166",
                    "server_id": "server-1",
                    "device": [
                        {
                            "device_id": "4",
                            "device_ip": "172.22.17.5",
                            "rank_id": "0"
                        }
                    ],
                    "container_ip": "10.172.116.166"
                }
            ],
            
            "status": "completed"
        }
    ]
}

@jianzs
Collaborator Author

jianzs commented May 8, 2025

(Quoting @Kevin-XiongC's comment above.)

I fixed an accuracy issue. Please try again.

@Kevin-XiongC

(Quoting the exchange above.)

Hi, thanks for your work. Unfortunately, it still produces inconsistent results with the 0.5B model, but when I switch to the 1.5B model, the disaggregated version produces the correct output. I hope this helps.

@jianzs
Collaborator Author

jianzs commented May 9, 2025

(Quoting the exchange above.)

Thanks. Fixed a bug, please try again.

@jianzs jianzs force-pushed the zhengsj/datadist-conn-v1 branch from 7863de0 to 03ae1bf on May 9, 2025 02:58
@Kevin-XiongC

(Quoting the exchange above.)

Great! It works for me now.

self.num_layers, kv_cache_shape, kv_hidden_dtype)
self._attach_kv_buffer(kv_buffer)

target_tp_rank = self.tp_rank % min(
Collaborator

Why take the modulo with the minimum of the prefill/decode TP sizes? Couldn't it just use the TP rank directly?

Collaborator Author

This design originally aimed to support heterogeneous parallelism between prefill and decode phases. For scenarios where prefill TP size < decode TP size, each rank could determine its connection count using the modulo method.

However, due to current LLMDataDist constraints, decode TP size must be ≤ prefill TP size. Consequently, using either modulo operation or direct TP rank assignment achieves identical results.
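
For illustration, the mapping described above behaves like the following sketch (not the PR's exact code):

def target_tp_rank(local_tp_rank: int, prefill_tp_size: int, decode_tp_size: int) -> int:
    # Map a local TP rank onto a peer rank via modulo over the smaller TP size,
    # which is what would support heterogeneous prefill/decode TP sizes.
    return local_tp_rank % min(prefill_tp_size, decode_tp_size)

# Under the current constraint (decode TP <= prefill TP), a decode rank r maps
# to prefill rank r, i.e. the modulo is equivalent to using the rank directly.
for r in range(4):                       # decode TP = 4
    assert target_tp_rank(r, 8, 4) == r  # prefill TP = 8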

jianzs added 4 commits May 11, 2025 10:56
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Ensure correct input for npu_reshape_and_cache function

The 'slot_indices' parameter of npu_reshape_and_cache must be:
- A torch.int32 tensor
- Located on the NPU device

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
jianzs added 10 commits May 11, 2025 10:56
Eliminates the need to launch the meta server in the 1p1d environment.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_rank": 0,
"kv_parallel_size": 2,
Collaborator

What does this kv_parallel_size do?

Collaborator Author

The v0 implementation needed this, but I'm unsure if it's still necessary.

@jianzs jianzs force-pushed the zhengsj/datadist-conn-v1 branch from 0e82eb4 to 5c752ed on May 11, 2025 02:58
@ganyi1996ppo
Collaborator

The code looks good to me in general, but I'm not very familiar with llmdatadist. Can @whx-sjtu review this PR for some of its details?

device_ip: str
dp_rank: int
tp_rank: int
cluster_id: int
Contributor

You may need to add a new member super_device_id if you want to run disaggregated prefill on an A3 super node.

datadist_request_id = string_to_int64_hash(request.request_id)
kv_cache_key = llm_datadist.CacheKey(remote_cluster_id,
datadist_request_id, 1)
self.llm_datadist_engine.kv_transfer.pull_cache(
@mjp9527 mjp9527 May 12, 2025

Good job! I'm confused about how prefix caching would be implemented. If prefix caching is enabled, would the decode instance run the remaining prompt?

Collaborator Author

Thanks for reporting. Indeed, I haven't considered prefix caching, and I don't know whether it's compatible.

@@ -0,0 +1,85 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator

Could you add a quick README in the disaggregated-prefill-v1 folder describing how to use the example, i.e. run bash disaggregated_prefill_multi_prefill.sh and then xxxx

if self.role == llm_datadist.LLMRole.PROMPT:
options["llm.listenIpInfo"] = f"{self.local_device_ip}:26000"
self.datadist_engine.init(options)
self.kv_transfer = self.datadist_engine.kv_cache_manager
Collaborator

AFAIK, we're working on replacing kv_cache_manager with cache_manager, @zzzzwwjj


AFAIK, we're working on replacing kv_cache_manager with cache_manager, @zzzzwwjj

@wangxiyuan
Hi, I am playing around with cache_manager right now, but I'm having difficulty running the demo at https://gitee.com/ascend/samples/blob/master/python/level1_single_api/10_llm_data_dist/cache_manager_api_samples/pull_blocks_sample.py

with the error below:

llm_datadist.status.LLMException: [link] failed, error code is LLMStatusCode.LLM_LINK_FAILED, {1: 0, 2: 1}.

I am wondering whether this class is available in a single-node case. In addition, I added the environment variable as the documentation states here.

axis=-2)

# Release reference count
self.llm_datadist_engine.kv_transfer.deallocate_cache(kv_buffer)
Collaborator

I'm not very familiar with llmdatadist. Why deallocate_cache once you scatter_update the KV cache into datadist_kv_cache? The NPU tensor you allocated seems to be based on this kv_buffer's address.

Collaborator Author

A newly allocated KV buffer has two references: one by the Cache ID, which is released by deallocate_cache; the other by the Cache Key, automatically released when a pull cache request is received. This deallocate_cache call releases the Cache ID reference, ensuring the KV buffer is automatically freed when a pull cache request arrives.
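
As a toy model of this reference counting (not the llm_datadist API itself), the lifecycle can be pictured as follows:

class KVBufferRefs:
    """Toy sketch: a buffer starts with two owners, the Cache ID and the
    CacheKey, and its memory is released only once both are gone."""

    def __init__(self) -> None:
        self.refs = {"cache_id", "cache_key"}  # both held after allocate_cache

    def deallocate_cache(self) -> None:
        # Producer drops its Cache ID reference after writing the KV cache.
        self.refs.discard("cache_id")
        self._maybe_free()

    def on_pull_cache(self) -> None:
        # The CacheKey reference is released automatically when the consumer's
        # pull request is served.
        self.refs.discard("cache_key")
        self._maybe_free()

    def _maybe_free(self) -> None:
        if not self.refs:
            print("KV buffer freed")

buf = KVBufferRefs()
buf.deallocate_cache()  # still alive: the CacheKey reference remains
buf.on_pull_cache()     # prints "KV buffer freed"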

Collaborator

OK, sounds reasonable. That means the life cycle is entirely maintained by llmdatadist once we allocate the cache with a CacheKey, right?

Collaborator Author

Yep

Collaborator

Got it, thanks for the explanation.

jianzs added 3 commits May 14, 2025 12:08
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
@jianzs jianzs force-pushed the zhengsj/datadist-conn-v1 branch from ea7522f to 540a57e on May 22, 2025 06:21
@jianzs jianzs changed the title from [Feature] Impl the connector based on the llmdatadist for v1 to [Feature][1/2] Impl the connector based on the llmdatadist for v1 on May 22, 2025
@jianzs
Collaborator Author

jianzs commented May 22, 2025

@wangxiyuan @Yikun Sorry, I've been swamped lately and haven't had time to merge the PR yet. The vLLM connector API changed. Let's merge this PR first, then I'll submit a new one for the new API as soon as possible.

@jianzs jianzs added the ready (read for review) label May 22, 2025
Labels: module:core, ready (read for review)
Projects: None yet
6 participants