[Feature][1/2] Impl the connector based on the llmdatadist for v1 #684
Conversation
Measure the time it takes for KV transfers at different sequence lengths. Environment:
The stacked charts show higher times than the overall chart because each measured stage performs NPU synchronization before and after it; extract kv, scatter update, and inject kv in particular synchronize on every layer, which introduces significant host overhead.
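For reference, a minimal sketch of the per-stage measurement described above (illustrative, not the actual benchmark code; assumes a torch_npu environment where torch.npu.synchronize() is available):

import time
import torch
import torch_npu  # noqa: F401  # makes torch.npu available

def timed_stage(fn, *args, **kwargs):
    # Synchronize the NPU before and after the stage so the wall-clock
    # window covers all device work; this is also the source of the extra
    # host overhead mentioned above.
    torch.npu.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    torch.npu.synchronize()
    return out, time.perf_counter() - start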
Hi, I tried this PR, but there seems to be a precision issue. With the prompt "What is the largest animal in the world?", temperature == 0, and Qwen2.5 0.5B, the PD-disaggregated output differs from the normal (aggregated) output.
Thank you for reporting this issue. I've tested with DeepSeek V2 Lite and Llama2 7B, and observed that:
Could you confirm whether you're seeing incorrect responses consistently in your tests? And are your configurations, including parallelism, identical in the disaggregated and standalone environments?
Yes, I see this consistently. I used the same shell script, disaggregated_prefill_multi_prefill.sh, but changed its TP to 1 and the model to Qwen2.5 0.5B, and did the same for the aggregated run. I used the default settings below:
python -m vllm.entrypoints.openai.api_server --model Qwen2.5-0.5B-Instruct
As for the rank table:
{
"server_group_list":[
{
"group_id": "0",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"container_ip": "10.172.116.166"
}
],
"status": "completed"
},
{
"group_id": "1",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"server_id": "server-0",
"device": [
{
"device_id": "0",
"device_ip": "172.22.17.1",
"rank_id": "0"
}
],
"container_ip": "10.172.116.166"
}
],
"status": "completed"
},
{
"group_id": "2",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"server_id": "server-1",
"device": [
{
"device_id": "4",
"device_ip": "172.22.17.5",
"rank_id": "0"
}
],
"container_ip": "10.172.116.166"
}
],
"status": "completed"
}
]
}
I fixed an accuracy issue. Please try again.
Hi, thanks for your work. Unfortunately, it still produces inconsistent results with the 0.5B model, but when I switch to the 1.5B model, the disaggregated version produces the correct output. I hope this helps.
Thanks. Fixed a bug, please try again.
Great! It works for me now.
self.num_layers, kv_cache_shape, kv_hidden_dtype)
self._attach_kv_buffer(kv_buffer)
...
target_tp_rank = self.tp_rank % min(
Why take the modulo by the minimum of the prefill and decode TP sizes? Can't it just use the TP rank directly?
This design originally aimed to support heterogeneous parallelism between prefill and decode phases. For scenarios where prefill TP size < decode TP size, each rank could determine its connection count using the modulo method.
However, due to current LLMDataDist constraints, decode TP size must be ≤ prefill TP size. Consequently, using either modulo operation or direct TP rank assignment achieves identical results.
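A minimal sketch of the mapping described above (function and parameter names are illustrative, not the PR's code):

def target_tp_rank(tp_rank: int, prefill_tp_size: int, decode_tp_size: int) -> int:
    # Each decode rank maps onto a prefill rank. When decode TP size
    # <= prefill TP size (the current LLMDataDist constraint), the modulo
    # is a no-op and this equals tp_rank.
    return tp_rank % min(prefill_tp_size, decode_tp_size)

# Homogeneous case: identical to using the TP rank directly.
assert target_tp_rank(3, prefill_tp_size=4, decode_tp_size=4) == 3
# Heterogeneous case the design anticipated (prefill TP < decode TP).
assert target_tp_rank(3, prefill_tp_size=2, decode_tp_size=4) == 1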
"kv_buffer_device": "npu", | ||
"kv_role": "kv_producer", | ||
"kv_rank": 0, | ||
"kv_parallel_size": 2, |
What does this kv_parallel_size do?
The v0 implementation needed this, but I'm unsure if it's still necessary.
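For context, the producer-side config from the diff above assembled as a complete dict (the connector name is a placeholder, not necessarily the one this PR registers):

kv_transfer_config = {
    "kv_connector": "LLMDataDistConnector",  # placeholder name, an assumption
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_rank": 0,
    # Carried over from the v0 implementation; per the discussion above,
    # it may no longer be required in v1.
    "kv_parallel_size": 2,
}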
The code looks good to me in general, but I'm not very familiar with llmdatadist. Can @whx-sjtu review this PR for some of its details?
device_ip: str
dp_rank: int
tp_rank: int
cluster_id: int
You may need to add a new member super_device_id if you want to run disaggregated prefill on an A3 super node.
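A hypothetical sketch of that extension (the class name and default value are assumptions; the fields match the diff above):

from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentMetadata:
    device_ip: str
    dp_rank: int
    tp_rank: int
    cluster_id: int
    # Suggested addition: only needed when running disaggregated prefill
    # on an A3 super node.
    super_device_id: Optional[int] = None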
@jianzs IMO, if we can't do an e2e test, I'd prefer to merge after the issue is fixed, unless this blocks something.
@Yikun I saw the connector @ganyi1996ppo submitted. I'm wondering if there's still a need for both of them, as their implementations are very similar.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Ensure correct input for npu_reshape_and_cache function

The slot_indices parameter of npu_reshape_and_cache must be:
- a torch.int32 tensor
- located on the NPU device

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
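A minimal sketch of the constraint this commit enforces (the helper name is illustrative; the kernel's exact entry point in torch_npu is not shown here):

import torch

def prepare_slot_indices(slot_mapping: torch.Tensor) -> torch.Tensor:
    # npu_reshape_and_cache requires slot indices as a torch.int32 tensor
    # that already resides on the NPU device.
    return slot_mapping.to(device="npu", dtype=torch.int32)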
Eliminates the need to launch the meta server in the 1p1d environment. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
This PR implements the connector functionality for NPU based on LLMDataDist, building upon the connector API merged in vLLM v1. (vllm-project/vllm#15960) We've successfully tested various scenarios in offline environments:
Key implementation aspects include:
Cross-machine PD: LLMDataDist requires the NPU device IP to establish connections. Our approach uses a global rank table (JSON) on each machine (see the parsing sketch after this list) containing:
nPmD: Given that the community's nPmD design, particularly the router component API, is still evolving, we've implemented a solution using a meta server component (to be provided separately) that:
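A minimal parsing sketch based on the rank table JSON posted earlier in this thread (the group layout is taken from that example; the helper name and file path are illustrative):

import json

def load_device_ips(path: str, group_id: str) -> list[str]:
    # Resolve the NPU device IPs of one server group from the global rank table.
    with open(path) as f:
        rank_table = json.load(f)
    for group in rank_table["server_group_list"]:
        if group["group_id"] == group_id:
            return [
                dev["device_ip"]
                for server in group["server_list"]
                for dev in server.get("device", [])
            ]
    raise KeyError(f"group {group_id} not found in rank table")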
We propose initially merging the 1P1D implementation, where the global rank table contains information for two nodes, allowing direct prefill node identification. The nPmD implementation can be refined and merged following community discussion.
Todo:
re #448
Note:
A minor modification to vLLM's codebase is required to run this example successfully. The patch enables the scheduler process to locate the appropriate connector class by importing the necessary module. The change should be made in vllm/v1/core/sched/scheduler.py, adding an import statement for vllm_ascend.distributed. This is a temporary solution; we need to implement a more elegant module discovery mechanism.
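The temporary patch amounts to a single import, along these lines (a sketch of the change described above, not the exact diff):

# vllm/v1/core/sched/scheduler.py
import vllm_ascend.distributed  # noqa: F401  # lets the scheduler locate the NPU connector class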
Limits:
We use a hash function (string_to_int64_hash) to convert request IDs to datadist request IDs. This conversion is lossy, potentially creating duplicate IDs, leading to duplicate CacheKeys and allocate_cache failures.
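To illustrate the lossiness, a sketch of a string_to_int64_hash-style conversion (an assumed implementation, not the actual helper in this PR):

import hashlib

def string_to_int64_hash(s: str) -> int:
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    # Truncating the digest to 63 bits means two distinct request IDs can
    # map to the same datadist request ID, causing duplicate CacheKeys and
    # allocate_cache failures.
    return int.from_bytes(digest[:8], "big") & ((1 << 63) - 1)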