[Feature][1/2] Impl the connector based on the llmdatadist for v1 #684
Conversation
Measured the time KV transfers take at different sequence lengths. Environment:
The stacked charts show higher times than the overall charts because each measured stage performs NPU synchronization before and after it; in particular, extract kv, scatter update, and inject kv synchronize on every layer, which introduces significant host overhead.
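For illustration, a minimal sketch of how such per-stage timing can be collected (assuming torch_npu provides the torch.npu namespace; the per-layer stage functions named in the usage comment are placeholders, not the connector's API):

```python
import time

import torch
import torch_npu  # noqa: F401  (registers the NPU backend)


def time_stage(fn, *args, **kwargs):
    """Time one stage with NPU synchronization before and after.

    These surrounding synchronizations are exactly the per-layer host
    overhead described above: each stage waits for the device to drain
    before and after it runs.
    """
    torch.npu.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    torch.npu.synchronize()
    return result, time.perf_counter() - start


# Hypothetical usage, timing each stage per layer:
# for layer_idx in range(num_layers):
#     _, t_extract = time_stage(extract_kv, layer_idx)
#     _, t_scatter = time_stage(scatter_update, layer_idx)
#     _, t_inject = time_stage(inject_kv, layer_idx)
```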
Hi, I tried this PR, but there seems to be a precision issue. The prompt is "What is the largest animal in the world?" with temperature == 0, using Qwen2.5 0.5B in a PD-disaggregated setup.
[output screenshots: PD disaggregated vs. normal]
Thank you for reporting this issue. I've tested with DeepSeek V2 Lite and Llama2 7B, and observed that:
Could you confirm whether you're seeing incorrect responses consistently in your tests? And are your configurations, including parallelism, identical in both the disaggregated and standalone environments?
Yes, I see this consistently. I used the same script, disaggregated_prefill_multi_prefill.sh, but changed its tp to 1 and the model to Qwen2.5 0.5B, and did the same for the aggregated run. I used the default settings below:
python -m vllm.entrypoints.openai.api_server --model Qwen2.5-0.5B-Instruct
As for the rank table:
{
"server_group_list":[
{
"group_id": "0",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"container_ip": "10.172.116.166"
}
],
"status": "completed"
},
{
"group_id": "1",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"server_id": "server-0",
"device": [
{
"device_id": "0",
"device_ip": "172.22.17.1",
"rank_id": "0"
}
],
"container_ip": "10.172.116.166"
}
],
"status": "completed"
},
{
"group_id": "2",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_ip": "10.172.116.166",
"server_id": "server-1",
"device": [
{
"device_id": "4",
"device_ip": "172.22.17.5",
"rank_id": "0"
}
],
"container_ip": "10.172.116.166"
}
],
"status": "completed"
}
]
}
I fixed an accuracy issue. Please try again.
Hi, thanks for your work. Unfortunately, it still produces inconsistent results with the 0.5B model, but when I switch to the 1.5B model, the disaggregated version produces the correct output. I hope that helps.
Thanks. Fixed a bug, please try again.
Great! It works for me now.
self.num_layers, kv_cache_shape, kv_hidden_dtype)
self._attach_kv_buffer(kv_buffer)

target_tp_rank = self.tp_rank % min(
Why mod by the min of the prefill/decode TP sizes? Can't it just use the TP rank directly?
This design originally aimed to support heterogeneous parallelism between prefill and decode phases. For scenarios where prefill TP size < decode TP size, each rank could determine its connection count using the modulo method.
However, due to current LLMDataDist constraints, decode TP size must be ≤ prefill TP size. Consequently, using either modulo operation or direct TP rank assignment achieves identical results.
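For illustration, a minimal sketch of this rank mapping (the function and parameter names are hypothetical):

```python
def target_tp_rank(tp_rank: int, prefill_tp_size: int, decode_tp_size: int) -> int:
    """Map a local TP rank to the peer TP rank it connects to.

    When prefill TP < decode TP, the modulo folds several decode ranks
    onto the same prefill rank. Under the current LLMDataDist constraint
    (decode TP <= prefill TP), min(...) equals the local TP size on the
    decode side, so the expression reduces to plain `tp_rank`.
    """
    return tp_rank % min(prefill_tp_size, decode_tp_size)


assert target_tp_rank(3, prefill_tp_size=4, decode_tp_size=4) == 3  # same as tp_rank
assert target_tp_rank(3, prefill_tp_size=2, decode_tp_size=4) == 1  # folds 4 ranks onto 2
```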
Ensure correct input for npu_reshape_and_cache function
The 'slot_indices' parameter of npu_reshape_and_cache must be:
- A torch.int32 tensor
- Located on the NPU device
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Eliminates the need to launch the meta server in the 1p1d environment. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
"kv_buffer_device": "npu", | ||
"kv_role": "kv_producer", | ||
"kv_rank": 0, | ||
"kv_parallel_size": 2, |
What does this kv_parallel_size do?
The v0 implementation needed this, but I'm unsure if it's still necessary.
The code looks good to me in general, but I'm not very familiar with the llmdatadist. Can @whx-sjtu review this PR for some of its details?
device_ip: str
dp_rank: int
tp_rank: int
cluster_id: int
You may need to add a new member super_device_id if you want to run disaggregated-prefill on A3 super node.
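For illustration, a sketch of the extended metadata this suggests (the class name and the Optional default are assumptions; only super_device_id is new relative to the diff above):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceInfo:  # hypothetical name, mirroring the fields in the diff
    device_ip: str
    dp_rank: int
    tp_rank: int
    cluster_id: int
    # Suggested addition for A3 super-node disaggregated prefill:
    super_device_id: Optional[int] = None
```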
datadist_request_id = string_to_int64_hash(request.request_id)
kv_cache_key = llm_datadist.CacheKey(remote_cluster_id,
                                     datadist_request_id, 1)
self.llm_datadist_engine.kv_transfer.pull_cache(
Thanks for reporting. Indeed, I haven't considered prefix_cache, and I don't know whether it's compatible.
# SPDX-License-Identifier: Apache-2.0
Can you add a quick README in the disaggregated-prefill-v1 folder to describe how to use the example, i.e. run bash disaggregated_prefill_multi_prefill.sh and then xxxx?
if self.role == llm_datadist.LLMRole.PROMPT:
    options["llm.listenIpInfo"] = f"{self.local_device_ip}:26000"
self.datadist_engine.init(options)
self.kv_transfer = self.datadist_engine.kv_cache_manager
AFAIK, we're working on replacing kv_cache_manager with cache_manager, @zzzzwwjj
> AFAIK, we're working on replacing kv_cache_manager with cache_manager, @zzzzwwjj

@wangxiyuan Hi, I am playing around with cache_manager right now, but I am having difficulties running the demo in https://gitee.com/ascend/samples/blob/master/python/level1_single_api/10_llm_data_dist/cache_manager_api_samples/pull_blocks_sample.py with the error below:
llm_datadist.status.LLMException: [link] failed, error code is LLMStatusCode.LLM_LINK_FAILED, {1: 0, 2: 1}.
I am wondering if this class is available in a single-node case. In addition, I added the environment variable as the documentation stated here.
axis=-2)

# Release reference count
self.llm_datadist_engine.kv_transfer.deallocate_cache(kv_buffer)
Not very familiar with llmdatadist: why deallocate_cache once you scatter_update the KV cache into datadist_kv_cache? The NPU tensor you allocated seems to be based on this kv_buffer's address.
A newly allocated KV buffer has two references: one by the Cache ID, which is released by deallocate_cache; the other by the Cache Key, automatically released when a pull cache request is received. This deallocate_cache call releases the Cache ID reference, ensuring the KV buffer is automatically freed when a pull cache request arrives.
OK, sounds reasonable. That means the life cycle is maintained entirely by llmdatadist once we allocate a cache with a CacheKey, right?
Yep
Got it, thanks for the explanation.
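Summarizing the lifecycle discussed in this thread as a sketch (producer side; `cache_desc` and the IDs are placeholders, and the allocate_cache signature is approximated rather than taken from the llm_datadist docs):

```python
import llm_datadist


def produce_kv(kv_transfer, cache_desc, remote_cluster_id, datadist_request_id):
    # A newly allocated buffer carries two references: the Cache ID
    # returned by allocate_cache, and the CacheKey registered for the
    # consumer's pull_cache().
    cache_key = llm_datadist.CacheKey(remote_cluster_id, datadist_request_id, 1)
    kv_buffer = kv_transfer.allocate_cache(cache_desc, [cache_key])
    # ... scatter-update the computed KV into kv_buffer ...
    # Drop the Cache ID reference; the buffer is then freed automatically
    # once the decode side's pull_cache() consumes the CacheKey reference.
    kv_transfer.deallocate_cache(kv_buffer)
```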
@wangxiyuan @Yikun Sorry, I've been swamped lately and haven't had time to merge the PR yet. The vLLM connector API changed. Let's merge this PR first, then I'll submit a new one for the new API as soon as possible.
This PR implements the connector functionality for NPU based on LLMDataDist, building upon the connector API merged in vLLM v1. (vllm-project/vllm#15960) We've successfully tested various scenarios in offline environments:
Key implementation aspects include:
Cross-machine PD: LLMDataDist requires NPU device IP for connection establishment. Our approach utilizes a global rank table (JSON) on each machine containing:
nPmD: Given that the community's nPmD design, particularly the router component API, is still evolving, we've implemented a solution using a meta server component (to be provided separately) that:
We propose initially merging the 1P1D implementation, where the global rank table contains information for two nodes, allowing direct prefill node identification. The nPmD implementation can be refined and merged following community discussion.
Todo:
re #448
Note:
A minor modification to vLLM's codebase is required to run this example successfully. The patch enables the scheduler process to locate the appropriate connector class by importing the necessary module.
The change should be made in vllm/v1/core/sched/scheduler.py, adding an import statement for vllm_ascend.distributed. This is a temporary solution, and we need to implement a more elegant module discovery mechanism.
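Concretely, the temporary patch amounts to a single import, e.g.:

```python
# vllm/v1/core/sched/scheduler.py (temporary patch; exact placement may vary)
# Importing the module lets the scheduler process locate the
# vllm-ascend connector class.
import vllm_ascend.distributed  # noqa: F401
```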
Limits:
We use a hash function (string_to_int64_hash) to convert request IDs to datadist request IDs. This conversion is lossy and can produce duplicate IDs, leading to duplicate CacheKeys and allocate_cache failures.
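For illustration, any mapping from arbitrary strings into a fixed 64-bit space must collide for some inputs; a hypothetical stand-in for string_to_int64_hash (not necessarily the actual implementation):

```python
import hashlib


def string_to_int64_hash_sketch(request_id: str) -> int:
    """Hypothetical stand-in: fold a SHA-256 digest down to 64 bits.

    By the pigeonhole principle, two distinct request IDs can map to
    the same int64, which is what produces the duplicate CacheKeys and
    allocate_cache failures described above.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little", signed=True)
```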