[Feature] Support the v1 connector API #605

Open

jianzs opened this issue Apr 22, 2025 · 4 comments

@jianzs
Collaborator

jianzs commented Apr 22, 2025

🚀 The feature, motivation and pitch

The vLLM community has merged the v1 Connector API (vllm-project/vllm#15960). However, it is currently not functional in the Ascend environment. The issue lies in the Attention module, where there is a flag use_direct_call. When this flag is set to True, execution takes a branch that skips the two per-layer KV cache connector APIs, making it impossible to implement layer-wise KV transfer. These two APIs are only called within the unified attention functions.
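
To make the problem concrete, here is a simplified sketch (not vLLM's actual code) of how the attention forward dispatch behaves. The hook names wait_for_layer_load and save_kv_layer follow the v1 connector API, but everything else is condensed for illustration:

```python
# Simplified sketch, not vLLM's actual code: why use_direct_call=True
# bypasses the per-layer connector hooks.
import torch


class Attention(torch.nn.Module):
    def forward(self, query, key, value):
        if self.use_direct_call:
            # Direct path: the backend implementation runs immediately.
            # The per-layer connector hooks (wait_for_layer_load /
            # save_kv_layer) live only inside the unified attention op,
            # so layer-wise KV transfer never happens on this path.
            return self.impl.forward(self, query, key, value, self.kv_cache)
        # Unified path: the registered custom op wraps the backend call
        # with the connector's per-layer load/save hooks.
        return torch.ops.vllm.unified_attention(query, key, value,
                                                self.layer_name)
```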

The use_direct_call is determined by the following condition in the source code:

https://github.com/vllm-project/vllm/blob/1311913f5537b36a7b12f481ebd15f7ad775db58/vllm/attention/layer.py#L140-L145
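
Paraphrased, that check amounts to roughly the following (the exact code may differ between vLLM versions):

```python
# Inside Attention.__init__ (paraphrased): direct calls are used whenever
# the current platform is neither CUDA-like nor CPU.
self.use_direct_call = (not current_platform.is_cuda_alike()
                        and not current_platform.is_cpu())
```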

Since the Ascend platform is considered neither CUDA nor CPU, use_direct_call is set to True, and execution takes a branch that the v1 connector cannot hook into.

Has anyone encountered similar issues when working with graph mode on the Ascend platform? What would be the recommended solution?

This appears to be a common issue when using the v1 connector API on custom platforms, suggesting it might be an inherent issue within vLLM. However, I haven't been able to identify a suitable solution. For more context about this issue, please refer to this PR in the vLLM community: vllm-project/vllm#16921

Alternatives

No response

Additional context

No response

@wangxiyuan
Collaborator

After PR #426, the function unified_ascend_attention_with_output was added to vllm-ascend. I think we can find a way to reuse it for the V1 connector as well.
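
For illustration, here is a rough sketch of where the per-layer connector hooks could be inserted in that op; the signature and helper names below are assumptions, not the actual vllm-ascend code:

```python
# Hypothetical sketch of adding the v1 connector's per-layer hooks to
# vllm-ascend's unified attention op; signature and helpers are assumed.
import torch


def unified_ascend_attention_with_output(query: torch.Tensor,
                                         key: torch.Tensor,
                                         value: torch.Tensor,
                                         output: torch.Tensor,
                                         layer_name: str) -> None:
    forward_context = get_forward_context()      # assumed vLLM helper
    attn_layer = forward_context.no_compile_layers[layer_name]
    connector = get_kv_connector()                # hypothetical accessor

    if connector is not None:
        # Decode instance: block until this layer's KV blocks have arrived.
        connector.wait_for_layer_load(layer_name)

    attn_layer.impl.forward(attn_layer, query, key, value,
                            attn_layer.kv_cache, output=output)

    if connector is not None:
        # Prefill instance: start transferring this layer's KV blocks.
        connector.save_kv_layer(layer_name, attn_layer.kv_cache)
```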

@jianzs
Collaborator Author

jianzs commented Apr 27, 2025

After PR #426, the function unified_ascend_attention_with_output was added to vllm-ascend. I think we can find a way to reuse it for the V1 connector as well.

Got it. After merging #684 (request-wise KV transfer implementation), we can work on the layer-wise version.

@mjp9527

mjp9527 commented May 12, 2025

Great job! I've tested PR #684 with DeepSeek-V2 and observed the following two issues:

[Screenshot: 1 instance, tp1, on 910B]

[Screenshot: 2 instances, 1P1D, tp1 + tp1, on 910B]

a. A significant increase in TTFT
b. Crashes under high concurrency (e.g. 256) with PD separation

Additionally, I noticed that the current implementation relies on a synchronous interface. I'm wondering whether there is any progress on supporting layer-wise asynchronous communication and prefix caching.

@jianzs
Collaborator Author

jianzs commented May 12, 2025

@mjp9527 Thanks! Looks like a significant performance issue. Let's work together to resolve it.
