[Feature] Support the v1 connector API #605
Comments
After this PR #426 the function
Great job! I've tested PR 684 on DeepSeek-V2 (2 instances, 1P1D, TP1 + TP1, on 910B) and observed the following two issues: a. Significant increase in TTFT. Additionally, I noticed that the current implementation relies on a synchronous interface. I'm wondering if there is any progress on supporting layer-wise asynchronous communication and prefix caching.
@mjp9527 Thanks! Looks like a significant performance issue. Let's work together to resolve it.
🚀 The feature, motivation and pitch
The vLLM community has merged the v1 Connector API (vllm-project/vllm#15960), but it is currently not functional in the Ascend environment. The issue lies in the Attention module, where a flag `use_direct_call` controls the execution path. When this flag is set to True, the execution branch skips the two per-layer KV cache APIs, making it impossible to implement layer-wise KV transfer; these two APIs are only called within the unified attention functions.
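For illustration only, here is a simplified, standalone sketch of that control flow. The connector and its `wait_for_layer_load` / `save_kv_layer` hooks are stand-in names, not the exact vLLM v1 connector API symbols; the point is that the per-layer hooks only run on the unified-attention path, so the direct-call branch never gives the connector a chance to transfer KV cache layer by layer.

```python
# Simplified, standalone model of the control flow described above.
# The connector hook names below are illustrative stand-ins, not the
# exact vLLM v1 connector API symbols.

class DummyConnector:
    def wait_for_layer_load(self, layer_name: str) -> None:
        print(f"[connector] load KV for {layer_name}")

    def save_kv_layer(self, layer_name: str) -> None:
        print(f"[connector] save KV for {layer_name}")


class AttentionLayer:
    def __init__(self, layer_name: str, use_direct_call: bool, connector=None):
        self.layer_name = layer_name
        self.use_direct_call = use_direct_call
        self.connector = connector

    def _impl_forward(self) -> str:
        # Stands in for the backend attention kernel.
        return f"attn_out({self.layer_name})"

    def unified_attention(self) -> str:
        # The per-layer connector hooks only run on this path.
        if self.connector is not None:
            self.connector.wait_for_layer_load(self.layer_name)
        out = self._impl_forward()
        if self.connector is not None:
            self.connector.save_kv_layer(self.layer_name)
        return out

    def forward(self) -> str:
        if self.use_direct_call:
            # The direct-call branch skips the unified wrapper, so the
            # connector never sees this layer -> no layer-wise KV transfer.
            return self._impl_forward()
        return self.unified_attention()


if __name__ == "__main__":
    conn = DummyConnector()
    # Ascend currently ends up on the direct-call branch:
    print(AttentionLayer("layer.0", use_direct_call=True, connector=conn).forward())
    # CUDA/CPU go through the unified path and hit the hooks:
    print(AttentionLayer("layer.1", use_direct_call=False, connector=conn).forward())
```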
`use_direct_call` is determined by the following condition in the source code: https://github.com/vllm-project/vllm/blob/1311913f5537b36a7b12f481ebd15f7ad775db58/vllm/attention/layer.py#L140-L145
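Paraphrasing the linked lines (the exact expression may differ between vLLM versions), the flag is derived from the current platform roughly like this:

```python
from vllm.platforms import current_platform

# Rough paraphrase of the linked condition in vllm/attention/layer.py;
# the exact expression may differ between vLLM versions.
use_direct_call = (not current_platform.is_cuda_alike()
                   and not current_platform.is_cpu())
```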
Since the Ascend platform is considered neither CUDA nor CPU, `use_direct_call` is set to True, leading to an unsupported execution branch. Has anyone encountered similar issues when working with graph mode on the Ascend platform? What would be the recommended solution?
This appears to be a common issue when using the v1 connector API on custom platforms, suggesting it might be an inherent issue within vLLM. However, I haven't been able to identify a suitable solution. For more context about this issue, please refer to this PR in the vLLM community: vllm-project/vllm#16921
Alternatives
No response
Additional context
No response