Description
🚀 The feature, motivation and pitch
The vLLM community has merged the v1 Connector API (vllm-project/vllm#15960). However, it is currently not functional in the Ascend environment. The issue lies in the Attention module, where a flag named `use_direct_call` controls dispatch: when this flag is set to True, the execution branch skips the two layer-wise KV cache APIs, making it impossible to implement layer-wise KV transfer. These two APIs are only called within the unified attention functions, as shown in the simplified sketch below.
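For reference, my understanding is that the two connector hooks sit inside the unified attention custom op roughly like this (simplified from `vllm/attention/layer.py` at the time of writing; exact names and signatures may differ between versions):

```python
import torch

# Names taken from vllm/attention/layer.py in current main; may have moved.
from vllm.attention.layer import (maybe_save_kv_layer_to_connector,
                                  wait_for_kv_layer_from_connector)
from vllm.forward_context import get_forward_context


def unified_attention(query: torch.Tensor, key: torch.Tensor,
                      value: torch.Tensor, layer_name: str) -> torch.Tensor:
    # The layer-wise KV connector hooks are only invoked on this path:
    wait_for_kv_layer_from_connector(layer_name)  # wait for/recv this layer's KV

    ctx = get_forward_context()
    self = ctx.no_compile_layers[layer_name]
    kv_cache = self.kv_cache[ctx.virtual_engine]
    output = self.impl.forward(self, query, key, value, kv_cache,
                               ctx.attn_metadata)

    maybe_save_kv_layer_to_connector(layer_name, kv_cache)  # save/send this layer's KV
    return output
```

When `use_direct_call` is True, `Attention.forward` calls the attention implementation directly instead of going through this custom op, so neither hook ever runs.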
The value of `use_direct_call` is determined by the following condition in the source code:
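(Paraphrased below from `Attention.__init__` in `vllm/attention/layer.py`; the exact wording may differ between vLLM versions.)

```python
from vllm.platforms import current_platform

# Fall back to calling the attention impl directly unless the
# platform is CUDA-like (CUDA/ROCm) or CPU.
self.use_direct_call = (not current_platform.is_cuda_alike()
                        and not current_platform.is_cpu())
```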
Since the Ascend platform is considered neither CUDA-like nor CPU, `use_direct_call` is set to True, and execution falls into a branch that the connector API does not support.
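One direction I have considered is to widen the check so that Ascend also takes the custom-op path. A minimal, untested sketch of the idea (`is_npu()` is hypothetical here, standing in for whatever platform predicate would be appropriate):

```python
from vllm.platforms import current_platform

# Untested sketch: also route NPU through the unified-attention custom op
# so the connector hooks run. is_npu() is a hypothetical predicate, not an
# existing vLLM API.
self.use_direct_call = not (current_platform.is_cuda_alike()
                            or current_platform.is_cpu()
                            or current_platform.is_npu())
```

I am not sure this is the right fix, though, since the direct-call path presumably exists for a reason (e.g. graph-mode constraints on non-CUDA platforms).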
Has anyone encountered similar issues when working with graph mode on the Ascend platform? What would be the recommended solution?
This appears to be a common issue for custom (out-of-tree) platforms using the v1 connector API, which suggests the root cause may lie in vLLM itself rather than in the Ascend integration. However, I have not been able to identify a suitable solution. For more context, please refer to this PR in the vLLM community: vllm-project/vllm#16921
Alternatives
No response
Additional context
No response