
[Feature] Support the v1 connector API #605

Open
@jianzs

Description

🚀 The feature, motivation and pitch

The vLLM community has merged the v1 Connector API (vllm-project/vllm#15960). However, it is currently not functional in the Ascend environment. The issue lies in the Attention module, where a flag named use_direct_call controls the execution path. When this flag is set to True, the chosen branch skips the two layer-wise KV cache APIs, making layer-wise KV transfer impossible; those two APIs are only called inside the unified attention functions.
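
To make the problem concrete, here is a minimal, self-contained sketch of the control flow I am describing. The hook names wait_for_kv_layer_from_connector and save_kv_layer_to_connector are illustrative stand-ins for the layer-wise v1 connector APIs, not verbatim vLLM code:

```python
# Simplified, self-contained sketch of the control flow described above.
# Hook names are illustrative stand-ins for the layer-wise v1 connector APIs.

def wait_for_kv_layer_from_connector(layer_name: str) -> None:
    """Layer-wise hook: block until this layer's KV cache has been loaded."""
    print(f"[connector] load KV for {layer_name}")

def save_kv_layer_to_connector(layer_name: str) -> None:
    """Layer-wise hook: hand this layer's KV cache to the connector."""
    print(f"[connector] save KV for {layer_name}")

class AttentionSketch:
    def __init__(self, layer_name: str, use_direct_call: bool):
        self.layer_name = layer_name
        self.use_direct_call = use_direct_call

    def forward(self, query, key, value):
        if self.use_direct_call:
            # Direct-call branch (what Ascend currently hits): the attention
            # implementation is invoked directly, so the layer-wise connector
            # hooks are never reached and layer-wise KV transfer cannot happen.
            return self._impl_forward(query, key, value)
        # Unified-attention branch: the layer-wise connector hooks wrap the
        # attention computation, enabling layer-wise KV load/save.
        wait_for_kv_layer_from_connector(self.layer_name)
        out = self._impl_forward(query, key, value)
        save_kv_layer_to_connector(self.layer_name)
        return out

    def _impl_forward(self, query, key, value):
        return query  # placeholder for the real attention kernel

if __name__ == "__main__":
    AttentionSketch("layer.0", use_direct_call=True).forward(1, 2, 3)   # hooks never fire
    AttentionSketch("layer.0", use_direct_call=False).forward(1, 2, 3)  # hooks fire
```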

The value of use_direct_call is determined by the following condition in the source code:

https://github.com/vllm-project/vllm/blob/1311913f5537b36a7b12f481ebd15f7ad775db58/vllm/attention/layer.py#L140-L145
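
Paraphrasing the linked lines (not a verbatim quote), the decision comes down to whether the current platform reports itself as CUDA-like or CPU. This tiny standalone snippet shows the resulting values:

```python
# Paraphrase (not verbatim) of the use_direct_call decision in
# vllm/attention/layer.py: direct call is chosen whenever the platform
# is neither CUDA-like nor CPU.
def compute_use_direct_call(is_cuda_alike: bool, is_cpu: bool) -> bool:
    return not is_cuda_alike and not is_cpu

print(compute_use_direct_call(is_cuda_alike=False, is_cpu=False))  # Ascend -> True
print(compute_use_direct_call(is_cuda_alike=True, is_cpu=False))   # CUDA   -> False
print(compute_use_direct_call(is_cuda_alike=False, is_cpu=True))   # CPU    -> False
```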

Since the Ascend platform is considered neither CUDA nor CPU, use_direct_call is set to True, which routes execution down the unsupported branch.

Has anyone encountered similar issues when working with graph mode on the Ascend platform? What would be the recommended solution?

This appears to be a common problem when using the v1 connector API on custom platforms, which suggests it may be an inherent limitation within vLLM itself. However, I haven't been able to identify a suitable solution. For more context, please refer to this PR in the vLLM community: vllm-project/vllm#16921

Alternatives

No response

Additional context

No response
