[Feature] Support the v1 connector API #605

Open

jianzs opened this issue Apr 22, 2025 · 4 comments

@jianzs
Collaborator

jianzs commented Apr 22, 2025

🚀 The feature, motivation and pitch

The vLLM community has merged the v1 Connector API (vllm-project/vllm#15960). However, it is currently not functional in the Ascend environment. The issue lies in the Attention module, where there is a flag use_direct_call. When this flag is set to True, execution takes a branch that skips the two per-layer KV cache connector APIs, making it impossible to implement layer-wise KV transfer. These two APIs are only called within the unified attention functions.
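
To make the problem concrete, here is a simplified sketch (not vLLM's actual code) of how the attention forward dispatch behaves. The hook names wait_for_layer_load and save_kv_layer follow the v1 connector API, but everything else is condensed for illustration:

```python
# Simplified sketch, not vLLM's actual code: why use_direct_call=True
# bypasses the per-layer connector hooks.
import torch


class Attention(torch.nn.Module):
    def forward(self, query, key, value):
        if self.use_direct_call:
            # Direct path: the backend implementation runs immediately.
            # The per-layer connector hooks (wait_for_layer_load /
            # save_kv_layer) live only inside the unified attention op,
            # so layer-wise KV transfer never happens on this path.
            return self.impl.forward(self, query, key, value, self.kv_cache)
        # Unified path: the registered custom op wraps the backend call
        # with the connector's per-layer load/save hooks.
        return torch.ops.vllm.unified_attention(query, key, value,
                                                self.layer_name)
```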

The use_direct_call is determined by the following condition in the source code:

https://github.com/vllm-project/vllm/blob/1311913f5537b36a7b12f481ebd15f7ad775db58/vllm/attention/layer.py#L140-L145
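
Paraphrased, that check amounts to roughly the following (the exact code may differ between vLLM versions):

```python
# Inside Attention.__init__ (paraphrased): direct calls are used whenever
# the current platform is neither CUDA-like nor CPU.
self.use_direct_call = (not current_platform.is_cuda_alike()
                        and not current_platform.is_cpu())
```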

Since the Ascend platform is considered neither CUDA nor CPU, use_direct_call is set to True, and execution takes a branch that the v1 connector cannot hook into.

Has anyone encountered similar issues when working with graph mode on the Ascend platform? What would be the recommended solution?

This appears to be a common issue when using the v1 connector API on custom platforms, suggesting it might be an inherent issue within vLLM. However, I haven't been able to identify a suitable solution. For more context about this issue, please refer to this PR in the vLLM community: vllm-project/vllm#16921

Alternatives

No response

Additional context

No response

@wangxiyuan
Collaborator

After PR #426, the function unified_ascend_attention_with_output was added to vllm-ascend. I think we can find a way to reuse it for the V1 connector as well.
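
For illustration, here is a rough sketch of where the per-layer connector hooks could be inserted in that op; the signature and helper names below are assumptions, not the actual vllm-ascend code:

```python
# Hypothetical sketch of adding the v1 connector's per-layer hooks to
# vllm-ascend's unified attention op; signature and helpers are assumed.
import torch


def unified_ascend_attention_with_output(query: torch.Tensor,
                                         key: torch.Tensor,
                                         value: torch.Tensor,
                                         output: torch.Tensor,
                                         layer_name: str) -> None:
    forward_context = get_forward_context()      # assumed vLLM helper
    attn_layer = forward_context.no_compile_layers[layer_name]
    connector = get_kv_connector()                # hypothetical accessor

    if connector is not None:
        # Decode instance: block until this layer's KV blocks have arrived.
        connector.wait_for_layer_load(layer_name)

    attn_layer.impl.forward(attn_layer, query, key, value,
                            attn_layer.kv_cache, output=output)

    if connector is not None:
        # Prefill instance: start transferring this layer's KV blocks.
        connector.save_kv_layer(layer_name, attn_layer.kv_cache)
```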

@jianzs
Collaborator Author

jianzs commented Apr 27, 2025

After PR #426, the function unified_ascend_attention_with_output was added to vllm-ascend. I think we can find a way to reuse it for the V1 connector as well.

Got it. After merging #684 (request-wise KV transfer implementation), we can work on the layer-wise version.

@mjp9527

mjp9527 commented May 12, 2025

Great job! I've tested PR #684 with DeepSeek-V2 and observed the following two issues:

[Screenshot: 1 instance, tp1, on 910B]

[Screenshot: 2 instances, 1P1D, tp1 + tp1, on 910B]

a. A significant increase in TTFT
b. Crashes under high concurrency (e.g. 256) with PD separation

Additionally, I noticed that the current implementation relies on a synchronous interface. I'm wondering whether there is any progress on supporting layer-wise asynchronous communication and prefix caching.

@jianzs
Collaborator Author

jianzs commented May 12, 2025

@mjp9527 Thanks! Looks like a significant performance issue. Let's work together to resolve it.
