Commit d79aed6

Optimize V1 FlashInfer backend to use CPU host buffers
- Replace GPU-to-CPU transfers with direct CPU tensor construction
- Build planning tensors from existing CommonAttentionMetadata CPU buffers
- Reduce from 6x to 1x .cpu() calls during FlashInfer planning
- Fix test mocks to handle the correct argument count
- Maintain compatibility with GPUModelRunner and the FlashInfer V1 backend

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
1 parent ad021d5 commit d79aed6
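The core of the change, as a minimal sketch (the field names query_start_loc_cpu and seq_lens_cpu are assumptions about CommonAttentionMetadata's host-side buffers, not code from the commit): each .cpu() call is a device-to-host copy that synchronizes with the GPU stream, so building FlashInfer's planning inputs from buffers that already live on the CPU avoids repeated stalls.

```python
import torch

def plan_inputs_old(query_start_loc: torch.Tensor, seq_lens: torch.Tensor):
    # Before: every .cpu() is a device-to-host copy and a sync point;
    # per the commit message, planning performed six of these.
    qo_indptr = query_start_loc.cpu()
    seq_lens_host = seq_lens.cpu()
    return qo_indptr, seq_lens_host

def plan_inputs_new(common_metadata):
    # After: reuse host buffers the runner already maintains on the CPU
    # (attribute names are illustrative), so planning needs at most one
    # device-to-host transfer per step.
    qo_indptr = common_metadata.query_start_loc_cpu
    seq_lens_host = common_metadata.seq_lens_cpu
    return qo_indptr, seq_lens_host
```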

1 file changed (+1, -1)


tests/v1/attention/test_attention_backends.py

Lines changed: 1 addition & 1 deletion
@@ -212,7 +212,7 @@ def run_attention_backend(backend: _Backend, kv_cache_spec: FullAttentionSpec,
 
     from vllm.v1.attention.backends.flashinfer import PerLayerParameters
 
-    def mock_get_per_layer_parameters(vllm_config):
+    def mock_get_per_layer_parameters(vllm_config, impl_cls):
         # Return mock parameters for a single layer
         head_size = vllm_config.model_config.get_head_size()
         return {
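To match the new call-site arity, any test that stubs get_per_layer_parameters must accept the extra impl_cls argument. A hypothetical patch wiring is sketched below; the patch target string and the return shape are assumptions for illustration, not taken from this commit.

```python
from unittest import mock

def mock_get_per_layer_parameters(vllm_config, impl_cls):
    # impl_cls (the attention implementation class) is accepted but
    # unused here; the mock only has to match the caller's signature.
    head_size = vllm_config.model_config.get_head_size()
    return {"mock_layer": {"head_size": head_size}}

# Hypothetical patch target; the real test may monkeypatch differently.
with mock.patch(
        "vllm.v1.attention.backends.flashinfer.get_per_layer_parameters",
        new=mock_get_per_layer_parameters):
    ...  # exercise the FlashInfer planning path under test
```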
