Commit d79aed6

Optimize V1 FlashInfer backend to use CPU host buffers
- Replace GPU-to-CPU transfers with direct CPU tensor construction
- Build planning tensors from existing CommonAttentionMetadata CPU buffers
- Reduce from 6x to 1x .cpu() calls during FlashInfer planning
- Fix test mocks to handle the correct argument count
- Maintain compatibility with GPUModelRunner and the FlashInfer V1 backend

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
1 parent ad021d5 commit d79aed6
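The core of the change, as a minimal sketch (the field names query_start_loc_cpu and seq_lens_cpu are assumptions about CommonAttentionMetadata's host-side buffers, not code from the commit): each .cpu() call is a device-to-host copy that synchronizes with the GPU stream, so building FlashInfer's planning inputs from buffers that already live on the CPU avoids repeated stalls.

```python
import torch

def plan_inputs_old(query_start_loc: torch.Tensor, seq_lens: torch.Tensor):
    # Before: every .cpu() is a device-to-host copy and a sync point;
    # per the commit message, planning performed six of these.
    qo_indptr = query_start_loc.cpu()
    seq_lens_host = seq_lens.cpu()
    return qo_indptr, seq_lens_host

def plan_inputs_new(common_metadata):
    # After: reuse host buffers the runner already maintains on the CPU
    # (attribute names are illustrative), so planning needs at most one
    # device-to-host transfer per step.
    qo_indptr = common_metadata.query_start_loc_cpu
    seq_lens_host = common_metadata.seq_lens_cpu
    return qo_indptr, seq_lens_host
```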

1 file changed (+1, -1)


tests/v1/attention/test_attention_backends.py

Lines changed: 1 addition & 1 deletion
@@ -212,7 +212,7 @@ def run_attention_backend(backend: _Backend, kv_cache_spec: FullAttentionSpec,
 
     from vllm.v1.attention.backends.flashinfer import PerLayerParameters
 
-    def mock_get_per_layer_parameters(vllm_config):
+    def mock_get_per_layer_parameters(vllm_config, impl_cls):
         # Return mock parameters for a single layer
         head_size = vllm_config.model_config.get_head_size()
         return {
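To match the new call-site arity, any test that stubs get_per_layer_parameters must accept the extra impl_cls argument. A hypothetical patch wiring is sketched below; the patch target string and the return shape are assumptions for illustration, not taken from this commit.

```python
from unittest import mock

def mock_get_per_layer_parameters(vllm_config, impl_cls):
    # impl_cls (the attention implementation class) is accepted but
    # unused here; the mock only has to match the caller's signature.
    head_size = vllm_config.model_config.get_head_size()
    return {"mock_layer": {"head_size": head_size}}

# Hypothetical patch target; the real test may monkeypatch differently.
with mock.patch(
        "vllm.v1.attention.backends.flashinfer.get_per_layer_parameters",
        new=mock_get_per_layer_parameters):
    ...  # exercise the FlashInfer planning path under test
```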
