🚀 The feature, motivation and pitch
Hi, I want to run the latest embedding models, e.g. Qwen/Qwen3-Embedding-0.6B, on TPU nodes. I found that although vLLM has TPU support, it does not really support embedding models, since the only attention backend available on TPU is Pallas, which is decoder-only:
vllm/vllm/v1/attention/backends/pallas.py, lines 164 to 168 at 99b4f08
Meanwhile, Qwen3 Embedding is encoder-only, so it can't run on TPU:
vllm/vllm/model_executor/models/qwen3.py, lines 166 to 173 at 99b4f08
It would be nice if Qwen3 Embedding were supported on TPU.
Alternatives
I am currently running Qwen3 Embedding via transformers, but it is not as performant as vLLM.
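For context on what an embedding backend needs to produce: Qwen3 Embedding's model card recommends last-token pooling over the final hidden states, followed by L2 normalization. Below is a minimal, self-contained sketch of that pooling step using NumPy — the function name and the toy tensors are hypothetical, and this is illustrative only, not vLLM's or transformers' actual implementation.

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick each sequence's final non-padding hidden state and L2-normalize it.

    hidden_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    # Index of the last real token in each sequence (right padding assumed).
    last_idx = attention_mask.sum(axis=1) - 1            # shape: (batch,)
    batch_idx = np.arange(hidden_states.shape[0])
    emb = hidden_states[batch_idx, last_idx]             # shape: (batch, hidden_dim)
    # Normalize so downstream cosine similarity is a plain dot product.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Toy batch: 2 sequences, 4 positions, hidden size 3 (made-up values).
hidden = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # last real token at position 2
                 [1, 1, 1, 1]])  # last real token at position 3
embeddings = last_token_pool(hidden, mask)
print(embeddings.shape)  # (2, 3)
```

Note this is the pooling convention for decoder-style embedding models; the TPU attention gap described above is separate from and comes before this step.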
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.