Hi team, thanks for this amazing project! I have successfully trained a lightweight LSTM speculator for the text-only Qwen2.5 models and integrated it with Arctic Inference, and the speedup it achieves is great!
Now I’d like to achieve similar acceleration for Qwen2.5-VL-32B. After briefly scanning the VLM code, I see that:
- The vision encoder outputs a sequence of image tokens that are concatenated with the text tokens before being fed to the LLM backbone (a minimal sketch of my understanding is below).
- The existing speculative-decoding utilities seem to assume pure text input.
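Here is how I currently picture the input assembly; the function and argument names (`merge_vision_and_text`, `embed_tokens`, `image_token_id`, etc.) are my own placeholders, not Qwen2.5-VL's or Arctic Inference's actual API:

```python
import torch

def merge_vision_and_text(input_ids, image_features, embed_tokens, image_token_id):
    """Splice vision-encoder features into the text embedding sequence.

    input_ids:      (seq_len,) token ids containing image placeholder ids
    image_features: (num_image_tokens, hidden_size) output of the vision encoder
    embed_tokens:   the LLM backbone's token embedding layer
    """
    inputs_embeds = embed_tokens(input_ids).clone()        # (seq_len, hidden_size)
    image_mask = input_ids == image_token_id               # positions reserved for image tokens
    assert image_mask.sum().item() == image_features.shape[0]
    inputs_embeds[image_mask] = image_features.to(inputs_embeds.dtype)
    return inputs_embeds                                   # fed to the LLM backbone
```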
Questions
- Is it technically feasible to extend the current LSTM speculative decoder to handle the interleaved “image + text” token stream of Qwen2.5-VL-32B?
- If yes, how should I train such an LSTM speculator? (A rough sketch of what I have in mind is at the end of this post.)
If the above is not recommended, would the maintainers suggest:
- Using a small Qwen2.5-VL model (e.g., 3B) as the draft model instead, or some other approach?
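For reference, here is the rough sketch I mentioned above. The class and the training recipe are my assumptions about how the text-only LSTM speculator could carry over to the VLM, not Arctic Inference's actual implementation:

```python
import torch
import torch.nn as nn

class LSTMSpeculatorSketch(nn.Module):
    """Hypothetical draft head that consumes only the backbone's hidden states,
    so image positions are just more hidden states as far as the LSTM is concerned."""

    def __init__(self, hidden_size: int, vocab_size: int, num_layers: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, backbone_hidden_states: torch.Tensor) -> torch.Tensor:
        # backbone_hidden_states: (batch, seq_len, hidden_size) from the multimodal backbone
        out, _ = self.lstm(backbone_hidden_states)
        return self.lm_head(out)  # draft logits over the next tokens

# Training idea (again, just my assumption): run Qwen2.5-VL-32B over image+text
# data, cache its last hidden states and next-token targets, then fit the LSTM
# head with plain cross-entropy, the same way I trained it for the text-only models.
```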