how does vllm handle wrong tokens in speculative decoding? #4284
-
Before the engine appends token ids to sequences, it removes the -1 tokens. The logic is here:
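For illustration, here is a minimal sketch of that filtering step (hypothetical names, not the actual vLLM code): -1 placeholders mark rejected speculative positions and are dropped before the remaining token ids are appended to the sequence.

```python
REJECTED_TOKEN_ID = -1  # placeholder padding rejected draft positions (illustrative)

def append_accepted_tokens(sequence_token_ids: list[int],
                           new_token_ids: list[int]) -> list[int]:
    """Append only the accepted tokens, skipping -1 placeholders."""
    accepted = [t for t in new_token_ids if t != REJECTED_TOKEN_ID]
    sequence_token_ids.extend(accepted)
    return sequence_token_ids

# Example: 5 tokens were proposed for this sequence, the last two were rejected.
seq = [101, 7592]                        # tokens already in the sequence
new = [2023, 2003, 1037, -1, -1]         # per-step output padded with -1
print(append_accepted_tokens(seq, new))  # -> [101, 7592, 2023, 2003, 1037]
```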
-
These lines suggest only the last token id would be appended to "input_ids". For example, in the last sentence, I don't understand how the target model computes the key & value for tokens 1, 2, 3.
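As a side note on the key/value question: in speculative decoding generally, the target model scores all draft tokens in a single forward pass, and that pass is what writes their key/value entries into the cache, independent of which token ids are appended afterwards. The toy attention layer below is purely illustrative (not vLLM code; it ignores causal masking and uses random vectors instead of real token embeddings) and only shows that every position fed through a forward pass gets cached.

```python
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    """Single-head attention with an explicit KV cache, ignoring causal
    masking, just to show that every token passed through forward() has
    its key/value written to the cache."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.cache_k = []
        self.cache_v = []

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_new_tokens, dim) -- may contain several tokens at once.
        self.cache_k.append(self.k(x))  # K/V cached for EVERY new position
        self.cache_v.append(self.v(x))
        K = torch.cat(self.cache_k)
        V = torch.cat(self.cache_v)
        scores = self.q(x) @ K.T / K.shape[-1] ** 0.5
        return torch.softmax(scores, dim=-1) @ V

layer = ToyAttention()
context = torch.randn(4, 8)  # already-processed context tokens
drafts = torch.randn(3, 8)   # draft tokens 1, 2, 3 proposed by the small model

layer(context)  # prefill: caches K/V for the context
layer(drafts)   # verification pass: caches K/V for all 3 draft positions at once
print(torch.cat(layer.cache_k).shape[0])  # 7 -> K/V exist for every draft token
```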
-
The core problem is that the accepted tokens differ across the sentences of the same batch.
I searched a lot about batched speculative decoding and found that most inference frameworks sidestep this problem by simply setting batch size == 1, except vLLM. I read the code in vllm/spec_decode, and it seems vLLM fills all wrong tokens with -1. I am still unclear about a few points; thanks.
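For concreteness, here is a minimal sketch of the padding scheme being described (hypothetical values and shapes, not vLLM's actual API): each step yields a fixed-width block of k+1 token slots per sequence, and positions after the rejection point are filled with -1.

```python
import torch

REJECTED = -1
k = 4  # number of draft tokens proposed per sequence (assumed for this example)

# Suppose a batch of 3 sequences accepted 4+bonus, 1, and 2 tokens respectively.
accepted_per_seq = [
    [11, 12, 13, 14, 99],   # all 4 drafts accepted + a bonus token
    [21],                   # only the first draft token accepted
    [31, 32],               # first two accepted
]

output = torch.full((len(accepted_per_seq), k + 1), REJECTED, dtype=torch.long)
for i, toks in enumerate(accepted_per_seq):
    output[i, :len(toks)] = torch.tensor(toks)

print(output)
# tensor([[11, 12, 13, 14, 99],
#         [21, -1, -1, -1, -1],
#         [31, 32, -1, -1, -1]])
# Downstream, the -1 entries are stripped before token ids are appended to each
# sequence (the removal step mentioned earlier in the thread).
```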