Replies: 3 comments 4 replies
-
I tried v0.4.2 to v0.6.2; all of them have the same problem.
-
I changed self.disable_logprobs = True to self.disable_logprobs = False in TargetModelRunner and got:
[[SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=50006, logprobs={50006: Logprob(logprob=0.0, rank=1, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=torch.Size([1, 115584]), sampled_token_ids=[[26888]], spec_decode_worker_metrics=None)]]
Maybe there is something wrong in the _sample_with_torch function in sampler.py.
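For context on why that dump looks odd: a rank-1 token should normally have a logprob strictly below 0.0 unless the softmax has collapsed essentially all probability mass onto that one token. A standalone PyTorch sketch (not vLLM internals; the vocab size is just taken from the dump above) illustrating that relationship:

```python
# Standalone PyTorch sketch (not vLLM code): for greedy sampling the rank-1
# token is the argmax of the logits, and its log-probability is 0.0 only
# when the distribution puts essentially all mass on that single token.
import torch

torch.manual_seed(0)
logits = torch.randn(1, 115584)              # vocab size taken from the dump above

logprobs = torch.log_softmax(logits, dim=-1)
top_logprob, top_id = logprobs.max(dim=-1)   # rank-1 token and its logprob

print("rank-1 token id:", top_id.item())
print("rank-1 logprob:", top_logprob.item())          # normally < 0.0
print("probability mass on it:", top_logprob.exp().item())
```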
-
But still, I have another case that is not caused by the sop_token_id.
-
In my case, ngram speculative decoding gives a different result than running without speculative decoding, and even different results for different --ngram-prompt-lookup-min values (1 and 2).
I just run the same model with different vLLM args, like:
python -m vllm.entrypoints.api_server --host 127.0.0.1 --trust-remote-code --disable-custom-all-reduce --use-v2-block-manager --enable-prefix-caching --max-model-len 3072 --gpu-memory-utilization 0.7 --model /path/to/model/ --port 9122
and
python -m vllm.entrypoints.api_server --host 127.0.0.1 --trust-remote-code --disable-custom-all-reduce --use-v2-block-manager --enable-prefix-caching --max-model-len 3072 --gpu-memory-utilization 0.7 --model /path/to/model/ --port 9122 --speculative-model [ngram] --num-speculative-tokens 6 --ngram-prompt-lookup-max 5 --ngram-prompt-lookup-min 2
Should it give the same result when I use the same query with "temperature" == 0?
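A minimal sketch of how one could diff the two setups, assuming the demo api_server's /generate endpoint (which accepts a prompt plus sampling params) and assuming the speculative server is moved to a different port (9123 here is hypothetical; both commands above use 9122):

```python
# Minimal sketch (untested): send the same greedy query to two vLLM
# api_server instances and compare the outputs. The second port is an
# assumption -- both commands above bind 9122, so one must be changed.
import requests

PROMPT = "Write a short poem about the sea."
SERVERS = {
    "baseline": "http://127.0.0.1:9122/generate",    # no speculative decoding
    "ngram_spec": "http://127.0.0.1:9123/generate",  # --speculative-model [ngram]
}

def generate(url: str, prompt: str) -> str:
    payload = {
        "prompt": prompt,
        "temperature": 0,   # greedy decoding
        "max_tokens": 256,
    }
    resp = requests.post(url, json=payload, timeout=120)
    resp.raise_for_status()
    # The demo api_server returns {"text": ["<prompt + completion>", ...]}
    return resp.json()["text"][0]

outputs = {name: generate(url, PROMPT) for name, url in SERVERS.items()}
for name, text in outputs.items():
    print(f"--- {name} ---\n{text}\n")
print("identical:", outputs["baseline"] == outputs["ngram_spec"])
```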