Implementing Logit Functionality in vLLM #1587
bardia-mhd asked in Q&A
Hello,
I have been working with the vLLM project and am interested in implementing a feature to speed up inference. Specifically, I want to obtain probabilities for the output tokens, similar to a discussion I found on Hugging Face (you can find the discussion here).
However, I noticed that vLLM does not provide logits for the prompt tokens; it only returns the logits of the generated tokens. Here is the specific part of the code I am referring to.
I am seeking guidance on how I could modify or use the vLLM source code to get this functionality. Any help or direction would be greatly appreciated.
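As a possible starting point (not a confirmed answer), some vLLM versions expose `logprobs` and `prompt_logprobs` on `SamplingParams`, which return log-probabilities (rather than raw logits) for the generated tokens and the prompt tokens respectively. Below is a minimal sketch assuming a vLLM build where these parameters are available; the model name is only illustrative:

```python
# Minimal sketch: requesting per-token log-probabilities from vLLM.
# Assumes a vLLM version where SamplingParams accepts `logprobs` and
# `prompt_logprobs`; the model choice below is just an example.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

params = SamplingParams(
    max_tokens=16,
    logprobs=5,         # top-5 log-probabilities for each generated token
    prompt_logprobs=5,  # top-5 log-probabilities for each prompt token
)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    # Log-probabilities over the prompt tokens (the first entry may be None,
    # since there is no preceding context for the first token).
    print(out.prompt_logprobs)
    # Log-probabilities for each generated token.
    print(out.outputs[0].logprobs)
```

If raw logits (before the softmax) are required rather than log-probabilities, that would still need a change inside vLLM's sampling code, since only log-probabilities are surfaced through this interface.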