How to pass vLLM inference server error messages to GUI / open-webui users? #10857
c-hoffmann asked this in Q&A
I run vLLM and open-webui, both installed via pip rather than Docker.
My issue is that vLLM error messages are not passed through to the open-webui user as expected. An easy way to reproduce this is to have a conversation that exceeds max_model_len (the context length). Here is my vLLM output when that happens:
And here's what open-webui shows:

How can I surface this error in the GUI more appropriately? Perhaps in this case by telling the user to start a new conversation?
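For illustration, here is a minimal sketch of the behaviour I am after, bypassing open-webui and calling the OpenAI-compatible endpoint directly with the openai Python client. The base URL, API key and model name are placeholders rather than my exact setup, and I am assuming vLLM rejects an over-long prompt with an HTTP 400 whose body carries the error text:

```python
# Minimal sketch: call vLLM's OpenAI-compatible endpoint directly and show
# how the server's error text could be surfaced to the user instead of a
# generic failure. Base URL, API key and model name are placeholders.
from openai import OpenAI, BadRequestError

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def chat(messages):
    try:
        resp = client.chat.completions.create(
            model="my-model",  # placeholder model name
            messages=messages,
        )
        return resp.choices[0].message.content
    except BadRequestError as e:
        # Assumption: vLLM answers an over-long prompt with HTTP 400 and a
        # message mentioning max_model_len. This is the text I would like
        # the GUI to show, ideally with a hint to start a new conversation.
        return f"Request rejected by the model server: {e}"

if __name__ == "__main__":
    # Deliberately oversized prompt to trigger the context-length error.
    print(chat([{"role": "user", "content": "word " * 200_000}]))
```

Something along these lines is what I would like open-webui to do on my behalf, rather than swallowing the error.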