Replies: 4 comments
-
Hey @xKwan! I'm here to help with your issue. I can assist with bug fixes, answer questions, and guide you to become a contributor. Let's tackle this challenge together!

To resolve the KeyError: 'text', here's a potential fix for the complete method:

def complete(
    self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponse:
    kwargs = kwargs if kwargs else {}
    params = {**self._model_kwargs, **kwargs}
    # build sampling parameters
    sampling_params = dict(**params)
    sampling_params["prompt"] = prompt
    response = post_http_request(self.api_url, sampling_params, stream=False)
    output = get_response(response)
    # Ensure the response contains the 'text' key
    if 'text' not in output[0]:
        raise KeyError("The response does not contain the 'text' key")
    return CompletionResponse(text=output[0]['text'])

This code checks whether the 'text' key is present in the response before accessing it. Additionally, ensure that your VllmServer is set up correctly:

from llama_index.llms.vllm import VllmServer

def messages_to_prompt(messages):
    prompt = "\n".join([str(x) for x in messages])
    return f"<s>[INST] {prompt} [/INST] </s>\n"

def completion_to_prompt(completion):
    return f"<s>[INST] {completion} [/INST] </s>\n"

llm = VllmServer(
    api_url="http://localhost:8000/v1/completions",
    max_new_tokens=256,
    temperature=0.1,
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

llm.complete("What is a black hole?")

Ensure that the api_url points to your running vLLM server and that the endpoint is reachable.
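If the error persists, it can help to inspect the raw JSON the server returns before it reaches VllmServer. Below is a minimal sketch of such a check, assuming the OpenAI-compatible vLLM server started in the issue; the model path and prompt are placeholders.

import requests

# Query the vLLM server directly and print the raw JSON, so you can see whether
# generations arrive under a top-level "text" key or nested under "choices".
payload = {
    "model": "/path/to/my/local/model",  # placeholder: the model the server was started with
    "prompt": "What is a black hole?",
    "max_tokens": 64,
    "temperature": 0.1,
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload)
print(resp.json())

If the printed JSON nests generations under "choices" rather than a top-level "text" list, the parsing inside VllmServer will not find the 'text' key, which matches the error reported here.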
-
Did you find a solution? I have the same issue.
-
I followed your suggestion and still get the same error, raised at line 9 in get_response. My script imports VllmServer from llama_index.llms.vllm, defines the suggested complete, messages_to_prompt, and completion_to_prompt functions plus a main(), and runs it under if __name__ == '__main__':.
-
I tried to rewrite the function in the VllmServer class to solve this problem:
Output:
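For anyone attempting a similar workaround, a rewrite of the response parsing along these lines is one option. This is a minimal sketch, not the library's official code; it assumes the response is either the plain vllm.entrypoints.api_server shape (a top-level "text" list) or the OpenAI-compatible shape (a "choices" list).

import json
import requests

def parse_vllm_response(response: requests.Response) -> str:
    """Extract generated text from either vLLM server flavour."""
    data = json.loads(response.content)
    if "text" in data:
        # vllm.entrypoints.api_server returns {"text": ["..."]}
        return data["text"][0]
    if "choices" in data:
        # the OpenAI-compatible server returns {"choices": [{"text": "..."}]}
        return data["choices"][0]["text"]
    raise KeyError(f"Unexpected vLLM response, no 'text' or 'choices' key: {data!r}")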
-
Issue:
I want to serve an LLM application in production, so I am hosting the LLM with vLLM and connecting my documents to it with LlamaIndex. When I tried a sample inference, I got KeyError: 'text'.
Library versions used:
vllm: 0.4.0.post1
llama_index: 0.10.42
llama_index.llms.vllm: 0.1.7
Server Setup:
I installed vllm and started a vllm server with the following command in the terminal:
python3 -m vllm.entrypoints.openai.api_server --model=/path/to/my/local/model --dtype=float16 --tensor-parallel-size=8 --quantization=awq --gpu-memory-utilization=0.7
It is hosted on localhost:8000.
I did a sanity check with a curl command:
Application Setup:
I followed the reference guide here:
https://docs.llamaindex.ai/en/stable/api_reference/llms/vllm/#llama_index.llms.vllm.VllmServer
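Following that guide, the failing call is essentially the snippet below. This is a minimal reproduction sketch, not my exact script; the api_url matches the server started above, and other parameters are left at the guide's defaults.

from llama_index.llms.vllm import VllmServer

# Point VllmServer at the locally hosted endpoint and run a single completion;
# with the OpenAI-compatible server and the versions listed above, this call raises the error.
llm = VllmServer(api_url="http://localhost:8000/v1/completions", max_new_tokens=256)
print(llm.complete("What is a black hole?"))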
KeyError: 'text'