Hey folks,
I'm building a Retrieval-Augmented Generation (RAG) system using the llama.cpp server with Docker on CPU, combining Elasticsearch with the Llama 3 8B Instruct model at Q5_K_M quantization.
The llama.cpp server works well for the first prompt and response, but subsequent responses take a long time, most likely because the prompt and context keep growing. I've tried all of the tuning suggestions in the llama.cpp server README, but they haven't resolved the issue, and I'm not sure whether I'm doing something wrong or simply misusing the server.
When running Docker, I pass the -p flag, which I understand sets the system prompt, and combine it with --keep -1 so the initial prompt is retained when the context exceeds the maximum. I also use -cnv for conversation mode together with the corresponding --in-prefix and --in-suffix. How can I verify that the initial prompt is actually being picked up? When I open localhost on port 8080, the web UI still shows the default llama prompt. Limiting the context to 512 or even 128 doesn't seem to help, and responses still take over a minute. Here's the Docker command I use (a small check script follows it):
docker run -p 8080:8080 -v $(pwd)/model:/models ghcr.io/ggerganov/llama.cpp:server -m models/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf -c 512 -t 10 --in-prefix 'User' --in-suffix 'Llama' -cnv --keep -1 -p "This is a conversation between User and Llama, an intelligent, friendly, and polite medical assistant. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately with precision and provides detailed and helpful answers to user's medical questions, including accurate references where applicable." --host 0.0.0.0 --port 8080
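To check whether the server has actually registered the system prompt, this is roughly what I run against it. I'm assuming the /props endpoint described in the server README is available in this build (the fields it returns seem to vary between versions), and I also watch the server console, which prints the prompt token count and evaluation time for each request:

import requests

BASE_URL = "http://localhost:8080"  # adjust if the container is mapped elsewhere

# Dump whatever the server reports about itself; the exact fields vary between
# versions, so I just print the whole payload and look for the system prompt /
# generation settings in there.
props = requests.get(f"{BASE_URL}/props", timeout=10)
print(props.status_code)
print(props.json())

# A tiny completion request: the server's console output reports the prompt
# token count and evaluation time, which shows whether the system prompt is
# being re-processed on every call.
resp = requests.post(
    f"{BASE_URL}/completion",
    json={"prompt": "Hello", "n_predict": 16},
    timeout=120,
)
print(resp.json().get("content"))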
In a second test, I kept the system prompt in the Docker run command and also built it into my Python code, concatenating the initial prompt with the user messages and the LLM responses, so the prompt grows with every turn. Here's a fragment of the code:
import requests

# System prompt for every conversation (stored on the instance, since it is
# referenced below as self.llama_prompt).
self.llama_prompt = ("You are an intelligent and polite medical assistant "
                     "who provides detailed and "
                     "helpful answers to user's medical questions, "
                     "including accurate references where applicable.")

# Append the new user turn to the running history, using the Llama 3 chat template tokens.
self.conversation_history.append(
    f"<|start_header_id|>user<|end_header_id|>\n\n{message}<|eot_id|>\n")

# Rebuild the full prompt every turn: system prompt + entire history + assistant header.
prompt = ("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
          f"{self.llama_prompt}<|eot_id|>")
prompt += "".join(self.conversation_history)
prompt += "<|start_header_id|>assistant<|end_header_id|>"

data = {
    "prompt": prompt,
    **params
}
response = requests.post(f"{self.base_url}/completion", json=data, stream=True)

# full_content is accumulated from the streamed response (loop omitted here)
# and stored as the assistant turn.
self.conversation_history.append(
    f"<|start_header_id|>assistant<|end_header_id|>{full_content}<|eot_id|>\n")
I modified the Docker run command, removing --prompt-cache-all, --in-prefix 'User', --in-suffix 'Llama', and -cnv:
docker run -p 8080:8080 -v $(pwd)/model:/models ghcr.io/ggerganov/llama.cpp:server -m models/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf -c 512 -t 10 --keep -1 -p "This is a conversation between User and Llama, an intelligent, friendly, and polite medical assistant. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately with precision and provides detailed and helpful answers to user's medical questions, including accurate references where applicable." --host 0.0.0.0 --port 8080
I understand that responses may get slower as the input prompt grows. Still, I expected --keep -1 to keep the initial system prompt in the context, so that later responses would be roughly as fast as the first one, or at least not dramatically slower.
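If I read the server README correctly, there is also a request-level cache_prompt option on /completion that is supposed to let the server reuse the already-evaluated common prefix between calls. I'm not certain how it interacts with --keep, but this is roughly how I would pass it (illustrative sketch, not my actual client code):

import requests

data = {
    "prompt": prompt,        # system prompt + conversation history, built as above
    "n_predict": 256,        # illustrative value
    "cache_prompt": True,    # ask the server to reuse the KV cache for the shared prefix
}
response = requests.post("http://localhost:8080/completion", json=data)
print(response.json().get("content"))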
Despite these changes, responses remain slow, and any guidance on what I'm doing wrong would be greatly appreciated. It's worth noting that running llama.cpp directly on my machine is very fast; however, I need it to work with the specified system prompt and handle the context correctly.
Is llama.cpp better suited to chatbots than to RAG?