Hey folks,
I'm building a Retrieval-Augmented Generation (RAG) system using the llama.cpp server with Docker on CPU, combining Elasticsearch with the Llama 3 8B Instruct model at Q5_K_M quantization.
The llama.cpp server works well for the first prompt and response, but subsequent responses take a long time, most likely because the prompt and context keep growing. I've tried all of the tuning suggestions in the llama.cpp server README, but they haven't resolved the issue, and I'm not sure whether I'm doing something wrong or simply misusing the server.
When running Docker, I pass the -p flag, which I understand sets the system prompt, and combine it with --keep -1 so the initial prompt is retained when the context exceeds the maximum. I also use -cnv for conversation mode together with the corresponding --in-prefix and --in-suffix. How can I verify that the initial prompt is actually being picked up? When I open localhost on port 8080, the web UI still shows the default llama prompt. Limiting the context to 512 or even 128 doesn't seem to help, and responses still take over a minute. Here's the Docker command I use (a small check script follows it):
docker run -p 8080:8080 -v $(pwd)/model:/models ghcr.io/ggerganov/llama.cpp:server -m models/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf -c 512 -t 10 --in-prefix 'User' --in-suffix 'Llama' -cnv --keep -1 -p "This is a conversation between User and Llama, an intelligent, friendly, and polite medical assistant. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately with precision and provides detailed and helpful answers to user's medical questions, including accurate references where applicable." --host 0.0.0.0 --port 8080
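To check whether the server has actually registered the system prompt, this is roughly what I run against it. I'm assuming the /props endpoint described in the server README is available in this build (the fields it returns seem to vary between versions), and I also watch the server console, which prints the prompt token count and evaluation time for each request:

import requests

BASE_URL = "http://localhost:8080"  # adjust if the container is mapped elsewhere

# Dump whatever the server reports about itself; the exact fields vary between
# versions, so I just print the whole payload and look for the system prompt /
# generation settings in there.
props = requests.get(f"{BASE_URL}/props", timeout=10)
print(props.status_code)
print(props.json())

# A tiny completion request: the server's console output reports the prompt
# token count and evaluation time, which shows whether the system prompt is
# being re-processed on every call.
resp = requests.post(
    f"{BASE_URL}/completion",
    json={"prompt": "Hello", "n_predict": 16},
    timeout=120,
)
print(resp.json().get("content"))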
In a second test, I kept the system prompt in the Docker run command and also built it into my Python code, concatenating the initial prompt with the user messages and the LLM responses, so the prompt grows with every turn. Here's a fragment of the code:
import requests

# System prompt for every conversation (stored on the instance, since it is
# referenced below as self.llama_prompt).
self.llama_prompt = ("You are an intelligent and polite medical assistant "
                     "who provides detailed and "
                     "helpful answers to user's medical questions, "
                     "including accurate references where applicable.")

# Append the new user turn to the running history, using the Llama 3 chat template tokens.
self.conversation_history.append(
    f"<|start_header_id|>user<|end_header_id|>\n\n{message}<|eot_id|>\n")

# Rebuild the full prompt every turn: system prompt + entire history + assistant header.
prompt = ("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
          f"{self.llama_prompt}<|eot_id|>")
prompt += "".join(self.conversation_history)
prompt += "<|start_header_id|>assistant<|end_header_id|>"

data = {
    "prompt": prompt,
    **params
}
response = requests.post(f"{self.base_url}/completion", json=data, stream=True)

# full_content is accumulated from the streamed response (loop omitted here)
# and stored as the assistant turn.
self.conversation_history.append(
    f"<|start_header_id|>assistant<|end_header_id|>{full_content}<|eot_id|>\n")
I modified the Docker run command, removing --prompt-cache-all, --in-prefix 'User', --in-suffix 'Llama', and -cnv:
docker run -p 8080:8080 -v $(pwd)/model:/models ghcr.io/ggerganov/llama.cpp:server -m models/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf -c 512 -t 10 --keep -1 -p "This is a conversation between User and Llama, an intelligent, friendly, and polite medical assistant. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately with precision and provides detailed and helpful answers to user's medical questions, including accurate references where applicable." --host 0.0.0.0 --port 8080
I understand that responses may get slower as the input prompt grows. Still, I expected --keep -1 to keep the initial system prompt in the context, so that later responses would be roughly as fast as the first one, or at least not dramatically slower.
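If I read the server README correctly, there is also a request-level cache_prompt option on /completion that is supposed to let the server reuse the already-evaluated common prefix between calls. I'm not certain how it interacts with --keep, but this is roughly how I would pass it (illustrative sketch, not my actual client code):

import requests

data = {
    "prompt": prompt,        # system prompt + conversation history, built as above
    "n_predict": 256,        # illustrative value
    "cache_prompt": True,    # ask the server to reuse the KV cache for the shared prefix
}
response = requests.post("http://localhost:8080/completion", json=data)
print(response.json().get("content"))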
Despite these changes, responses remain slow, and any guidance on what I'm doing wrong would be greatly appreciated. It's worth noting that running llama.cpp directly on my machine is very fast; however, I need it to work with the specified system prompt and handle the context correctly.
Is llama.cpp better suited to chatbots than to RAG?