
Can llama.cpp run the same model in parallel? #8840

Answered by Zuo-Peng
Zuo-Peng asked this question in Q&A

Thank you for your reply. I think I have solved my problem.
I gave up using the LangChain framework.

I start the server with ./llama-server -m model/path -np 4 to enable 4 parallel request slots. Then I use the following code to run inference:

import requests

# example prompt; replace with your own
prompt = "What is the capital of France?"

# send a single completion request to the llama.cpp server
url = "http://localhost:8080/completion"
headers = {"Content-Type": "application/json"}
data = {"prompt": prompt}
response = requests.post(url, headers=headers, json=data)
content = response.json()['content']

This way, I can run inference on the model in parallel, which greatly improves throughput.
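As a minimal sketch of how the requests can actually be issued concurrently (so that all 4 server slots are used), something like the following should work; the prompts and the thread-pool size here are just placeholders, not part of the original setup:

import requests
from concurrent.futures import ThreadPoolExecutor

url = "http://localhost:8080/completion"

def complete(prompt):
    # each in-flight request occupies one of the server's -np slots
    response = requests.post(url, json={"prompt": prompt})
    return response.json()['content']

# example prompts; replace with the real workload
prompts = ["Prompt A", "Prompt B", "Prompt C", "Prompt D"]

# send up to 4 requests at a time, matching -np 4
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(complete, prompts))

for p, content in zip(prompts, results):
    print(p, "->", content)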
