-
I'm trying to call the same model repeatedly from a multi-process program to batch-generate outputs, and I am using the LangChain framework. I was previously using Ollama, which handles concurrent requests to the same model automatically, with parameters similar to the ones below. So I was wondering how to call the model in parallel when I use llama.cpp instead. Here is my code:
Perhaps this is a very basic question. An answer or some references would be sincerely appreciated.
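For reference, a minimal sketch of the kind of batched Ollama call meant here, assuming the langchain-ollama package and a locally running Ollama server (the model name and parameters below are placeholders, not the original ones):

```python
# Sketch: batched generation through LangChain's Ollama wrapper.
# Assumes `langchain-ollama` is installed and an Ollama server is running
# locally; the model name and options are placeholders.
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3",      # placeholder model name
    temperature=0.7,
    num_predict=256,     # cap on generated tokens
)

prompts = [
    "Give one fun fact about octopuses.",
    "What is the capital of France?",
    "Explain recursion in one sentence.",
]

# batch() fans the prompts out concurrently; Ollama serves them
# according to its own queuing/parallelism settings.
responses = llm.batch(prompts)
for prompt, resp in zip(prompts, responses):
    print(f"{prompt} -> {resp.content}")
```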
-
Can you rephrase your question? From your code, what I understood is that you want to make multiple calls and receive the responses simultaneously. Is this correct?
Thank you for your reply. I think I have solved my problem.
I gave up using the LangChain framework.
I use
./llama-server -m model/path -np 4
to enable parallel request handling, then use the following code to run inference on the LLM. In this way, I can run inference on the model in parallel, which greatly improves the speed.
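A minimal sketch of one way to send concurrent requests to llama-server's OpenAI-compatible endpoint, assuming the server was started with `-np 4` on the default port 8080; the prompts, model name, and port here are placeholders:

```python
# Sketch: fan out concurrent requests to a llama-server instance.
# Assumes llama-server is running with `-np 4` on the default port 8080
# and the `openai` Python package (>= 1.0) is installed.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# llama-server exposes an OpenAI-compatible API; the API key is not checked.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

prompts = [  # placeholder prompts
    "Summarize the history of the Roman Empire in one sentence.",
    "Explain what a hash table is.",
    "Write a haiku about mountains.",
    "List three uses of Python.",
]

def ask(prompt: str) -> str:
    # Each in-flight call occupies one of the server's parallel slots (-np 4).
    resp = client.chat.completions.create(
        model="local-model",  # placeholder; the server serves the model given with -m
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

# Send up to 4 requests at once to match the number of server slots.
with ThreadPoolExecutor(max_workers=4) as pool:
    for prompt, answer in zip(prompts, pool.map(ask, prompts)):
        print(f"Q: {prompt}\nA: {answer}\n")
```

Keeping the client-side concurrency (`max_workers`) at or below the server's `-np` value avoids queuing extra requests behind the available slots.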