Running Llama.cpp Server as AsyncOpenAI-like API with Outlines #1663
Unanswered
rishab-sakalkale asked this question in Q&A
Replies: 2 comments · 3 replies
- You can use
- Does it support async calls? In the `models/__init__.py` file I see:

  So if I want to send several requests and have the Ollama server process them in parallel, even if I create an Ollama `AsyncClient`, the `BlackBoxModel` may not treat it as such.
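  For reference, a minimal sketch of the usage this comment is questioning. Whether Outlines actually dispatches these requests asynchronously, rather than treating the wrapped client as a synchronous `BlackBoxModel`, is exactly the open point here; the host, the model name, and the awaitable call style are assumptions, not confirmed API behaviour.

  ```python
  import asyncio

  import ollama
  import outlines


  async def main() -> None:
      # ollama.AsyncClient is the async client from the `ollama` Python package.
      client = ollama.AsyncClient(host="http://localhost:11434")

      # Assumption: from_ollama accepts an AsyncClient and returns a model whose
      # calls can be awaited. If the client is instead wrapped as a synchronous
      # BlackBoxModel, the gather below will not actually run in parallel.
      model = outlines.from_ollama(client, "llama3.2")  # model name is illustrative

      async def one_request(prompt: str) -> str:
          return await model(prompt)

      prompts = [f"Summarize document {i}" for i in range(4)]
      results = await asyncio.gather(*(one_request(p) for p in prompts))
      for r in results:
          print(r)


  asyncio.run(main())
  ```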
Hi all, since I have a CPU-only server, I am trying to find the best setup for fast JSON parsing of large text files.

My main need is serving concurrent requests, which llama-cpp-python unfortunately does not currently support. So I am running the llama.cpp server in the background and treating it as an OpenAI-like API. However, when I run the `from_openai()` function, I get this error:

My code looks like this:

I'm not sure why this is the case, since `async_model` is an instance of the OpenAI class, which should be a `BlackBoxModel`? Also, can the API still be treated like an async server in this case, since the `AsyncOpenAI` class has now been deprecated (as far as I can tell from the v1 update)?

Any help would be appreciated. If you have any recommendations for a different inference engine that has CPU-only support (I've only tried vLLM so far, but the setup is hell) and good concurrency, I'm all ears. Thanks!