Running Llama.cpp Server as AsyncOpenAI-like API with Outlines #1663
Unanswered
rishab-sakalkale asked this question in Q&A
Replies: 2 comments · 3 replies
- You can use
- Does it support async calls? In the `models/__init__.py` file I see:

  So if I want to send several requests and have the Ollama server process them in parallel, even if I create an Ollama `AsyncClient`, the `BlackBoxModel` may not treat it as such.
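  For reference, a minimal sketch of the usage this comment is questioning. Whether Outlines actually dispatches these requests asynchronously, rather than treating the wrapped client as a synchronous `BlackBoxModel`, is exactly the open point here; the host, the model name, and the awaitable call style are assumptions, not confirmed API behaviour.

  ```python
  import asyncio

  import ollama
  import outlines


  async def main() -> None:
      # ollama.AsyncClient is the async client from the `ollama` Python package.
      client = ollama.AsyncClient(host="http://localhost:11434")

      # Assumption: from_ollama accepts an AsyncClient and returns a model whose
      # calls can be awaited. If the client is instead wrapped as a synchronous
      # BlackBoxModel, the gather below will not actually run in parallel.
      model = outlines.from_ollama(client, "llama3.2")  # model name is illustrative

      async def one_request(prompt: str) -> str:
          return await model(prompt)

      prompts = [f"Summarize document {i}" for i in range(4)]
      results = await asyncio.gather(*(one_request(p) for p in prompts))
      for r in results:
          print(r)


  asyncio.run(main())
  ```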
Hi all, since I have a CPU-only server, I am trying to find the best setup for fast JSON parsing of large text files.

My main need is serving concurrent requests, which llama-cpp-python unfortunately does not currently support. So I am running the llama.cpp server in the background and treating it as an OpenAI-like API. However, when I run the `from_openai()` function, I get this error:

My code looks like this:

I'm not sure why this is the case, since `async_model` is an instance of the OpenAI class, which should be a `BlackBoxModel`? Also, can the API still be treated like an async server in this case, since the `AsyncOpenAI` class has now been deprecated (as far as I can tell from the v1 update)?

Any help would be appreciated. If you have any recommendations for a different inference engine that has CPU-only support (I've only tried vLLM so far, but the setup is hell) and good concurrency, I'm all ears. Thanks!