Llama.cpp Server Tool Streaming Workaround

This project provides a Python-based reverse proxy designed to sit in front of a llama.cpp llama-server instance. Its primary goal is to enable OpenAI-compatible streaming for responses that include tool calls (function calling), which llama-server does not yet fully support in streaming mode.

When llama-server processes a request that results in tool calls, it typically sends the entire JSON response at once, even if stream: true was requested by the client. This proxy intercepts such responses, and if tool calls are present (or for any non-streaming JSON response to a chat completion endpoint), it simulates an OpenAI-style text/event-stream response. This makes it compatible with clients and frameworks that expect this specific streaming behavior for tool integration.

This is intended as a temporary glue/shim server until llama-server natively supports robust streaming for tool call responses.

Problem Solved

llama-server (from llama.cpp) is a tool for serving local LLMs with an OpenAI-compatible API. However, a current limitation is its handling of streaming responses when tool_calls (function calls) are part of the model's output. Instead of streaming the tool call information progressively as OpenAI's API does, llama-server often sends the full, non-streamed JSON response.
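
For example, a tool-call turn typically comes back from llama-server as a single, complete JSON body in the OpenAI chat.completion shape, roughly like this (IDs, tool name, and arguments are illustrative):

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"city\": \"Berlin\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}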

This proxy addresses this by:

  1. Forcing stream: false in requests to the backend llama-server (to ensure it gets the complete response with potential tool calls).
  2. Receiving the full JSON response from llama-server.
  3. If the response is a chat completion containing tool calls (or is otherwise a complete JSON response the client asked to receive as a stream), synthesizing an OpenAI-format text/event-stream, chunking the content and emitting the tool calls as delta updates (a minimal sketch follows this list).
  4. Forwarding other requests (like /v1/models) as-is.
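
The core of this flow can be sketched as follows. This is a minimal illustration assuming a FastAPI + httpx stack, not the project's actual implementation; the real proxy also handles other endpoints, error cases, and the SSL options described below.

# sketch.py — illustrative only; names, framework, and error handling are assumptions
import json
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse

TARGET_URL = "http://localhost:8080"   # your llama-server
CHUNK_SIZE = 50                        # characters per simulated content delta (assumption)

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    client_wants_stream = body.pop("stream", False)
    body["stream"] = False  # 1. force a complete response from llama-server

    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{TARGET_URL}/v1/chat/completions", json=body, timeout=None)
    full = resp.json()  # 2. full JSON, possibly containing tool_calls

    if not client_wants_stream:
        return JSONResponse(full)

    # 3. re-emit the complete response as OpenAI-style SSE chunks
    def sse():
        choice = full["choices"][0]
        msg = choice["message"]
        base = {"id": full.get("id", "chatcmpl-proxy"),
                "object": "chat.completion.chunk",
                "model": full.get("model", "")}
        if msg.get("tool_calls"):
            # streamed tool_calls entries carry an explicit per-call index
            tool_calls = [{"index": i, **tc} for i, tc in enumerate(msg["tool_calls"])]
            chunk = {**base, "choices": [{"index": 0,
                                          "delta": {"role": "assistant", "tool_calls": tool_calls},
                                          "finish_reason": None}]}
            yield f"data: {json.dumps(chunk)}\n\n"
        content = msg.get("content") or ""
        for i in range(0, len(content), CHUNK_SIZE):
            chunk = {**base, "choices": [{"index": 0,
                                          "delta": {"content": content[i:i + CHUNK_SIZE]},
                                          "finish_reason": None}]}
            yield f"data: {json.dumps(chunk)}\n\n"
        done = {**base, "choices": [{"index": 0, "delta": {},
                                     "finish_reason": choice.get("finish_reason", "stop")}]}
        yield f"data: {json.dumps(done)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")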

Who Needs This?

This workaround is beneficial for applications and frameworks that:

  • Utilize the OpenAI SDK (or compatible libraries) to interact with LLMs.
  • Rely on streaming responses for a better user experience or real-time processing.
  • Employ tool calling / function calling features with models served by llama-server.

Specifically, if your application expects to receive delta updates containing tool_calls within a stream, and llama-server is not providing them in that manner, this proxy can bridge the gap.
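
For comparison, OpenAI-style streaming delivers tool calls as delta updates inside a text/event-stream, along these lines (values illustrative; the proxy may emit a whole tool call in a single chunk rather than split it across several):

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","tool_calls":[{"index":0,"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"city\": \"Berlin\"}"}}]},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}

data: [DONE]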

Examples of tools/frameworks that can benefit:

  • Open Interpreter: When configured to use a local llama-server instance that supports tool calling, Open Interpreter expects streaming responses, including for tool execution steps. This proxy helps ensure compatibility.
  • n8n.io: When using n8n's LLM nodes (e.g., OpenAI node) with a custom llama-server backend for models with function/tool calling capabilities, n8n might expect OpenAI-compliant streaming for these interactions.

Features

  • Simulated Streaming for Tool Calls: Converts llama-server's non-streamed tool call responses into an OpenAI-compatible text/event-stream (tool calling support contributed by @PrideIsLife).
  • Configurable: All key parameters (ports, target server URL, SSL verification) are managed via a config.yaml file.
  • Flexible Target Server Connection:
    • Connects to HTTP backends.
    • Connects to HTTPS backends with standard SSL certificate verification.
    • Connects to HTTPS backends ignoring SSL certificate errors (useful for development with self-signed certificates).
  • Dockerized: Includes Dockerfile and docker-compose.yml for easy deployment.
  • Basic Logging: Provides insight into requests and responses.

Prerequisites

  • Python 3.8+
  • pip (Python package installer)
  • A running instance of llama-server (from llama.cpp) configured with a model capable of tool/function calling (see the example after this list).
  • Docker and Docker Compose (Recommended for easy deployment).
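
As an illustration, such a llama-server instance might be started along these lines; the model path is a placeholder and flag availability depends on your llama.cpp build (--jinja enables the chat-template handling that tool calling relies on):

llama-server -m ./models/your-model.gguf --host 0.0.0.0 --port 8080 --jinja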

Configuration

Configuration is handled via a config.yaml file. A template, config.yaml.example, is provided:

# config.yaml.example

# Port on which the reverse proxy server will listen
proxy_port: 8066

# Target server configuration (your llama-server instance)
# Examples:
#   For HTTP: "http://localhost:8080"
#   For HTTPS with certificate verification: "https://secure.llama.server.com"
#   For HTTPS ignoring certificate errors: "https://localhost:8081" # if llama-server uses HTTPS with self-signed cert
target_url: "http://localhost:8080" # Adjust to your llama-server address

# SSL verification for the target_url (only if target_url is HTTPS)
# true: Verify SSL certificate (default)
# false: Do not verify SSL certificate
# string: Path to a CA bundle file or directory
verify_ssl: true

# Chunk size for simulated text streaming from a complete response
streaming_chunk_size: 50

# Optional: Request timeout for backend requests (seconds)
# request_timeout: 30

# Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
log_level: "INFO"

Setup and Running

  1. Using Docker Compose (Recommended)

This is the easiest way to get the proxy running.

git clone https://github.com/crashr/llama-stream
cd llama-stream

Before running, copy config.yaml.example to config.yaml and edit it so that target_url points to your llama-server instance and proxy_port is set as desired. The port mapping in docker-compose.yml assumes the proxy_port inside the container is 8066; if you change proxy_port in config.yaml, update the container side of the port mapping in docker-compose.yml accordingly (e.g., ports: - "8066:NEW_PROXY_PORT").
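
For orientation, the relevant part of docker-compose.yml has roughly this shape (an illustrative sketch; the service name is assumed and the file in the repository is authoritative):

services:
  llama-stream:
    build: .
    ports:
      - "8066:8066"   # host:container — keep the container side equal to proxy_port in config.yaml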

docker-compose up --build -d
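
Once the container is up, clients can talk to the proxy exactly as they would to any OpenAI-compatible endpoint. A quick smoke test with the official OpenAI Python SDK might look like this (model name and tool definition are placeholders):

from openai import OpenAI

# Point the SDK at the proxy instead of llama-server; llama-server needs no API key.
client = OpenAI(base_url="http://localhost:8066/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",   # placeholder tool
        "description": "Get the weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="local-model",         # placeholder; llama-server typically serves whatever model it loaded
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    tools=tools,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        print(delta.tool_calls)
    elif delta.content:
        print(delta.content, end="", flush=True)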
