Replies: 7 comments 4 replies
-
This is also a requirement for OpenAI Codex.
-
If implemented, this may resolve issues with Agent calls from n8n (n8n-io/n8n#13112): the current version of OpenAI API agent calls via the n8n chain results in the same error output when called.
-
It also affects usage inside zed.dev: they need tool-use support in streaming responses.
-
It's being worked on: #12379
-
I think streaming tool-call support is very important. I only provide the model with programming interfaces, allowing it to fetch information programmatically (such as the current time). After thinking, the model is more likely to use tools and to write correct code. The content produced during the thinking phase and the tool calls are separate. It would be great if the thinking phase could be viewed as a stream, so we don't have to wait for an unknown duration (the thinking time would still be unknown, but we could at least see the model's current progress). Of course, it would be even better if fully streaming tool calls were supported directly, so that models without thinking capabilities could also use tools while streaming. Here is an excerpt of the code I tested:
```python
import os
import subprocess

executable = "lua5.4"  # path to the Lua 5.4 interpreter; adjust for your system

messages = [
    {"role": "system", "content": "You are a helpful assistant.\nYou can call functions with appropriate input when necessary (even for problems solved through programming APIs)."},
    {"role": "user", "content": "what time is it now?"},
]

API_tools = [
    {
        "type": "function",
        "function": {
            "name": "lua54",
            "description": "Some problems that need to be solved through programming APIs can be solved with Lua (version 5.4). Output must be produced via print (not bare expressions); the return value is JSON (code, stdout, stderr).",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "the code to execute"
                    },
                    "cwd": {
                        "type": "string",
                        "description": "current working directory (optional)"
                    },
                    "env": {
                        "type": "object",
                        "additionalProperties": {"type": "string"},
                        "description": "environment variables (optional)"
                    }
                },
                "required": ["code"]
            }
        }
    }
]

payload = {
    ...  # other request fields omitted in this excerpt
    "tools": API_tools,
    "tool_choice": "auto",
}

def Lua54Service(code, cwd=None, env=None):
    """Run a Lua snippet and capture its output for the tool-call result."""
    try:
        process = subprocess.Popen(
            [executable, "-e", code],
            cwd=cwd or os.getcwd(),
            env={**os.environ, **(env or {})},
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        stdout, stderr = process.communicate()
        return {"code": process.returncode, "stdout": stdout, "stderr": stderr}
    except Exception as e:
        raise RuntimeError(f"Lua execution failed: {e}")
```
-
Streaming tool calls just got merged and are available in the latest release :)
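For anyone wanting to verify the new behavior, a raw request along these lines should now stream while tools are set (a minimal sketch: the local URL and model name are assumptions, and `API_tools` refers to the tool schema from the earlier comment):
```python
import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local server
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "what time is it now?"}],
        "tools": API_tools,
        "tool_choice": "auto",
        "stream": True,  # previously rejected together with "tools"
    },
    stream=True,
)

# OpenAI-style SSE: lines look like 'data: {...}', terminated by 'data: [DONE]'.
for line in resp.iter_lines():
    if not line.startswith(b"data: "):
        continue
    data = line[len(b"data: "):]
    if data == b"[DONE]":
        break
    print(json.loads(data)["choices"][0]["delta"])
```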
-
The proxy lives here now.
-
5ire, one of the most recommended open-source MCP clients, requires streaming together with tool use.
However, llama-server doesn't allow this combination (see llama.cpp/examples/server/utils.hpp, line 565 at f17a3bb).
It would be great to be able to use this tool with llama.cpp directly.