Scrappy Llama Server model swapping proxy middleware #11286
lukestanley started this conversation in Show and tell
I wanted to be able to quickly switch models and save GPU memory and power when idle, while still using speculative decoding and the latest llama.cpp server goodness. I'm not very experienced with C++, so I wrote this middleware for myself.
When a request specifies a model other than the one currently loaded, the middleware loads the new model and then proxies the request with streaming as normal (there's a rough sketch of the idea at the end of this post).
It's just over 200 lines of code, and you can ask an LLM what it depends on and how to use it, if that's useful to you!
https://gist.github.com/lukestanley/2577d0b8fcb02e678b202fe0fd924b15
The gist is a hot mess, but it works. I'll tidy it up and think about how to do it a bit better.
I had been using Ollama, but wanted the latest features and the speed that speculative decoding brings.
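
For anyone curious how such a proxy can work, here is a minimal sketch of the idea. This is not the gist's actual code: it assumes Python with FastAPI and httpx, a llama-server process listening on 127.0.0.1:8081, and a hypothetical `MODELS` mapping from model names to .gguf paths.

```python
# Sketch only (not the gist's code): restart llama-server when a request
# names a different model, then stream the upstream response back.
import subprocess
import time

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

UPSTREAM = "http://127.0.0.1:8081"
MODELS = {"llama-3-8b": "/models/llama-3-8b.gguf"}  # hypothetical mapping

app = FastAPI()
current = {"model": None, "proc": None}


def ensure_model(name: str) -> None:
    """Restart llama-server if the requested model differs from the loaded one."""
    if name == current["model"]:
        return
    if current["proc"] is not None:
        current["proc"].terminate()
        current["proc"].wait()
    current["proc"] = subprocess.Popen(
        ["llama-server", "-m", MODELS[name], "--port", "8081"]
    )
    current["model"] = name
    time.sleep(5)  # crude wait for startup; polling /health would be nicer


@app.post("/v1/chat/completions")
async def proxy(request: Request):
    body = await request.json()
    model = body.get("model")
    if model is not None:
        ensure_model(model)

    async def stream():
        # Forward the request and relay the response chunks as they arrive.
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream(
                "POST", f"{UPSTREAM}/v1/chat/completions", json=body
            ) as resp:
                async for chunk in resp.aiter_bytes():
                    yield chunk

    return StreamingResponse(stream(), media_type="text/event-stream")
```

A more complete version would poll llama-server's /health endpoint instead of sleeping, and forward the other OpenAI-compatible endpoints as well; see the gist for the working implementation.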