Scrappy Llama Server model swapping proxy middleware #11286
lukestanley started this conversation in Show and tell
I wanted to be able to quickly switch models and save GPU memory and power when idle, while still using speculative decoding and the latest llama.cpp server goodness. I'm not very experienced with C++, so I wrote this middleware for myself.
When a request specifies a model other than the one currently loaded, the middleware loads the new model and then proxies the request with streaming as normal (there's a rough sketch of the idea at the end of this post).
It's just over 200 lines of code, and you can ask an LLM what it depends on and how to use it, if that's useful to you!
https://gist.github.com/lukestanley/2577d0b8fcb02e678b202fe0fd924b15
The gist is a hot mess, but it works. I'll tidy it up and think about how to do it a bit better.
I had been using Ollama, but wanted the latest features and the speed that speculative decoding brings.
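
For anyone curious how such a proxy can work, here is a minimal sketch of the idea. This is not the gist's actual code: it assumes Python with FastAPI and httpx, a llama-server process listening on 127.0.0.1:8081, and a hypothetical `MODELS` mapping from model names to .gguf paths.

```python
# Sketch only (not the gist's code): restart llama-server when a request
# names a different model, then stream the upstream response back.
import subprocess
import time

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

UPSTREAM = "http://127.0.0.1:8081"
MODELS = {"llama-3-8b": "/models/llama-3-8b.gguf"}  # hypothetical mapping

app = FastAPI()
current = {"model": None, "proc": None}


def ensure_model(name: str) -> None:
    """Restart llama-server if the requested model differs from the loaded one."""
    if name == current["model"]:
        return
    if current["proc"] is not None:
        current["proc"].terminate()
        current["proc"].wait()
    current["proc"] = subprocess.Popen(
        ["llama-server", "-m", MODELS[name], "--port", "8081"]
    )
    current["model"] = name
    time.sleep(5)  # crude wait for startup; polling /health would be nicer


@app.post("/v1/chat/completions")
async def proxy(request: Request):
    body = await request.json()
    model = body.get("model")
    if model is not None:
        ensure_model(model)

    async def stream():
        # Forward the request and relay the response chunks as they arrive.
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream(
                "POST", f"{UPSTREAM}/v1/chat/completions", json=body
            ) as resp:
                async for chunk in resp.aiter_bytes():
                    yield chunk

    return StreamingResponse(stream(), media_type="text/event-stream")
```

A more complete version would poll llama-server's /health endpoint instead of sleeping, and forward the other OpenAI-compatible endpoints as well; see the gist for the working implementation.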