Replies: 12 comments
-
The easiest manual way to load a model on demand is calling the `GET /upstream/:model` endpoint.
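For instance, a plain curl is enough to warm a model from a startup script or cron job. A rough sketch (the host, port and model name below are placeholders for your own setup, not defaults):

```sh
# GET /upstream/<model> makes llama-swap load that model on demand,
# so a fire-and-forget curl acts as a manual preload.
# "localhost:8080" and "my-model" are placeholders.
curl -s -o /dev/null http://localhost:8080/upstream/my-model
```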
-
Another use case where this may be useful is when one swaps between two models that do not both fit in VRAM at the same time, like your aider "architect" example in the wiki. In this case, it would be useful if there was some way to load the first model (the architect) after the second model (the coder) has finished. This avoids the waiting time for the first model to be loaded again.
-
Been thinking about this a little bit on and off. I think a good place to put it is part of the Groups feature. Here's an example configuration
Thoughts about this design:
-
Maybe it should be a preload and a "reload"? For my use cases I wouldn't use a ttl, but I would like a model to be loaded whenever there is no model loaded in the group. I would love this feature. For now I've glued this function in with a watcher script.
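For anyone curious, such watcher glue doesn't need to be fancy. Below is a rough, hypothetical sketch (not the actual script mentioned above): it assumes llama-swap on localhost:8080 and a fallback model named qwen3-0.6b, and uses low GPU memory usage as a crude signal that nothing from the group is currently loaded. The `/upstream/<model>` endpoint is the one discussed elsewhere in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical watcher: keep a small fallback model resident whenever the GPU
# looks empty. Host, port, model name and threshold are placeholders.
BASE="http://localhost:8080"
FALLBACK="qwen3-0.6b"
THRESHOLD_MB=2000   # below this, assume no model from the group is loaded

while sleep 30; do
  used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n1 | tr -d ' ')
  if [ "${used:-0}" -lt "$THRESHOLD_MB" ]; then
    # Hitting /upstream/<model> tells llama-swap to load that model.
    curl -s -o /dev/null "$BASE/upstream/$FALLBACK"
  fi
done
```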
-
I think I need to collect a few more use cases of how people may use a preload feature. I’m particularly unsure about automatically swapping back to the preloaded model. What conditions would trigger the swap-back? 🤔
-
One alternative to finding a one-size-fits-all strategy would be to introduce hooks and actions, and allow users to do whatever they want with loading/unloading models. So, instead of baking a single pre-load policy into the groups feature, we could expose a small, declarative event system (maybe also relevant to your recently introduced internal event system).
Examples of hooks: `on_startup`, `before_load`, `after_load`, `after_unload`, `on_request`, `on_health_fail` (all used in the examples below).
Built-in actions:
Action | Parameters | Notes |
---|---|---|
load_model | `id: model-id` | Equivalent to `GET /upstream/:model` |
unload_model | `id: model-id` | Graceful cmdStop + kill |
reload_model | `id: model-id` | Convenience = unload + load |
shell | `cmd: string` | Run an arbitrary command (for maximal freedom) |
http | `url: string`, `method: GET/POST/...`, `headers: array`, `body: string` | Ping Grafana, webhooks, etc. |
log | `level`, `msg` | Custom log entries |
The action list is intentionally small, since most advanced workflows can be achieved with `shell` or `http`, while the first three should cover 90% of needs. Typically the most useful `http` params are `url`, `method`, `headers`, and `body`.
Examples
Example 1 - Architect / Coder hot-swap
```yaml
# Pre-loads "architect" at startup and reloads it
# whenever the "coder" model is evicted.
hooks:
  on_startup:
    - load_model: architect
  after_unload:
    - when:
        model: coder          # fires only for this model
      do:
        - load_model: architect
```
Flow: user requests coder -> llama-swap swaps in coder -> coder hits its ttl -> after_unload triggers -> architect is back in RAM before the next request.
Example 2 - VRAM-tight gaming rig
```yaml
groups:
  my_fav:
    swap: true
    members:
      - qwen3-0.6b
      - gemma3-27b

hooks:
  before_load:
    - when:
        group: my_fav
      do:
        - shell: nvidia-smi --persistence-mode=1   # ensure clean VRAM
  after_load:
    - shell: "curl -s -XPOST http://grafana/api/warm?model=${_CURRENT_MODEL}"
  on_request:
    - when:
        idle_for: 15          # seconds with no traffic
      do:
        - unload_model: ${_CURRENT_MODEL}
        - load_model: qwen3-0.6b   # keep the tiny model resident
```
Example 3 - Blue-Green production roll-out with automatic warm-up
Use case: production roll-out - bring a new build of a model online, prime it with a dummy prompt, and then retire the old build once the new one is healthy.
```yaml
models:
  gpt-green:
    cmd: "/opt/llama-server --model /models/mymodel-v1.1-green.gguf --port ${PORT}"
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 0   # never auto-unload
  gpt-blue:
    cmd: "/opt/llama-server --model /models/mymodel-v1.0-blue.gguf --port ${PORT}"
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 0

groups:
  prod:
    swap: true
    members: [gpt-green, gpt-blue]

hooks:
  # On startup, load the last-known-good version (blue).
  on_startup:
    - load_model: gpt-blue

  # A manual curl to /upstream/gpt-green triggers before_load / after_load for green.
  after_load:
    - when:
        model: gpt-green
      do:
        # Warm-up request so users don't pay the KV-cache build time.
        - http:
            url: "http://127.0.0.1:${PORT}/v1/chat/completions"
            method: POST
            body: |
              { "model": "gpt-green",
                "messages": [{"role": "system", "content": "warm-up"}],
                "max_tokens": 1 }
        # Retire the previous blue version once green is healthy.
        - unload_model: gpt-blue
```
So, you ship a new model (gpt-green) and hit `curl -XPOST http://llama-swap/upstream/gpt-green`. When the health check passes, after_load primes the cache and unloads the old blue build, giving you a "zero-downtime" deployment path.
Example 4 - High-availability fail-over with Slack alert
If a model's /health endpoint fails, swap in a backup copy and page an on-call engineer.
```yaml
models:
  phi-primary:
    cmd: "llama-server --model /models/phi-3-primary.gguf --port ${PORT}"
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 0
  phi-backup:
    cmd: "llama-server --model /models/phi-3-backup.gguf --port ${PORT}"
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 0

hooks:
  on_startup:
    - load_model: phi-primary
  on_health_fail:
    - when:
        model: phi-primary
      do:
        - unload_model: phi-primary
        - load_model: phi-backup
        - http:
            url: "https://hooks.slack.com/services/T0000/B0000/XXXXXXXX"
            method: POST
            body: |
              { "text": "*phi-primary* failed health-check - swapped to *phi-backup*." }
```
The first time phi-primary returns a non-200 status on /health, llama-swap automatically unloads the problematic instance, spins up phi-backup, and sends a Slack webhook so someone can investigate.
Now, the structure of the hooks definition with `when` and `do` (or directly calling an action) is just a suggestion for thought. I'm also thinking about which conditions should be available in the `when` statement (e.g. `model`, `group`, `idle_for`). Pre-defined runtime macros (variables) such as `${_CURRENT_MODEL}`, as used in the examples above, may be useful as well.
Some benefits of this approach:
- Users chain several simple actions instead of waiting for a one-size-fits-all preload setting.
- No breaking changes: if the `hooks:` section is omitted, llama-swap should behave exactly as today.
- Every hook/action invocation could be logged at debug level, keeping troubleshooting simple.
- New events (e.g. `on_profile_switch`) or actions can be added without touching existing configs.
If you think it is a good idea, then maybe the next steps could be:
- Agree on hook names + JSON payload structure.
- Implement some hooks, such as on_startup, after_load and after_unload, plus the load_model / unload_model actions.
- Do not publish the feature in the public docs until the YAML config and hook names are stable. Early adopters can replicate the watcher scripts many of us run today and report gaps.
-
@henfiber I like the idea of hooks. Maybe instead of "hooks", name it "events" to stay consistent with the new internal event bus system. I'm quite against having complex logic in YAML; it's a markup language, after all. However, I think we could get pretty far starting with just the shell action. Data can be passed as env variables and via macro substitution in the command's args. This would decouple llama-swap from custom, user-specified logic, and it would also make it easier to troubleshoot.
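To illustrate the shell-only direction, a user-owned handler might look like the sketch below. This is purely hypothetical: neither the event wiring nor the `SWAP_MODEL` variable exist today; they stand in for whatever env vars / macro substitution would eventually be passed to the command. The `/upstream/<model>` endpoint and the architect/coder scenario come from earlier in this thread; the host and port are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical user script that a shell-only "events" action could run,
# e.g. after a model is unloaded. SWAP_MODEL is an assumed env variable name.
case "${SWAP_MODEL:-}" in
  coder)
    # The coder was evicted: bring the architect back.
    curl -s -o /dev/null "http://localhost:8080/upstream/architect"
    ;;
  *)
    echo "$(date -u +"%FT%TZ") no action for model '${SWAP_MODEL:-unknown}'" >&2
    ;;
esac
```

Keeping the logic in a script like this means llama-swap only needs to know how to run a command, which matches the decoupling and easier troubleshooting mentioned above.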
-
@mostlygeek "events" is OK too. I agree that you could start simple, as you suggested, with a minimal set of actions and events/hooks, and introduce new ones only if a common scenario is not covered. Maybe I overdid it a bit, because I am used to the Home Assistant automations YAML syntax, which also defines logic declaratively (trigger/condition/action triplets with a lot of predefined options). Structured/constrained logic has a benefit over a DSL: it can be parsed and built from a UI (and some freedom is regained by using Jinja2 templates in many cases). So, personally, I am used to defining logic in YAML.
-
This issue is stale because it has been open for 2 weeks with no activity.
-
This is a cool idea! If this is the draft UX we want to implement, I'm more than happy to take a stab at it. Personally, I don't really need this feature, but llama-swap's been working so well for me that I'd love to help contribute towards making it better. (Thank you for making such a solid piece of software!)
-
Hi @benfiola, I don't think there is a design we've settled on yet. A few key questions still need good answers before proceeding:
The discussion sort of went stale until you messaged, so that may be a sign for this feature too. :)
-
I'm just seeing this thread... my use case is to preload simple embedding and small "utility" models that I know other apps like Open WebUI will use. I can't always guarantee that I can configure or modify apps to call an endpoint on launch, and I often reload
-
I'm currently using llama-swap as a proxy between sglang and other services that use its API, such as Open WebUI or VS Code + Continue, in order to unload the model from VRAM after the service has been idle for a while. This works well (I had to tinker a bit to make sglang work with unloading, but it worked out in the end), but sglang takes around 30s to load the model before it can perform inference, and I'd like to minimize this latency for my users.
Let's take Open WebUI as an example: one way to accomplish this could be to make llama-swap preload a model as soon as a user first visits the webpage (if no model is loaded already?), so that the model loads while the user writes their prompt and, ideally, has finished loading by the time they send their request.
Is there a way to do this currently? For example, by running a preload command when the proxy receives a GET /v1/models, GET /get_model_info, or GET /health request (or whenever the proxy detects any request or connection)?