Replies: 12 comments
-
The easiest manual way to load a model on demand is calling the `GET /upstream/:model` endpoint.
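For instance, a plain curl is enough to warm a model from a startup script or cron job. A rough sketch (the host, port and model name below are placeholders for your own setup, not defaults):

```sh
# GET /upstream/<model> makes llama-swap load that model on demand,
# so a fire-and-forget curl acts as a manual preload.
# "localhost:8080" and "my-model" are placeholders.
curl -s -o /dev/null http://localhost:8080/upstream/my-model
```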
-
Another use case where this may be useful is when one swaps between two models that do not both fit in VRAM at the same time, like your aider "architect" example in the wiki. In this case, it would be useful if there was some way to load the first model (the architect) after the second model (the coder) has finished. This avoids the waiting time for the first model to be loaded again.
-
Been thinking about this a little bit on and off. I think a good place to put it is part of the Groups feature. Here's an example configuration
Thoughts about this design:
-
Maybe it should be a preload and a "reload"? For my use cases I wouldn't use a ttl, but I would like a model to be loaded whenever there is no model loaded in the group. I would love this feature. For now I've glued this function in with a watcher script.
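For anyone curious, such watcher glue doesn't need to be fancy. Below is a rough, hypothetical sketch (not the actual script mentioned above): it assumes llama-swap on localhost:8080 and a fallback model named qwen3-0.6b, and uses low GPU memory usage as a crude signal that nothing from the group is currently loaded. The `/upstream/<model>` endpoint is the one discussed elsewhere in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical watcher: keep a small fallback model resident whenever the GPU
# looks empty. Host, port, model name and threshold are placeholders.
BASE="http://localhost:8080"
FALLBACK="qwen3-0.6b"
THRESHOLD_MB=2000   # below this, assume no model from the group is loaded

while sleep 30; do
  used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n1 | tr -d ' ')
  if [ "${used:-0}" -lt "$THRESHOLD_MB" ]; then
    # Hitting /upstream/<model> tells llama-swap to load that model.
    curl -s -o /dev/null "$BASE/upstream/$FALLBACK"
  fi
done
```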
-
I think I need to collect a few more use cases of how people may use a preload feature. I’m particularly unsure about automatically swapping back to the preloaded model. What conditions would trigger the swap-back? 🤔
-
One alternative to finding a one-size-fits-all strategy would be to introduce hooks and actions, and allow users to do whatever they want with loading/unloading models. So, instead of baking a single pre-load policy into the groups feature, we could expose a small, declarative event system (maybe also relevant to your recently introduced internal event system).
Examples of hooks: `on_startup`, `before_load`, `after_load`, `after_unload`, `on_request`, `on_health_fail` (all used in the examples below).
Built-in actions:
Action | Parameters | Notes |
---|---|---|
load_model | `id: model-id` | Equivalent to `GET /upstream/:model` |
unload_model | `id: model-id` | Graceful cmdStop + kill |
reload_model | `id: model-id` | Convenience = unload + load |
shell | `cmd: string` | Run an arbitrary command (for maximal freedom) |
http | `url: string`, `method: GET/POST/...`, `headers: array`, `body: string` | Ping Grafana, webhooks, etc. |
log | `level`, `msg` | Custom log entries |
The action list is intentionally small, since most advanced workflows can be achieved with `shell` or `http`, while the first three should cover 90% of needs. Typically the most useful `http` params are `url`, `method`, `headers`, and `body`.
Examples
Example 1 - Architect / Coder hot-swap
```yaml
# Pre-loads "architect" at startup and reloads it
# whenever the "coder" model is evicted.
hooks:
  on_startup:
    - load_model: architect
  after_unload:
    - when:
        model: coder          # fires only for this model
      do:
        - load_model: architect
```
Flow: user requests coder -> llama-swap swaps in coder -> coder hits its ttl -> after_unload triggers -> architect is back in RAM before the next request.
Example 2 - VRAM-tight gaming rig
```yaml
groups:
  my_fav:
    swap: true
    members:
      - qwen3-0.6b
      - gemma3-27b

hooks:
  before_load:
    - when:
        group: my_fav
      do:
        - shell: nvidia-smi --persistence-mode=1   # ensure clean VRAM
  after_load:
    - shell: "curl -s -XPOST http://grafana/api/warm?model=${_CURRENT_MODEL}"
  on_request:
    - when:
        idle_for: 15          # seconds with no traffic
      do:
        - unload_model: ${_CURRENT_MODEL}
        - load_model: qwen3-0.6b   # keep the tiny model resident
```
Example 3 - Blue-Green production roll-out with automatic warm-up
Use case: production roll-out - bring a new build of a model online, prime it with a dummy prompt, and then retire the old build once the new one is healthy.
```yaml
models:
  gpt-green:
    cmd: "/opt/llama-server --model /models/mymodel-v1.1-green.gguf --port ${PORT}"
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 0   # never auto-unload
  gpt-blue:
    cmd: "/opt/llama-server --model /models/mymodel-v1.0-blue.gguf --port ${PORT}"
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 0

groups:
  prod:
    swap: true
    members: [gpt-green, gpt-blue]

hooks:
  # On startup, load the last-known-good version (blue).
  on_startup:
    - load_model: gpt-blue

  # A manual curl to /upstream/gpt-green triggers before_load / after_load for green.
  after_load:
    - when:
        model: gpt-green
      do:
        # Warm-up request so users don't pay the KV-cache build time.
        - http:
            url: "http://127.0.0.1:${PORT}/v1/chat/completions"
            method: POST
            body: |
              { "model": "gpt-green",
                "messages": [{"role": "system", "content": "warm-up"}],
                "max_tokens": 1 }
        # Retire the previous blue version once green is healthy.
        - unload_model: gpt-blue
```
So, you ship a new model (gpt-green) and hit `curl -XPOST http://llama-swap/upstream/gpt-green`. When the health check passes, after_load primes the cache and unloads the old blue build, giving you a "zero-downtime" deployment path.
Example 4 - High-availability fail-over with Slack alert
If a model's /health endpoint fails, swap in a backup copy and page an on-call engineer.
```yaml
models:
  phi-primary:
    cmd: "llama-server --model /models/phi-3-primary.gguf --port ${PORT}"
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 0
  phi-backup:
    cmd: "llama-server --model /models/phi-3-backup.gguf --port ${PORT}"
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 0

hooks:
  on_startup:
    - load_model: phi-primary
  on_health_fail:
    - when:
        model: phi-primary
      do:
        - unload_model: phi-primary
        - load_model: phi-backup
        - http:
            url: "https://hooks.slack.com/services/T0000/B0000/XXXXXXXX"
            method: POST
            body: |
              { "text": "*phi-primary* failed health-check - swapped to *phi-backup*." }
```
The first time phi-primary returns a non-200 status on /health, llama-swap automatically unloads the problematic instance, spins up phi-backup, and sends a Slack webhook so someone can investigate.
Now, the structure of the hooks definition with `when` and `do` (or directly calling an action) is just a suggestion for thought. I'm also thinking about which conditions should be available in the `when` statement (e.g. `model`, `group`, `idle_for`). Pre-defined runtime macros (variables) such as `${_CURRENT_MODEL}`, as used in the examples above, may be useful as well.
Some benefits of this approach:
- Users chain several simple actions instead of waiting for a one-size-fits-all preload setting.
- No breaking changes: if the `hooks:` section is omitted, llama-swap should behave exactly as today.
- Every hook/action invocation could be logged at debug level, keeping troubleshooting simple.
- New events (e.g. `on_profile_switch`) or actions can be added without touching existing configs.
If you think it is a good idea, then maybe the next steps could be:
- Agree on hook names + JSON payload structure.
- Implement some hooks, such as on_startup, after_load and after_unload, plus the load_model / unload_model actions.
- Do not publish the feature in the public docs until the YAML config and hook names are stable. Early adopters can replicate the watcher scripts many of us run today and report gaps.
-
@henfiber I like the idea of hooks. Maybe instead of "hooks", name it "events" to stay consistent with the new internal event bus system. I'm quite against having complex logic in YAML; it's a markup language, after all. However, I think we could get pretty far starting with just the shell action. Data can be passed as env variables and via macro substitution in the command's args. This would decouple llama-swap from custom, user-specified logic, and it would also make it easier to troubleshoot.
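To illustrate the shell-only direction, a user-owned handler might look like the sketch below. This is purely hypothetical: neither the event wiring nor the `SWAP_MODEL` variable exist today; they stand in for whatever env vars / macro substitution would eventually be passed to the command. The `/upstream/<model>` endpoint and the architect/coder scenario come from earlier in this thread; the host and port are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical user script that a shell-only "events" action could run,
# e.g. after a model is unloaded. SWAP_MODEL is an assumed env variable name.
case "${SWAP_MODEL:-}" in
  coder)
    # The coder was evicted: bring the architect back.
    curl -s -o /dev/null "http://localhost:8080/upstream/architect"
    ;;
  *)
    echo "$(date -u +"%FT%TZ") no action for model '${SWAP_MODEL:-unknown}'" >&2
    ;;
esac
```

Keeping the logic in a script like this means llama-swap only needs to know how to run a command, which matches the decoupling and easier troubleshooting mentioned above.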
-
@mostlygeek "events" is OK too. I agree that you could start simple, as you suggested, with a minimal set of actions and events/hooks, and introduce new ones only if a common scenario is not covered. Maybe I overdid it a bit, because I am used to the Home Assistant automations YAML syntax, which also defines logic declaratively (trigger/condition/action triplets with a lot of predefined options). Structured/constrained logic has a benefit over a DSL: it can be parsed and built from a UI (and some freedom is regained by using Jinja2 templates in many cases). So, personally, I am used to defining logic in YAML.
-
This issue is stale because it has been open for 2 weeks with no activity.
-
This is a cool idea! If this is the draft UX we want to implement, I'm more than happy to take a stab at it. Personally, I don't really need this feature, but llama-swap's been working so well for me that I'd love to help contribute towards making it better. (Thank you for making such a solid piece of software!)
-
Hi @benfiola, I don't think there is a design we've settled on yet. A few key questions still need good answers before proceeding:
The discussion sort of went stale until you messaged, so that may be a sign for this feature too. :)
-
I'm just seeing this thread... my use case is to preload simple embedding and small "utility" models that I know other apps like Open WebUI will use. I can't always guarantee that I can configure or modify apps to call an endpoint on launch, and I often reload
-
I'm currently using llama-swap as a proxy between sglang and other services that use its API, such as Open WebUI or VS Code + Continue, in order to unload the model from VRAM after the service has been idle for a while. This works well (I had to tinker a bit to make sglang work with unloading, but it worked out in the end), but sglang takes around 30s to load the model before it can perform inference, and I'd like to minimize this latency for my users.
Let's take Open WebUI as an example: one way to accomplish this could be to make llama-swap preload a model as soon as a user first visits the webpage (if no model is loaded already?), so that the model loads while the user writes their prompt and, ideally, has finished loading by the time they send their request.
Is there a way to do this currently? For example, by running a preload command when the proxy receives a GET /v1/models, GET /get_model_info, or GET /health request (or whenever the proxy detects any request or connection)?