
JIT loaded model cutting off and unloading generating model #671


Description

@arty-hlr

Which version of LM Studio?
Example: LM Studio 0.3.15

Which operating system?
macOS

What is the bug?
When a JIT-loaded model is generating tokens, JIT loading a new model cuts the former model off and unloads it straight away instead of letting it finish inference. The context is local models served through Open WebUI with multiple users. If user A's request is generating tokens, user B can make a request that cuts off the generation of user A's request, which is not optimal. Having user B's request wait until user A's has finished would be perfectly fine; the waiting time is a good compromise compared to errors or hanging requests.

Unfortunately, disabling "Only keep last JIT loaded model" is not an option, as LM Studio doesn't automatically unload older models when RAM/VRAM is getting full. Ollama, for example, handles auto-unloading in a smarter way: multiple models can stay loaded until loading a new model would exceed the RAM/VRAM limit, at which point the oldest model is unloaded.

I believe there should at least be an option to not cut off a generating model when JIT loading, and a smarter loading policy similar to Ollama's should also be considered.
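
To illustrate the kind of policy meant here, below is a minimal Python sketch of memory-budgeted, LRU-style unloading. It is not based on any LM Studio internals; every name in it (`ModelPool`, `ensure_loaded`, the memory budget, the load/unload placeholders) is hypothetical.

```python
from collections import OrderedDict


class ModelPool:
    """Hypothetical sketch of an Ollama-style loading policy: keep models
    loaded until a new load would exceed the memory budget, then evict the
    least recently used model first. A real implementation would also wait
    for in-flight generation to finish before unloading, rather than cutting
    it off."""

    def __init__(self, memory_budget_bytes: int):
        self.memory_budget = memory_budget_bytes
        self.loaded = OrderedDict()  # model_id -> estimated size in bytes

    def ensure_loaded(self, model_id: str, estimated_size: int) -> None:
        if model_id in self.loaded:
            self.loaded.move_to_end(model_id)  # mark as most recently used
            return
        # Evict the oldest models until the new one fits within the budget.
        while self.loaded and sum(self.loaded.values()) + estimated_size > self.memory_budget:
            oldest_id = next(iter(self.loaded))
            self._unload(oldest_id)
        self._load(model_id, estimated_size)

    def _load(self, model_id: str, size: int) -> None:
        # Placeholder for the actual model load.
        self.loaded[model_id] = size

    def _unload(self, model_id: str) -> None:
        # Placeholder for the actual model unload.
        del self.loaded[model_id]
```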

Screenshots

  • JIT settings: (screenshot of the JIT settings panel)

Logs
not relevant

To Reproduce
Steps to reproduce the behavior:

  1. Set JIT settings as above
  2. Make a request that JIT loads model A and runs inference (e.g. a curl request to the OpenAI chat completions endpoint; see the script sketch after these steps)
  3. While model A is generating tokens, make a second request that would JIT load model B
  4. Observe model A being cut off and unloaded in favor of model B
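
The race can also be reproduced with a small script instead of two curl commands. This is only a sketch: the base URL assumes LM Studio's default local server address (`http://localhost:1234`), and `model-a` / `model-b` are placeholders for whatever models are available on your instance.

```python
import threading
import time

import requests  # assumes the `requests` package is installed

BASE_URL = "http://localhost:1234/v1"  # adjust to your LM Studio server address


def stream_chat(model: str, prompt: str) -> None:
    """Send a streaming chat completion request and print chunks as they arrive."""
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,
        timeout=600,
    )
    for line in resp.iter_lines():
        if line:
            print(f"[{model}] {line.decode('utf-8')[:80]}")


# Start a long generation on model A (JIT loads it).
t1 = threading.Thread(target=stream_chat, args=("model-a", "Write a long story about a lighthouse."))
t1.start()

# Give model A a moment to start generating, then trigger a JIT load of model B.
time.sleep(5)
t2 = threading.Thread(target=stream_chat, args=("model-b", "Hello!"))
t2.start()

t1.join()
t2.join()
# With "Only keep last JIT loaded model" enabled, model A's stream stops
# as soon as model B is loaded (the behavior reported above).
```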
