
JIT loaded model cutting off and unloading generating model #671


Open
arty-hlr opened this issue May 21, 2025 · 0 comments

@arty-hlr

Which version of LM Studio?
Example: LM Studio 0.3.15

Which operating system?
macOS

What is the bug?
When a JIT-loaded model is generating tokens, JIT loading a new model cuts the former model off and unloads it straight away instead of letting it finish inference. The context is local models served to open-webui with multiple users: if user A's request is generating tokens, user B can send a request that cuts off the generation of user A's request, which is not optimal. Having user B's request wait until user A's is finished would be perfectly fine; the waiting time is a good compromise compared to errors or hanging requests.

Unfortunately, disabling "Only keep last JIT loaded model" is not an option, because LM Studio doesn't automatically unload older models when RAM/VRAM is getting full. ollama, for example, handles auto-unloading in a smarter way: multiple models can stay loaded until loading a new model would exceed the RAM/VRAM limit, at which point the oldest model is unloaded.

I believe there should at least be an option not to cut off generating models when JIT loading, and a smarter loading strategy similar to ollama's could also be considered.
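To make the ollama-style behavior I mean concrete, here is a rough sketch (purely illustrative Python; none of these names come from LM Studio or ollama): models stay loaded until a new load would exceed the memory budget, then the least recently used idle model is evicted, and a model that is still generating is never the one thrown out.

```python
# Purely illustrative; every name here is made up and nothing is LM Studio's
# or ollama's actual code.
from collections import OrderedDict


class JitModelCache:
    def __init__(self, memory_budget_bytes: int):
        self.memory_budget = memory_budget_bytes
        # model_id -> size in bytes, ordered from least to most recently used
        self.loaded: OrderedDict[str, int] = OrderedDict()
        # models that currently have an in-flight generation request
        self.generating: set[str] = set()

    def load(self, model_id: str, size_bytes: int) -> None:
        if model_id in self.loaded:
            self.loaded.move_to_end(model_id)  # already loaded: mark as most recent
            return
        # Evict least recently used *idle* models until the new one fits.
        while sum(self.loaded.values()) + size_bytes > self.memory_budget:
            for candidate in self.loaded:  # OrderedDict iterates oldest first
                if candidate not in self.generating:
                    del self.loaded[candidate]
                    break
            else:
                # Everything still loaded is generating: make the new request
                # wait instead of cutting an active generation off.
                raise RuntimeError("No idle model to evict; wait for generation to finish")
        self.loaded[model_id] = size_bytes
```

The important part is only the last branch: when every loaded model is still generating, the new request waits instead of an active model being unloaded mid-generation.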

Screenshots

  • JIT settings: (screenshot of the JIT settings panel)

Logs
not relevant

To Reproduce
Steps to reproduce the behavior:

  1. Set JIT settings as above
  2. Make a request that JIT loads model A and starts inference (e.g. a curl request to the OpenAI chat completions endpoint; see also the sketch after this list)
  3. While model A is generating tokens, make a second request that would JIT load model B
  4. Observe model A being cut off and unloaded in favor of model B
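For convenience, here is the same reproduction as a small Python script instead of raw curl requests. It is only a sketch: it assumes LM Studio's local server on its default http://localhost:1234 and uses placeholder model identifiers ("model-a", "model-b"); substitute any two models available in your instance.

```python
import threading
import time

import requests

URL = "http://localhost:1234/v1/chat/completions"


def ask(model: str, prompt: str) -> None:
    # Stream a chat completion so the cut-off is visible mid-generation.
    resp = requests.post(
        URL,
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,
        timeout=600,
    )
    for line in resp.iter_lines():
        if line:
            print(model, line[:120])


# Step 2: JIT load model A and start generating a long answer.
a = threading.Thread(target=ask, args=("model-a", "Write a very long story."))
a.start()

# Step 3: while model A is still streaming, JIT load model B with a second request.
time.sleep(5)
b = threading.Thread(target=ask, args=("model-b", "Hello!"))
b.start()

# Step 4: model A's stream stops and model A is unloaded in favor of model B.
a.join()
b.join()
```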