Which version of LM Studio?
Example: LM Studio 0.3.15
Which operating system?
macOS
What is the bug?
When a JIT-loaded model is generating tokens, JIT loading a new model cuts the former model off and unloads it immediately instead of letting it finish inference. The context is local models served to multiple users through open-webui: if user A's request is generating tokens, user B can make a request that cuts off the generation of user A's request, which is not optimal. Having user B's request wait until user A's is finished would be perfectly fine; the waiting time is a good compromise compared to errors or hanging requests.
Unfortunately, disabling "Only keep last JIT loaded model" is not an option, as LM Studio doesn't automatically unload older models when RAM/VRAM is getting full. Ollama, for example, handles auto-unloading in a smarter way: multiple models can stay loaded until loading a new model would exceed the RAM/VRAM limit, at which point the oldest model is unloaded.
I believe there should at least be an option to not cut off a generating model when JIT loading a new one, and a smarter loading strategy similar to Ollama's should be considered.
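For illustration only, here is a minimal sketch of the kind of loading policy described above: memory-bounded, evicting the oldest idle model first, and never evicting a model that is still generating. All names here are hypothetical and not part of LM Studio's or Ollama's actual API.

```python
from collections import OrderedDict

class ModelPool:
    """Hypothetical memory-bounded JIT pool: evict oldest *idle* models, never a busy one."""

    def __init__(self, memory_limit_bytes):
        self.memory_limit = memory_limit_bytes
        self.loaded = OrderedDict()  # model_id -> {"size": bytes, "busy": bool}

    def _used(self):
        return sum(m["size"] for m in self.loaded.values())

    def acquire(self, model_id, size):
        """Called when a request targets model_id; loads it JIT if needed."""
        if model_id in self.loaded:
            self.loaded.move_to_end(model_id)  # refresh recency
        else:
            # Free space by unloading the oldest models that are not mid-inference.
            for old_id in list(self.loaded):
                if self._used() + size <= self.memory_limit:
                    break
                if not self.loaded[old_id]["busy"]:
                    del self.loaded[old_id]  # unload idle model
            if self._used() + size > self.memory_limit:
                # In practice the new request would queue here until a busy
                # model finishes, rather than erroring out.
                raise RuntimeError("all remaining models are still generating")
            self.loaded[model_id] = {"size": size, "busy": False}
        self.loaded[model_id]["busy"] = True  # mark as generating
        return model_id

    def release(self, model_id):
        """Called when the request's generation completes."""
        self.loaded[model_id]["busy"] = False
```

The key difference from the current behavior is that a busy model is never a candidate for eviction; a new JIT load either fits alongside it or waits.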
Screenshots
JIT settings:
Logs
not relevant
To Reproduce
Steps to reproduce the behavior:
1. Set the JIT settings as shown above.
2. Make a request that JIT loads model A and runs inference (for example, a curl or scripted request to the OpenAI chat completions endpoint; see the sketch below).
3. While model A is generating tokens, make a second request that JIT loads model B.
4. Observe model A being cut off and unloaded in favor of model B.
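For concreteness, a minimal repro script that fires two concurrent requests against the local OpenAI-compatible endpoint. The base URL/port and the model identifiers ("model-a", "model-b") are assumptions; substitute the values from your own LM Studio server and model list.

```python
import threading
import time
import requests

# Assumed LM Studio local server endpoint; adjust host/port to your setup.
BASE_URL = "http://localhost:1234/v1/chat/completions"

def chat(model, prompt):
    """Send a non-streaming chat completion request; the model is JIT loaded on demand."""
    resp = requests.post(
        BASE_URL,
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=600,
    )
    print(model, "->", resp.status_code)

# "model-a" and "model-b" are placeholders for two different local model identifiers.
t1 = threading.Thread(target=chat, args=("model-a", "Write a long story about the sea."))
t2 = threading.Thread(target=chat, args=("model-b", "Hello"))

t1.start()      # JIT loads model A and starts generating
time.sleep(5)   # give model A a few seconds to begin producing tokens
t2.start()      # JIT loading model B unloads model A mid-generation
t1.join()
t2.join()
```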