Replies: 1 comment 1 reply
- I need this feature. I'm running Ollama on a Xeon and we manage to support 32 concurrent users; I can't wait for llamafile, which is way faster, to support that.
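For reference, this is roughly how we exercise those 32 users against a local server. It's only a sketch: the `http://localhost:8080/v1/chat/completions` address and the OpenAI-style payload are assumptions about the local server's API, and whether the 32 requests are actually processed in parallel depends on the server's parallel-slot support, which is exactly the feature being asked for here.

```python
# Rough concurrency sketch: fire 32 chat requests at a local LLM server at once.
# Assumes an OpenAI-compatible endpoint at http://localhost:8080/v1/chat/completions;
# true parallelism depends on the server, not on this client.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local address

def ask(prompt: str) -> str:
    payload = json.dumps({
        "model": "local",  # placeholder; many local servers ignore the model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    prompts = [f"Summarize fact #{i} about CPUs." for i in range(32)]
    with ThreadPoolExecutor(max_workers=32) as pool:
        for answer in pool.map(ask, prompts):
            print(answer[:80])
```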
- I'm using Ollama to run several quantized 34B models at the same time on my 3090. I'm considering putting 32 GB of RAM in my old laptops, installing Linux, and running llamafile on them to turn them into inference servers for non-time-critical agentic workflows with quantized models. Would llamafile run and serve several models in parallel, similar to Ollama? I believe this would really bring llamafile up to a business-use-case level. Based on what Justine Tunney claims, underutilized legacy CPU hardware can leverage llamafile to liberate the world from GPU-dominated applications, and I think this is the missing piece if it isn't already built into llamafile.
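The rough setup I have in mind, until/unless llamafile gets native multi-model serving, is one llamafile server process per model, each on its own port, with the agent framework routing by model name. A Python sketch; the `./llamafile` path, the `--server`, `--port`, `--host`, `-m`, and `--nobrowser` flags, and the model filenames are assumptions on my part, so check `--help` on your build.

```python
# Sketch of a poor-man's multi-model setup: one llamafile server process per
# model, each on its own port, plus a tiny name -> port routing table.
# The binary path and CLI flags below are assumed from llamafile's
# llama.cpp-style interface; verify against your version's --help.
import subprocess

MODELS = {  # hypothetical model files
    "codestral-34b": "models/codestral-34b.Q4_K_M.gguf",
    "yi-34b":        "models/yi-34b.Q4_K_M.gguf",
}

def launch(base_port: int = 8081) -> dict[str, int]:
    """Start one server per model and return a model-name -> port map."""
    ports = {}
    for offset, (name, path) in enumerate(MODELS.items()):
        port = base_port + offset
        subprocess.Popen([
            "./llamafile", "-m", path,
            "--server", "--nobrowser",
            "--host", "127.0.0.1", "--port", str(port),
        ])
        ports[name] = port
    return ports

if __name__ == "__main__":
    routing = launch()
    for name, port in routing.items():
        print(f"{name} -> http://127.0.0.1:{port}")
```

Each old laptop could host one or two of these processes, and the workflow just picks the port for the model it needs, which is why I'm curious whether llamafile handles this well compared to Ollama's built-in scheduler.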