Replies: 1 comment 1 reply
- I need this feature. I'm running Ollama on a Xeon and we manage to support 32 concurrent users; I can't wait for llamafile, which is way faster, to support that.
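For reference, this is roughly how we exercise those 32 users against a local server. It's only a sketch: the `http://localhost:8080/v1/chat/completions` address and the OpenAI-style payload are assumptions about the local server's API, and whether the 32 requests are actually processed in parallel depends on the server's parallel-slot support, which is exactly the feature being asked for here.

```python
# Rough concurrency sketch: fire 32 chat requests at a local LLM server at once.
# Assumes an OpenAI-compatible endpoint at http://localhost:8080/v1/chat/completions;
# true parallelism depends on the server, not on this client.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local address

def ask(prompt: str) -> str:
    payload = json.dumps({
        "model": "local",  # placeholder; many local servers ignore the model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    prompts = [f"Summarize fact #{i} about CPUs." for i in range(32)]
    with ThreadPoolExecutor(max_workers=32) as pool:
        for answer in pool.map(ask, prompts):
            print(answer[:80])
```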
- I'm using Ollama to run several quantized 34B models at the same time on my 3090. I'm considering putting 32 GB of RAM in my old laptops, installing Linux, and running llamafile on them to turn them into inference servers for non-time-critical agentic workflows with quantized models. Would llamafile run and serve several models in parallel, similar to Ollama? I believe this would really bring llamafile up to a business-use-case level. Based on what Justine Tunney claims, underutilized legacy CPU hardware can leverage llamafile to liberate the world from GPU-dominated applications, and I think this is the missing piece if it isn't already built into llamafile.
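The rough setup I have in mind, until/unless llamafile gets native multi-model serving, is one llamafile server process per model, each on its own port, with the agent framework routing by model name. A Python sketch; the `./llamafile` path, the `--server`, `--port`, `--host`, `-m`, and `--nobrowser` flags, and the model filenames are assumptions on my part, so check `--help` on your build.

```python
# Sketch of a poor-man's multi-model setup: one llamafile server process per
# model, each on its own port, plus a tiny name -> port routing table.
# The binary path and CLI flags below are assumed from llamafile's
# llama.cpp-style interface; verify against your version's --help.
import subprocess

MODELS = {  # hypothetical model files
    "codestral-34b": "models/codestral-34b.Q4_K_M.gguf",
    "yi-34b":        "models/yi-34b.Q4_K_M.gguf",
}

def launch(base_port: int = 8081) -> dict[str, int]:
    """Start one server per model and return a model-name -> port map."""
    ports = {}
    for offset, (name, path) in enumerate(MODELS.items()):
        port = base_port + offset
        subprocess.Popen([
            "./llamafile", "-m", path,
            "--server", "--nobrowser",
            "--host", "127.0.0.1", "--port", str(port),
        ])
        ports[name] = port
    return ports

if __name__ == "__main__":
    routing = launch()
    for name, port in routing.items():
        print(f"{name} -> http://127.0.0.1:{port}")
```

Each old laptop could host one or two of these processes, and the workflow just picks the port for the model it needs, which is why I'm curious whether llamafile handles this well compared to Ollama's built-in scheduler.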