Replies: 2 comments
-
3 min for 2000 tokens is ~11 tps, which sounds about right for CPU mode. If you feel like something is slower than it should be, it would help if you post the …
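For reference, the arithmetic is just prompt tokens divided by elapsed time; a trivial sketch using the numbers reported in this thread:

```python
# Back-of-the-envelope throughput from the numbers reported above:
# ~2000 prompt tokens processed in ~3 minutes.
prompt_tokens = 2000
elapsed_seconds = 3 * 60

tokens_per_second = prompt_tokens / elapsed_seconds
print(f"{tokens_per_second:.1f} tokens/s")  # ~11.1 tokens/s
```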
-
So this is normal, huh. I wanted to add an AI feature to my desktop app, but for it to give good answers it needs to base them on a long technical document (40,000 pages). I just pull 1 page out of those 40,000 via search and add it to the prompt. That gives good results even on the Q4 version of a fine-tuned llama2. This made me think my desktop app would work on any good PC even without a GPU. Unfortunately, with this problem I don't know what to do anymore, lol. The OpenAI API would be at least $500 per month even with a few users. I could try fine-tuning llama2-7b on all 40,000 pages, but people say that doesn't really work; RAG is always better than fine-tuning in this specific case. Anyway, I'll stop rambling. Thanks for the clarification.
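For what it's worth, the retrieve-one-page-then-prompt flow I'm describing looks roughly like this; this is only a sketch assuming the llama-cpp-python bindings, and `search_pages` is a hypothetical helper that does the page lookup (model path and prompt wording are placeholders):

```python
# Rough sketch of the "search one page, add it to the prompt" approach.
# Assumes the llama-cpp-python bindings; `search_pages` is a hypothetical
# helper that returns the most relevant page of the 40,000-page document.
from llama_cpp import Llama

llm = Llama(model_path="tulpar-7b-v0.Q4_K_M.gguf", n_ctx=4096)

def answer(question: str, search_pages) -> str:
    page = search_pages(question)  # retrieve a single relevant page
    prompt = (
        "Use the following reference text to answer the question.\n\n"
        f"Reference:\n{page}\n\n"
        f"Question: {question}\nAnswer:"
    )
    out = llm(prompt, max_tokens=256)
    return out["choices"][0]["text"]
```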
-
I installed llama.cpp and I run [tulpar-7b-v0.Q4_K_M.gguf] on it.
I don't have a dedicated GPU; I run it on my 12th-gen i7-1255U CPU with 16 GB of RAM.
It's good and fast when the prompt is 1 or 2 lines, but it takes more than 3 minutes (!) when I put in the 2000-token prompt I created for my use case (see the rough timing sketch at the end of this post).
I looked at this thread: #229
Someone said: "This is not a llama.cpp problem, this is a 4-bit problem. 8-bit does not have this; sure, it's slow, but it starts generating right at the start. But 4-bit has a delay before anything starts. GPU/CPU does not matter, there's a delay with 4-bit."
Should I install [tulpar-7b-v0.Q8_0.gguf] instead, then? But I imagine it will get slower?
Here (qwopqwop200/GPTQ-for-LLaMa#87) they suggested falling back to a PyTorch matmul for large input sizes, but that's for GPTQ; I installed a GGUF.
I could go for anything good based on llama2-7b. I chose this one without much thought, so if you have a model in mind where this problem (slow generation with a 2000-token prompt) wouldn't occur, I'm happy to switch.
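In case it helps with debugging, here's a rough sketch (not what I actually ran, just an example assuming the llama-cpp-python bindings; model path, thread count and prompt are placeholders) of how one could check whether the time goes into processing the prompt or into generating tokens:

```python
# Rough sketch: time a single call and report how many tokens were in the
# prompt vs. generated, to see where the ~3 minutes go. Assumes the
# llama-cpp-python bindings; model path, thread count and prompt are placeholders.
import time
from llama_cpp import Llama

llm = Llama(model_path="tulpar-7b-v0.Q4_K_M.gguf", n_ctx=4096, n_threads=8)

long_prompt = "..."  # the ~2000-token prompt

start = time.time()
out = llm(long_prompt, max_tokens=64)
elapsed = time.time() - start

usage = out["usage"]  # token counts reported by the bindings
print(f"prompt tokens:    {usage['prompt_tokens']}")
print(f"generated tokens: {usage['completion_tokens']}")
print(f"total time:       {elapsed:.1f}s")
```

As far as I understand, llama.cpp also prints its own timing breakdown (prompt eval vs. eval) when verbose output is enabled, which should show the same split.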
Thanks