Replies: 2 comments
-
3 min for 2000 tokens is ~11 tps, which sounds about right for CPU mode. If you feel like something is slower than it should be, it would help if you post the …
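For reference, the arithmetic is just prompt tokens divided by elapsed time; a trivial sketch using the numbers reported in this thread:

```python
# Back-of-the-envelope throughput from the numbers reported above:
# ~2000 prompt tokens processed in ~3 minutes.
prompt_tokens = 2000
elapsed_seconds = 3 * 60

tokens_per_second = prompt_tokens / elapsed_seconds
print(f"{tokens_per_second:.1f} tokens/s")  # ~11.1 tokens/s
```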
-
So this is normal, huh. I wanted to add an AI feature to my desktop app, but for it to give good answers it needs to base them on a long technical document (40,000 pages). I just pull 1 page out of those 40,000 via search and add it to the prompt. That gives good results even on the Q4 version of a fine-tuned llama2. This made me think my desktop app would work on any good PC even without a GPU. Unfortunately, with this problem I don't know what to do anymore, lol. The OpenAI API would be at least $500 per month even with a few users. I could try fine-tuning llama2-7b on all 40,000 pages, but people say that doesn't really work; RAG is always better than fine-tuning in this specific case. Anyway, I'll stop rambling. Thanks for the clarification.
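For what it's worth, the retrieve-one-page-then-prompt flow I'm describing looks roughly like this; this is only a sketch assuming the llama-cpp-python bindings, and `search_pages` is a hypothetical helper that does the page lookup (model path and prompt wording are placeholders):

```python
# Rough sketch of the "search one page, add it to the prompt" approach.
# Assumes the llama-cpp-python bindings; `search_pages` is a hypothetical
# helper that returns the most relevant page of the 40,000-page document.
from llama_cpp import Llama

llm = Llama(model_path="tulpar-7b-v0.Q4_K_M.gguf", n_ctx=4096)

def answer(question: str, search_pages) -> str:
    page = search_pages(question)  # retrieve a single relevant page
    prompt = (
        "Use the following reference text to answer the question.\n\n"
        f"Reference:\n{page}\n\n"
        f"Question: {question}\nAnswer:"
    )
    out = llm(prompt, max_tokens=256)
    return out["choices"][0]["text"]
```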
-
I installed llama.cpp and I run [tulpar-7b-v0.Q4_K_M.gguf] on it.
I don't have a dedicated GPU; I run it on my 12th-gen i7-1255U CPU with 16 GB of RAM.
It's good and fast when the prompt is 1 or 2 lines, but it takes more than 3 minutes (!) when I put in the 2000-token prompt I created for my use case (see the rough timing sketch at the end of this post).
I looked at this thread: #229
Someone said: "This is not a llama.cpp problem, this is a 4-bit problem. 8-bit does not have this; sure, it's slow, but it starts generating right at the start. But 4-bit has a delay before anything starts. GPU/CPU does not matter, there's a delay with 4-bit."
Should I install [tulpar-7b-v0.Q8_0.gguf] instead, then? But I imagine it will get slower?
Here (qwopqwop200/GPTQ-for-LLaMa#87) they suggested falling back to a PyTorch matmul for large input sizes, but that's for GPTQ; I installed a GGUF.
I could go for anything good based on llama2-7b. I chose this one without much thought, so if you have a model in mind where this problem (slow generation with a 2000-token prompt) wouldn't occur, I'm happy to switch.
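In case it helps with debugging, here's a rough sketch (not what I actually ran, just an example assuming the llama-cpp-python bindings; model path, thread count and prompt are placeholders) of how one could check whether the time goes into processing the prompt or into generating tokens:

```python
# Rough sketch: time a single call and report how many tokens were in the
# prompt vs. generated, to see where the ~3 minutes go. Assumes the
# llama-cpp-python bindings; model path, thread count and prompt are placeholders.
import time
from llama_cpp import Llama

llm = Llama(model_path="tulpar-7b-v0.Q4_K_M.gguf", n_ctx=4096, n_threads=8)

long_prompt = "..."  # the ~2000-token prompt

start = time.time()
out = llm(long_prompt, max_tokens=64)
elapsed = time.time() - start

usage = out["usage"]  # token counts reported by the bindings
print(f"prompt tokens:    {usage['prompt_tokens']}")
print(f"generated tokens: {usage['completion_tokens']}")
print(f"total time:       {elapsed:.1f}s")
```

As far as I understand, llama.cpp also prints its own timing breakdown (prompt eval vs. eval) when verbose output is enabled, which should show the same split.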
Thanks