Cuda not utilized for token generation but only for prompt processing #3027
-
Hi, the model loads with:

llm_load_print_meta: model ftype = mostly Q4_K - Medium

The problem is that when I look at the usage under Ubuntu or Windows, I see the GPU working only while llama.cpp processes the prompt; once it starts generating tokens, all the work is done by the CPU while the GPU sits idle. I run the test with:

./main -t 7 -ngl 22 -m /LLM/models/airoboros-l2-70b-2.1.Q4_K_M.gguf --color -c 4096 -b 1024 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Write a short story about llamas"

I have the same problem even with 13B models that fit 100% into VRAM: even with 43/43 layers offloaded, token generation is done only by the CPU. Can anyone help me understand whether this is expected behavior or a bug? Thank you.
-
This makes me pretty skeptical that your method of measuring GPU usage is actually accurate. Also, from what I know, when offloading all layers you only want one thread. I think a patch was recently merged so that when all layers are offloaded it defaults to that behavior, but since you're passing -t explicitly it won't apply here. How are you measuring GPU usage in Linux?
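As a concrete illustration (just a sketch, with a placeholder model path), a fully offloaded 13B run with a single CPU thread would look something like this:

```sh
# Hypothetical example: a 13B model that fits entirely in VRAM,
# all 43 layers offloaded, single CPU thread.
./main -t 1 -ngl 43 -m /LLM/models/some-13b-model.Q4_K_M.gguf \
       -c 4096 -n 128 -p "Write a short story about llamas"
```

If everything fits on the GPU, generation should show sustained GPU activity and very little CPU load.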
-
Thank you for checking my question. I just tried the 13B model again with -t 1, and indeed the token generation is done 100% on the GPU. Previously I had tried with oobabooga under Windows, and I probably messed up something in the model config. Sorry, my fault.

With the 70B model, however, I use -t 7 on my 8-core CPU because I can't fit the whole model in VRAM and can only offload 22 layers. To see the usage I use nvtop or nvidia-smi. I just tried again, and what I see is this: when prompt evaluation starts, the GPU is working, but when it ends and token generation begins, the GPU sits at 0% with small spikes of 1-2% now and then, and only the CPU is active. I probably messed up something in the configuration, but I don't see any obvious error. When llama.cpp starts with the 70B model it prints:

llm_load_tensors: offloading 22 repeating layers to GPU

so the CUDA configuration seems fine. I expected to see the GPU working during the token generation phase as well, along with the CPU.
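For what it's worth, nvidia-smi can also sample at a short fixed interval, which makes brief bursts of GPU activity easier to catch; a minimal monitoring sketch:

```sh
# Print GPU utilization and memory use every 200 ms while llama.cpp is generating.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader -lms 200
```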
-
Not a problem.
I don't think you did anything wrong here. Layers have to be evaluated sequentially during token generation, so however many layers live on the CPU get evaluated first, then the ones on the GPU. With 22 layers offloaded, the GPU will be sitting idle for around 3/4 of the time, with brief spikes of activity. Depending on the tool, if it just checks GPU usage periodically, it's pretty likely to miss those brief periods of activity.
Particularly when running on the…
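Rough arithmetic behind that 3/4 figure, assuming a Llama 2 70B with 80 repeating layers (consistent with the "offloading 22 repeating layers" log line above): with 22 of 80 layers offloaded, the GPU handles at most 22/80 ≈ 28% of the per-token work, so it sits idle for roughly the remaining 72%. Since the CPU layers are slower per layer than the GPU ones, the GPU's busy share of wall-clock time is smaller still, which matches seeing only brief 1-2% spikes in a coarse-grained monitor.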