Cuda not utilized for token generation but only for prompt processing #3027
-
Hi, the model loads with:

llm_load_print_meta: model ftype = mostly Q4_K - Medium

The problem is that when I look at the usage under Ubuntu or Windows, I see the GPU working only while llama.cpp processes the prompt; once it starts generating tokens, all the work is done by the CPU while the GPU sits idle. I run the test with:

./main -t 7 -ngl 22 -m /LLM/models/airoboros-l2-70b-2.1.Q4_K_M.gguf --color -c 4096 -b 1024 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Write a short story about llamas"

I have the same problem even with 13B models that fit 100% into VRAM: even with 43/43 layers offloaded, token generation is done only by the CPU. Can anyone help me understand whether this is expected behavior or a bug? Thank you.
-
This makes me pretty skeptical that your method of measuring GPU usage is actually accurate. Also, from what I know, when offloading all layers you only want one thread. I think a patch was recently merged so that when all layers are offloaded it defaults to that behavior, but since you're passing -t explicitly it won't apply here. How are you measuring GPU usage in Linux?
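As a concrete illustration (just a sketch, with a placeholder model path), a fully offloaded 13B run with a single CPU thread would look something like this:

```sh
# Hypothetical example: a 13B model that fits entirely in VRAM,
# all 43 layers offloaded, single CPU thread.
./main -t 1 -ngl 43 -m /LLM/models/some-13b-model.Q4_K_M.gguf \
       -c 4096 -n 128 -p "Write a short story about llamas"
```

If everything fits on the GPU, generation should show sustained GPU activity and very little CPU load.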
-
Thank you for checking my question. I just tried the 13B model again with -t 1, and indeed the token generation is done 100% on the GPU. Previously I had tried with oobabooga under Windows, and I probably messed up something in the model config. Sorry, my fault.

With the 70B model, however, I use -t 7 on my 8-core CPU because I can't fit the whole model in VRAM and can only offload 22 layers. To see the usage I use nvtop or nvidia-smi. I just tried again, and what I see is this: when prompt evaluation starts, the GPU is working, but when it ends and token generation begins, the GPU sits at 0% with small spikes of 1-2% now and then, and only the CPU is active. I probably messed up something in the configuration, but I don't see any obvious error. When llama.cpp starts with the 70B model it prints:

llm_load_tensors: offloading 22 repeating layers to GPU

so the CUDA configuration seems fine. I expected to see the GPU working during the token generation phase as well, along with the CPU.
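For what it's worth, nvidia-smi can also sample at a short fixed interval, which makes brief bursts of GPU activity easier to catch; a minimal monitoring sketch:

```sh
# Print GPU utilization and memory use every 200 ms while llama.cpp is generating.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader -lms 200
```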
-
Not a problem.
I don't think you did anything wrong here. Layers have to be evaluated sequentially during token generation, so however many layers live on the CPU get evaluated first, then the ones on the GPU. With 22 layers offloaded, the GPU will be sitting idle for around 3/4 of the time, with brief spikes of activity. Depending on the tool, if it just checks GPU usage periodically, it's pretty likely to miss those brief periods of activity.
Particularly when running on the…
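Rough arithmetic behind that 3/4 figure, assuming a Llama 2 70B with 80 repeating layers (consistent with the "offloading 22 repeating layers" log line above): with 22 of 80 layers offloaded, the GPU handles at most 22/80 ≈ 28% of the per-token work, so it sits idle for roughly the remaining 72%. Since the CPU layers are slower per layer than the GPU ones, the GPU's busy share of wall-clock time is smaller still, which matches seeing only brief 1-2% spikes in a coarse-grained monitor.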