I'm using DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf. This model has 48 decoder layers, so I used -ngl 48 and got this performance:
llama_perf_context_print: prompt eval time = 44.94 ms / 14 tokens ( 3.21 ms per token, 311.53 tokens per second)
llama_perf_context_print: eval time = 4604.98 ms / 160 runs ( 28.78 ms per token, 34.74 tokens per second)
However, reading through the logs I found that I can actually offload 49 layers, although I don't know what that final layer is. With -ngl 49:
llama_perf_context_print: prompt eval time = 30.16 ms / 14 tokens ( 2.15 ms per token, 464.22 tokens per second)
llama_perf_context_print: eval time = 2205.35 ms / 158 runs ( 13.96 ms per token, 71.64 tokens per second)
As you can see, the performance is much better, and VRAM usage only increases by about 60 MB. The problem is that it seems I can only offload this 49th layer after offloading all 48 decoder layers. Is it possible to offload this 49th layer first? Or maybe 24 layers (half offloaded) plus this final layer on the GPU?
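For reference, a minimal sketch of the two invocations that could produce the numbers above (the prompt is a placeholder, and the exact binary name may differ by build); only the -ngl value changes between runs:

./llama-cli -m DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -p "hypothetical prompt" -n 160 -ngl 48
./llama-cli -m DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -p "hypothetical prompt" -n 160 -ngl 49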