I'm using DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf. This model has 48 decoder layers, so I used -ngl 48 and got this performance:
llama_perf_context_print: prompt eval time = 44.94 ms / 14 tokens ( 3.21 ms per token, 311.53 tokens per second)
llama_perf_context_print: eval time = 4604.98 ms / 160 runs ( 28.78 ms per token, 34.74 tokens per second)
However, reading through the logs I found that I can actually offload 49 layers, although I don't know what that final layer is. With -ngl 49:
llama_perf_context_print: prompt eval time = 30.16 ms / 14 tokens ( 2.15 ms per token, 464.22 tokens per second)
llama_perf_context_print: eval time = 2205.35 ms / 158 runs ( 13.96 ms per token, 71.64 tokens per second)
As you can see, the performance is much better, and VRAM usage only increases by about 60 MB. The problem is that it seems I can only offload this 49th layer after offloading all 48 decoder layers. Is it possible to offload this 49th layer first? Or maybe 24 layers (half offloaded) plus this final layer on the GPU?
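For reference, a minimal sketch of the two invocations that could produce the numbers above (the prompt is a placeholder, and the exact binary name may differ by build); only the -ngl value changes between runs:

./llama-cli -m DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -p "hypothetical prompt" -n 160 -ngl 48
./llama-cli -m DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -p "hypothetical prompt" -n 160 -ngl 49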