I see that irrespective of the `ngl` value, the KV buffer is always allocated on the GPU. I want to use an `n_ctx` that is larger than the GPU VRAM and offload layers selectively, but the KV buffer always being allocated on the GPU does not allow this.

Command 1 (all GPU layers):
```
./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf --no-display-prompt --ignore-eos -fa -b 24576 -ub 24576 -c 24576 -npp 256 -ntg 256 -npl 48 -ngl 33 -t 72
```
Output 1:
Command 2 (no GPU layers):
```
./build/bin/llama-batched-bench -m models/llama-2-7b-Q4_0_4_8_aarch64.gguf --no-display-prompt --ignore-eos -fa -b 24576 -ub 24576 -c 24576 -npp 256 -ntg 256 -npl 48 -ngl 0 -t 72
```
Output 2:
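For what it's worth, the context-level setting that appears to control this is `offload_kqv` in `llama_context_params` (exposed on the CLI as `-nkvo` / `--no-kv-offload` in builds whose common argument parser includes it). Below is a minimal sketch, assuming the C API from `llama.h` around this version; exact function names may differ between releases.

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    // Offload all layer weights to the GPU, as in Command 1.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 33;

    llama_model * model = llama_load_model_from_file(
        "models/llama-2-7b-Q4_0_4_8_aarch64.gguf", mparams);

    // Large context, but keep the KV cache in host memory:
    // offload_kqv = false places the KV buffer (and the attention ops
    // that read it) on the CPU backend instead of the GPU.
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx       = 24576;
    cparams.offload_kqv = false;

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... run inference ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

On the command line the equivalent would be appending `-nkvo` to the benchmark invocations above, again assuming the flag is available in this build. Note the trade-off: the whole KV cache then stays in host memory and the attention over cached tokens runs on the CPU, rather than the cache being split per offloaded layer.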