I would like to use a very large context (100k+ tokens) with my models, but I cannot, even for small models, because of VRAM limitations. Is it possible to configure llama.cpp so it has a lower VRAM cost per token of context?
Answered by zhentaoyu, Aug 6, 2024
You can try `-nkvo` to store the KV cache on the CPU, and use `-fa` (flash attention) to save CPU memory, especially for the first token. I did this on Intel GPUs before and also tried storing a transposed KV cache, which accelerated next-token generation speed a little.
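For reference, here is a minimal sketch of how those flags might be combined on the command line. The binary name, model path, context size, and prompt below are assumptions rather than anything from this thread, and flag spellings can differ between llama.cpp versions:

```bash
# Sketch (assumptions, not from the thread): run llama.cpp with a large context
# while keeping the KV cache in system RAM instead of VRAM.
#   -c 100000 : request a ~100k-token context window
#   -ngl 99   : offload the model layers to the GPU as usual
#   -nkvo     : do not offload the KV cache to the GPU (it stays in CPU memory)
#   -fa       : enable flash attention
# Adjust the binary name, model path, and prompt to your own build and setup.
./llama-cli -m ./models/my-model-q4_k_m.gguf \
    -c 100000 -ngl 99 -nkvo -fa \
    -p "Summarize the following document: ..."
```

The usual trade-off is that keeping the KV cache in system RAM tends to slow down token generation, since the cache is no longer read directly from VRAM, but it removes the per-token VRAM cost of the context.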