I would like to use a very large context (100k+ tokens) with my models, but I cannot, even for small models, because of VRAM limitations. Is it possible to configure llama.cpp so it has a lower VRAM cost per token of context?
Answered by zhentaoyu, Aug 6, 2024
You can try `-nkvo` to store the KV cache on the CPU, and use `-fa` (flash attention) to save CPU memory, especially for the first token. I did this on Intel GPUs before and also tried storing a transposed KV cache, which accelerated next-token generation speed a little.
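For reference, here is a minimal sketch of how those flags might be combined on the command line. The binary name, model path, context size, and prompt below are assumptions rather than anything from this thread, and flag spellings can differ between llama.cpp versions:

```bash
# Sketch (assumptions, not from the thread): run llama.cpp with a large context
# while keeping the KV cache in system RAM instead of VRAM.
#   -c 100000 : request a ~100k-token context window
#   -ngl 99   : offload the model layers to the GPU as usual
#   -nkvo     : do not offload the KV cache to the GPU (it stays in CPU memory)
#   -fa       : enable flash attention
# Adjust the binary name, model path, and prompt to your own build and setup.
./llama-cli -m ./models/my-model-q4_k_m.gguf \
    -c 100000 -ngl 99 -nkvo -fa \
    -p "Summarize the following document: ..."
```

The usual trade-off is that keeping the KV cache in system RAM tends to slow down token generation, since the cache is no longer read directly from VRAM, but it removes the per-token VRAM cost of the context.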