
Is there a way to lower VRAM usage without lowering context size? #8879

Closed · Answered by zhentaoyu
curvedinf asked this question in Q&A

You can try `-nkvo` (`--no-kv-offload`) to keep the KV cache in CPU memory instead of VRAM, and use `-fa` (`--flash-attn`) to save CPU memory, especially for the first token. I did this on Intel GPUs before and also tried storing a transposed KV cache, which sped up next-token generation a little.
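For reference, a minimal sketch of how the two flags might be combined with the `llama-cli` binary is shown below. The model path, context size, and GPU layer count are placeholder values, not taken from this discussion:

```sh
# Keep the desired context size but store the KV cache in host (CPU) memory
# instead of VRAM (-nkvo / --no-kv-offload), and enable flash attention
# (-fa / --flash-attn) to reduce memory use during prompt processing.
# Model path, context size, and layer count below are example placeholders.
./llama-cli \
  -m ./models/your-model.gguf \
  -c 32768 \
  -ngl 99 \
  -nkvo \
  -fa
```

The expected trade-off is that attention over the CPU-resident KV cache no longer runs entirely on the GPU, so generation is typically slower than keeping the cache in VRAM, in exchange for the lower VRAM footprint.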

Answer selected by curvedinf