Replies: 3 comments 1 reply
-
There should be no allocations after the first few evaluations. Please include a log that shows the OOM error.
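One way to produce such a log is to sample `nvidia-smi` periodically (e.g. `nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 10 >> gpu_mem.csv`) while the run is in progress. A small helper like the sketch below (hypothetical, not part of llama.cpp; it assumes samples are already parsed into `(seconds, MiB)` pairs) can then turn those samples into a growth rate, to quantify the "creeping up" before filing the report:

```python
# Hypothetical helper: estimate GPU-memory growth from periodic
# nvidia-smi samples. Each sample is (seconds_since_start, mebibytes_used).

def leak_rate_mib_per_hour(samples):
    """Least-squares slope of memory use over time, in MiB/hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return (num / den) * 3600.0  # MiB/s -> MiB/h

# Synthetic example: a 100 MiB/h creep sampled every 10 minutes.
demo = [(i * 600, 20000 + i * 600 * 100 / 3600) for i in range(12)]
print(round(leak_rate_mib_per_hour(demo), 1))  # prints 100.0
```

A steadily positive slope over hours would support the "creeping up" observation; a flat slope would point to a one-time spike instead.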
-
Thanks for the answer. But if the static buffers are larger than free memory, why doesn't it fail outright? It's such a waste of time. And I actually plan to use the slot KV cache elsewhere with CPU-only inference, so that's fine.
-
I am trying to process a large prompt. I tuned KV cache quantization and offloaded as many layers as possible to the GPU; it starts processing and all looks fine... and after a few hours it fails with OOM.
nvidia-smi shows llama.cpp's GPU memory usage steadily creeping up. Is this expected behavior, and if so, how much reserve should I keep? It seems like about 10% is needed.
I searched for similar discussions, but those were about allocation failures before submitting any prompt. This one happens in the middle of processing.
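For context, the setup described is along these lines (a sketch only — the model path, layer count, and context size are placeholders, not taken from the report; `-ngl`, `-c`, `-ctk`, and `-ctv` are the standard llama.cpp flags for GPU offload, context size, and KV cache quantization):

```shell
# Sketch of the kind of invocation described (model path, layer count,
# and context size are placeholders). -ngl offloads layers to the GPU;
# -ctk/-ctv select a quantized KV cache type (q8_0 here). Leaving ~10%
# of VRAM free gives headroom for temporary compute buffers.
./llama-cli -m model.gguf -f prompt.txt \
  -c 32768 -ngl 40 \
  -ctk q8_0 -ctv q8_0
```

If memory still creeps up with settings like these, capturing periodic nvidia-smi readings alongside the run makes the failure much easier to diagnose.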