(V)RAM unexpectedly high #9260
ScottMcMac asked this question in Q&A (Unanswered)
Replies: 1 comment

I'm not sure, but I think the low-rank attention mechanism affects the KV cache size. The calculator you linked might not have been updated for it, so it may be reporting a value lower than the actual requirement.
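For a sense of the gap the low-rank (MLA) scheme can make, here is a rough per-token, per-layer comparison, assuming DeepSeek-V2's published attention dimensions (kv_lora_rank 512, a 64-dim decoupled RoPE key, and 128 heads with 192-dim keys and 128-dim values); whether a given runtime or calculator caches the compressed latent or the materialized per-head K/V is exactly the uncertainty here:

# Rough per-token, per-layer KV-cache sizes for DeepSeek-V2-style attention at f16.
# The dimensions come from the model's config; both cache layouts are assumptions.
kv_lora_rank, rope_dim = 512, 64        # compressed latent + decoupled RoPE key (MLA form)
n_head, k_dim, v_dim = 128, 192, 128    # materialized per-head K and V (uncompressed form)

compressed_bytes = (kv_lora_rank + rope_dim) * 2      # 1152 bytes per token per layer
materialized_bytes = n_head * (k_dim + v_dim) * 2     # 81920 bytes per token per layer
print(round(materialized_bytes / compressed_bytes))   # roughly 71x larger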
TL;DR: DeepSeek-Coder-V2-Instruct at Q4_K_M uses drastically more system memory than I thought I'd need based on a VRAM calculator. Is the VRAM calculator wrong, misleading, or just ignoring system RAM, or am I simply unaware of the correct way to load the model so that it runs in roughly the reported memory (without decreasing the context or quantizing the KV cache, unless the calculator assumes the latter)?
Hello, based on https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator I expected I could use the full context of deepseek-ai/DeepSeek-Coder-V2-Instruct on a system with 512GB of RAM (and 144GB of VRAM) with various quants, e.g. Q4_K_M. The calculator reports requirements for Q4_K_M of 133.10GB for the model and 227.67GB for the context.
However, when I go to run the model I get the same results as discussion #8520. Specifically, it says
llm_load_tensors: CPU buffer size = 135850.84 MiB
...
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 805306368032
It seems to be allocating a buffer of roughly 750GiB, much larger than the VRAM calculator suggested. (It tries to allocate the same-sized buffer whether or not I offload layers to the GPU.)
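The failing size is consistent with a plain f16 cache of full per-head K and V. A minimal sketch, assuming DeepSeek-Coder-V2's published config (60 layers, 128 attention heads, 192-dim keys, 128-dim values) and that the cache holds the uncompressed K/V rather than the MLA latent:

# Hypothetical size of an f16 KV cache holding full per-head K and V
# for DeepSeek-Coder-V2 at the full 163840-token context (no MLA compression).
n_layer, n_head_kv = 60, 128
k_dim, v_dim = 192, 128        # qk_nope (128) + qk_rope (64); v_head_dim
n_ctx, f16_bytes = 163840, 2

total = n_layer * n_ctx * n_head_kv * (k_dim + v_dim) * f16_bytes
print(total, total / 2**30)    # 805306368000 bytes -> 750.0 GiB

That comes to 805306368000 bytes, within a few bytes of the 805306368032 in the log, which suggests the cache really is laid out this way rather than in the compressed form the calculator may assume.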
The solution in #8520 was to decrease the context, but based on the VRAM calculator I should not have to. So is the VRAM calculator just wrong/misleading, or am I (and the OP in #8520) just not using the correct options for ./llama-server?
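If reducing the context does turn out to be the only option, the same assumed formula can be inverted to estimate the largest -c value that fits beside the weights in 512GB of RAM; this ignores compute buffers and other overhead, so treat it as an upper bound:

# Rough upper bound on the context that fits in 512GB of RAM next to the weights,
# under the same assumed full-K/V f16 cache layout as the sketch above.
ram_bytes = 512 * 2**30
weights_bytes = int(135850.84 * 2**20)           # "CPU buffer size" reported in the log
bytes_per_token = 60 * 128 * (192 + 128) * 2     # about 4.9 MB of cache per token

print((ram_bytes - weights_bytes) // bytes_per_token)   # on the order of 80k tokens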
I tried a Q4_K_M quant from bartowski and one I made myself. I've tried running inference with llama.cpp built for CPU only and with a separate build with CUDA support. I tried llama-cpp-python too, all with the same results.
One example of a command I tried to serve the model (llama.cpp built w/out CUDA support):
./llama-server -m ~/models/DeepSeek-Coder-V2-Instruct-Q4_K_M-GGUF/DeepSeek-Coder-V2-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 163840
Thanks in advance!