(V)RAM unexpectedly high #9260
ScottMcMac asked this question in Q&A (Unanswered)
Replies: 1 comment

I'm not sure, but I think the low-rank attention mechanism affects the KV cache size. The calculator you linked might not have been updated for it, so it may be reporting a value lower than the actual requirement.
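For a sense of the gap the low-rank (MLA) scheme can make, here is a rough per-token, per-layer comparison, assuming DeepSeek-V2's published attention dimensions (kv_lora_rank 512, a 64-dim decoupled RoPE key, and 128 heads with 192-dim keys and 128-dim values); whether a given runtime or calculator caches the compressed latent or the materialized per-head K/V is exactly the uncertainty here:

# Rough per-token, per-layer KV-cache sizes for DeepSeek-V2-style attention at f16.
# The dimensions come from the model's config; both cache layouts are assumptions.
kv_lora_rank, rope_dim = 512, 64        # compressed latent + decoupled RoPE key (MLA form)
n_head, k_dim, v_dim = 128, 192, 128    # materialized per-head K and V (uncompressed form)

compressed_bytes = (kv_lora_rank + rope_dim) * 2      # 1152 bytes per token per layer
materialized_bytes = n_head * (k_dim + v_dim) * 2     # 81920 bytes per token per layer
print(round(materialized_bytes / compressed_bytes))   # roughly 71x larger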
TL;DR: DeepSeek-Coder-V2-Instruct at Q4_K_M uses drastically more system memory than I thought I'd need based on a VRAM calculator. Is the VRAM calculator wrong, misleading, or just ignoring system RAM, or am I simply unaware of the correct way to load the model so that it runs in roughly the reported memory (without decreasing the context or quantizing the KV cache, unless the calculator assumes the latter)?
Hello, based on https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator I expected I could use the full context of deepseek-ai/DeepSeek-Coder-V2-Instruct on a system with 512GB of RAM (and 144GB of VRAM) with various quants, e.g. Q4_K_M. The calculator reports requirements for Q4_K_M of 133.10GB for the model and 227.67GB for the context.
However, when I go to run the model I get the same results as discussion #8520. Specifically, it says
llm_load_tensors: CPU buffer size = 135850.84 MiB
...
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 805306368032
It seems to be allocating a buffer of roughly 750GiB, much larger than the VRAM calculator suggested. (It tries to allocate the same-sized buffer whether or not I offload layers to the GPU.)
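The failing size is consistent with a plain f16 cache of full per-head K and V. A minimal sketch, assuming DeepSeek-Coder-V2's published config (60 layers, 128 attention heads, 192-dim keys, 128-dim values) and that the cache holds the uncompressed K/V rather than the MLA latent:

# Hypothetical size of an f16 KV cache holding full per-head K and V
# for DeepSeek-Coder-V2 at the full 163840-token context (no MLA compression).
n_layer, n_head_kv = 60, 128
k_dim, v_dim = 192, 128        # qk_nope (128) + qk_rope (64); v_head_dim
n_ctx, f16_bytes = 163840, 2

total = n_layer * n_ctx * n_head_kv * (k_dim + v_dim) * f16_bytes
print(total, total / 2**30)    # 805306368000 bytes -> 750.0 GiB

That comes to 805306368000 bytes, within a few bytes of the 805306368032 in the log, which suggests the cache really is laid out this way rather than in the compressed form the calculator may assume.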
The solution in #8520 was to decrease the context, but based on the VRAM calculator I should not have to. So is the VRAM calculator just wrong/misleading, or am I (and the OP in #8520) just not using the correct options for ./llama-server?
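If reducing the context does turn out to be the only option, the same assumed formula can be inverted to estimate the largest -c value that fits beside the weights in 512GB of RAM; this ignores compute buffers and other overhead, so treat it as an upper bound:

# Rough upper bound on the context that fits in 512GB of RAM next to the weights,
# under the same assumed full-K/V f16 cache layout as the sketch above.
ram_bytes = 512 * 2**30
weights_bytes = int(135850.84 * 2**20)           # "CPU buffer size" reported in the log
bytes_per_token = 60 * 128 * (192 + 128) * 2     # about 4.9 MB of cache per token

print((ram_bytes - weights_bytes) // bytes_per_token)   # on the order of 80k tokens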
I tried a Q4_K_M quant from bartowski and one I made myself. I've tried running inference with llama.cpp built for CPU only and with a separate build with CUDA support. I tried llama-cpp-python too, all with the same results.
One example of a command I tried to serve the model (llama.cpp built w/out CUDA support):
./llama-server -m ~/models/DeepSeek-Coder-V2-Instruct-Q4_K_M-GGUF/DeepSeek-Coder-V2-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 163840
Thanks in advance!