Offloading model layers to the GPU does not reduce the RAM load. Is this normal behavior? #6496

Folko-Ven · 2024-04-04T22:17:56Z

Win11, cuBLAS, latest commit.
Despite adding "--gpu-layers 3" and seeing the video memory load rise to 6.8GB, RAM consumption did not change at all, so all my dreams of running large models went up in smoke :( So my question is: is this normal behavior? Does offloading layers to the graphics card really not affect RAM consumption?

Dampfinchen · 2024-04-05T08:15:25Z

Try --no-mmap

1 reply

Folko-Ven Apr 5, 2024
Author

When using --no-mmap, the computer freezes as soon as the RAM runs out. At the same time, the GPU memory load is strangely low, around 2GB.

phymbert · 2024-04-05T10:37:06Z

Collaborator

Tensors offloaded to VRAM are normally unmapped:

https://github.com/ggerganov/llama.cpp/blob/a307375c02cac45cff53cf2520330b43fecc7718/llama.cpp#L3434-L3442

but it can be improved indeed:
https://github.com/ggerganov/llama.cpp/blob/a307375c02cac45cff53cf2520330b43fecc7718/llama.cpp#L1238-L1258

When you unmap a file, the operating system removes the mapping from your process’s virtual memory space. However, the data that was loaded into memory might still remain in the system’s page cache.
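To illustrate the map, touch, unmap sequence described above, here is a minimal Python sketch. It is not llama.cpp code: the helper `read_via_mmap` and the temporary file standing in for a model file are assumptions made for the example.

```python
import mmap
import os
import tempfile


def read_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map a file read-only, copy out one slice, then drop the mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily on first access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Reading the slice faults the backing pages into memory.
            data = mapped[offset:offset + length]
    # The mapping is closed here, so the address range is gone from this
    # process. The OS may still keep the file's pages in its page cache.
    return data


# Hypothetical stand-in for a model file: two 4 KiB regions.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" * 4096 + b"B" * 4096)
    demo_path = tmp.name

chunk = read_via_mmap(demo_path, 4096, 16)
os.remove(demo_path)
```

After `read_via_mmap` returns, the process no longer holds the mapping, but whether the pages leave physical memory is up to the operating system.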

6 replies

slaren Apr 5, 2024
Maintainer

In practice I doubt unmapping the unused regions will make much of a difference. Disabling mmap will already get the lowest memory usage possible; if that doesn't help, then you probably just need more RAM.

Folko-Ven Apr 5, 2024
Author

@slaren
To begin with, I would simply like to understand: should offloading layers to the GPU lead to a reduction in RAM usage, at least in theory? Or do the layers need to be copied to both RAM and the GPU?

Currently, I am trying to load a model that is 59GB in size. I have 64GB of RAM (63.2GB available), but 5GB of it is occupied by Windows 11. I am attempting to offload three layers (--gpu-layers 3) and I see the video memory being loaded (approximately 6.8GB out of 8GB). However, the system fills all the available RAM and then starts thrashing my SSD through the paging file.
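The figures in this post can be turned into a rough budget check. The sketch below is only back-of-the-envelope arithmetic. The model size and RAM numbers come from the post, while the 2GB of offloaded weights and the 2GB of runtime overhead (context and compute buffers) are hypothetical values chosen for illustration, since the thread does not state them.

```python
def ram_budget(model_gb, offloaded_gb, overhead_gb, free_ram_gb):
    """Return (RAM needed in GB, whether it fits in free RAM).

    Assumes offloaded weights need no copy in RAM and that
    overhead_gb covers everything else the process allocates.
    """
    needed = model_gb - offloaded_gb + overhead_gb
    return needed, needed <= free_ram_gb


# From the post: 59GB model, 63.2GB usable RAM, 5GB taken by Windows 11.
free_ram = 63.2 - 5.0

# Hypothetical: 2GB of weights offloaded, 2GB of runtime overhead.
needed, fits = ram_budget(59.0, 2.0, 2.0, free_ram)
print(f"need {needed:.1f}GB of RAM, have {free_ram:.1f}GB free, fits: {fits}")
```

Under these assumed values the model misses the budget by under 1GB, which shows how little headroom a 59GB model leaves on this machine when only a few layers are offloaded.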

slaren Apr 5, 2024
Maintainer

For the most part, the offloaded layers do not use RAM. The size of the CPU and GPU buffers is printed during loading; that should give you an indication of how much of each type of memory is being used for the model.
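One way to act on this advice is to sum the buffer sizes from the load log. The sketch below assumes log lines of the form `<backend> buffer size = <n> MiB`. That format, the `llm_load_tensors:` prefix and the sample values are assumptions for illustration, and the exact wording may differ between llama.cpp versions.

```python
import re

# Assumed line shape: "<backend> buffer size = <number> MiB".
BUFFER_LINE = re.compile(r"(\S+)\s+buffer size\s*=\s*([\d.]+)\s*MiB")


def sum_buffers(log_text: str) -> dict:
    """Total the reported buffer size (MiB) per backend name."""
    totals = {}
    for name, size in BUFFER_LINE.findall(log_text):
        totals[name] = totals.get(name, 0.0) + float(size)
    return totals


# Hypothetical sample output, not copied from a real run.
sample_log = """
llm_load_tensors:        CPU buffer size = 1000.00 MiB
llm_load_tensors:      CUDA0 buffer size =  250.50 MiB
"""

totals = sum_buffers(sample_log)
for backend, mib in totals.items():
    print(f"{backend}: {mib:.2f} MiB")
```

Comparing the CPU total against free RAM shows whether the part of the model that stays on the host can fit at all.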

Folko-Ven Apr 5, 2024
Author

@slaren @phymbert
I conducted testing with another model that fully fit into RAM.
You were right, offloading to the GPU does indeed reduce RAM usage, although not as effectively as I had hoped.
Apparently, the model I wanted to launch did not fit, even considering the offloading to the GPU.
I apologize for wasting your time unnecessarily.

Answer selected by Folko-Ven
