llama-server keeps loading and unloading the model from RAM: how to keep it in VRAM? #12800
-
I am using llama.cpp 6bf28f0 with an RX 570 GPU (Vulkan backend). When the server starts, it first loads the entire model into RAM and memory usage spikes; then it appears to upload it to VRAM (likely for the warmup run?), memory usage goes back to normal, and I can run inference as normal. However, a few seconds (3-5) after the first inference request completes, it starts rapidly unloading the model back into RAM, which makes memory usage spike again. It doesn't go down until the next request, which leaves me waiting for the model to be loaded into VRAM again. Is there a way to prevent this behavior?
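For context, the setup looks roughly like this; the model path, layer count, and port below are placeholders rather than my exact command:

```sh
# Start the Vulkan build of llama-server with all layers offloaded to the GPU
# (placeholder model path and flags).
./llama-server -m ./model.gguf -ngl 99 --host 127.0.0.1 --port 8080
```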
-
I think I figured out why this happens, and it doesn't seem to be an issue with llama.cpp. Instead, the Linux amdgpu driver appears to be periodically offloading the contents of VRAM into GTT because of DPM. I have found an extremely hacky solution that for some reason works: running radeontop in the background before launching the llama.cpp server somehow tricks the driver into not offloading the VRAM contents to GTT. I haven't found any other solution that works while keeping DPM on. Dockerfile for the radeontop background hack: https://github.com/zeozeozeo/radeontop-docker-hacks Will mark as solved for now because the issue seems more related to the AMD driver.
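A minimal sketch of the idea (not the actual Dockerfile from the repo above; whether dump mode alone keeps the GPU polled is an assumption):

```sh
# Keep radeontop polling the GPU in the background so the driver doesn't
# let it idle and evict VRAM contents to GTT. Dump mode writes to /dev/null
# so it can run without a terminal (assumption: dump-mode polling is enough).
radeontop -d /dev/null &
RADEONTOP_PID=$!

# Then launch the server as usual (model path and flags are placeholders).
./llama-server -m ./model.gguf -ngl 99 --port 8080

# Stop the background poller when done.
kill "$RADEONTOP_PID"
```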
-
You don't have a monitor attached to your 570, right? FYI you can fix this by setting `amdgpu.runpm=0`. With runtime power management enabled the GPU is basically sleeping, and when you run radeontop you're making it stay awake. I have a longer comment about this here.
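For anyone landing here, a minimal sketch of how that kernel parameter can be set (assumes a GRUB-based distro; file paths and the update command vary by distribution):

```sh
# Option 1: boot parameter. Add amdgpu.runpm=0 to the kernel command line,
# e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.runpm=0"
sudo update-grub   # or: grub2-mkconfig -o /boot/grub2/grub.cfg on some distros
sudo reboot

# Option 2: module option instead of a boot parameter. May require
# regenerating the initramfs if amdgpu is loaded early.
echo "options amdgpu runpm=0" | sudo tee /etc/modprobe.d/amdgpu-runpm.conf
```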