llama-server keeps loading and unloading the model from RAM: how to keep it in VRAM? #12800
-
I am using llama.cpp 6bf28f0 with an RX 570 GPU (Vulkan backend). When the server starts, it first loads the entire model into RAM and memory usage spikes; then it appears to upload it to VRAM (likely for the warmup run?), memory usage goes back to normal, and I can run inference as normal. However, a few seconds (3-5) after the first inference request completes, it starts rapidly unloading the model back into RAM, which makes memory usage spike again. It doesn't go down until the next request, which leaves me waiting for the model to be loaded into VRAM again. Is there a way to prevent this behavior?
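For context, the setup looks roughly like this; the model path, layer count, and port below are placeholders rather than my exact command:

```sh
# Start the Vulkan build of llama-server with all layers offloaded to the GPU
# (placeholder model path and flags).
./llama-server -m ./model.gguf -ngl 99 --host 127.0.0.1 --port 8080
```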
-
I think I figured out why this happens, and it doesn't seem to be an issue with llama.cpp. Instead, the Linux amdgpu driver appears to be periodically offloading the contents of VRAM into GTT because of DPM. I have found an extremely hacky solution that for some reason works: running radeontop in the background before launching the llama.cpp server somehow tricks the driver into not offloading the VRAM contents to GTT. I haven't found any other solution that works while keeping DPM on. Dockerfile for the radeontop background hack: https://github.com/zeozeozeo/radeontop-docker-hacks Will mark as solved for now because the issue seems more related to the AMD driver.
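A minimal sketch of the idea (not the actual Dockerfile from the repo above; whether dump mode alone keeps the GPU polled is an assumption):

```sh
# Keep radeontop polling the GPU in the background so the driver doesn't
# let it idle and evict VRAM contents to GTT. Dump mode writes to /dev/null
# so it can run without a terminal (assumption: dump-mode polling is enough).
radeontop -d /dev/null &
RADEONTOP_PID=$!

# Then launch the server as usual (model path and flags are placeholders).
./llama-server -m ./model.gguf -ngl 99 --port 8080

# Stop the background poller when done.
kill "$RADEONTOP_PID"
```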
-
You don't have a monitor attached to your 570, right? FYI you can fix this by setting `amdgpu.runpm=0`. With runtime power management enabled the GPU is basically sleeping, and when you run radeontop you're making it stay awake. I have a longer comment about this here.
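For anyone landing here, a minimal sketch of how that kernel parameter can be set (assumes a GRUB-based distro; file paths and the update command vary by distribution):

```sh
# Option 1: boot parameter. Add amdgpu.runpm=0 to the kernel command line,
# e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.runpm=0"
sudo update-grub   # or: grub2-mkconfig -o /boot/grub2/grub.cfg on some distros
sudo reboot

# Option 2: module option instead of a boot parameter. May require
# regenerating the initramfs if amdgpu is loaded early.
echo "options amdgpu runpm=0" | sudo tee /etc/modprobe.d/amdgpu-runpm.conf
```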