llm_load_tensors: VRAM used: 7337 MB? I have an RTX 4060 laptop GPU with 8188 MiB; the system is using only 77 MB, so how can it OOM at only ~7400 MB? #3190
-
I finally got CUDA offload working on Ubuntu 22.04 without Docker, BUT I can only offload 40 of the 43 total layers. How do I offload all 43 without crashing?
At 41/43 layers it's still OOM. Maybe I should ask: how do I make sure my Ubuntu is not using 77 MB (or anything) of the GPU? I'm using an HP Victus laptop with the "hybrid" mode BIOS setting. I thought discrete mode means the GPU is used 100%, so I went with hybrid. Please correct me if I am wrong, and I'd appreciate any help getting this working without OOM.
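(A quick way to see what is actually holding VRAM is `nvidia-smi`; a minimal check, assuming the standard NVIDIA driver tools are installed:)

```sh
# Total vs. used VRAM on the card.
nvidia-smi --query-gpu=memory.total,memory.used --format=csv

# Full summary; the process table at the bottom shows which processes
# (e.g. Xorg / gnome-shell on a hybrid laptop) are holding GPU memory.
nvidia-smi
```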
-
If you want to "free" that 77M, you have to stop the desktop environment service and work from physical console, because "desktop" is rendered through the GPU too, and it takes away from its memory. I believe the You seem to be very close to fitting the whole thing on the GPU, check if |
Beta Was this translation helpful? Give feedback.
If you want to "free" that 77M, you have to stop the desktop environment service and work from physical console, because "desktop" is rendered through the GPU too, and it takes away from its memory.
I believe the `VRAM used` value only shows the memory used by the model itself; additional memory is used for a couple of other things (e.g. the KV cache and scratch buffers). You seem to be very close to fitting the whole thing on the GPU. Check if `-b 1` helps (it will make prompt processing slower; you can fiddle with this value to see how high you can go before OOM). You can also try reducing the context size, and if none of that helps, use a smaller quant of the model.
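Putting those suggestions together, a hypothetical invocation (the model path, prompt, and context size are placeholders; `-ngl`, `-b`, and `-c` are the llama.cpp flags for offloaded GPU layers, batch size, and context length):

```sh
# Offload all 43 layers with the smallest batch and a reduced context.
# Lower -b trades prompt-processing speed for less VRAM; lower -c
# shrinks the KV cache, which also consumes GPU memory.
./main -m ./models/model-q4_k_m.gguf -ngl 43 -b 1 -c 2048 -p "Hello"
```

If that still OOMs, the remaining lever is a smaller quant of the same model (e.g. Q4_K_S or Q3_K_M instead of Q4_K_M).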