Replies: 1 comment
-
Yes, by default llama.cpp will offload to each GPU a fraction of the model proportional to the amount of free memory available on that GPU, but you can also configure a different split with `--tensor-split`.
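For instance, a minimal sketch (not from the original reply): it assumes a CUDA build of llama.cpp, the `llama-cli` binary, and a placeholder model path; the `24,16` ratio is just an example matching a 24 GB + 16 GB pair.

```sh
# Offload all layers to GPU, splitting the model roughly 24:16
# between GPU 0 and GPU 1 (values are proportions, not gigabytes)
./llama-cli -m ./models/model.gguf -ngl 99 --tensor-split 24,16 -p "Hello"
```

The same `--tensor-split` (or `-ts`) option is accepted by `llama-server` as well.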
-
Hi. I have a 3090 (24 GB) and a 4080 (16 GB) at home, and thought I should try combining them to run bigger models.
I went to aphrodite and vllm first, since they are supposedly the go-tos for multi-GPU distribution, but both assume all GPUs have the same amount of VRAM available, so models won't load when I try to use both cards.
Does llama.cpp support an uneven split of GBs/layers between multiple GPUs?
(I have a slow-ish internet connection, so it took ages to download a big AWQ model. Thought I'd ask here before downloading a GGUF version.)