-
Big means that a single GPU is not able to hold all the parameters, even after quantization.
-
Yes ... you can use multi-GPU.
-
You can also offload only part of the model to the GPU(s) and run the rest on the CPU. Running a 70B LLaMA is possible on pure CPU with 64 GB RAM.
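As a rough sketch of partial offload with the llama.cpp CLI (the model file name and layer count here are illustrative, and the exact binary name can differ between builds; pick -ngl based on how many layers fit in your VRAM):

./main -m llama-2-70b.Q4_K_M.gguf -ngl 40 -p "Hello"

Omitting -ngl (it defaults to 0) keeps everything on the CPU, which is the pure-CPU 64 GB RAM case mentioned above.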
-
With the newest llama.cpp build and the 70B Q4_K_M model on an RTX 3090 you can put 48 of 80 layers on the GPU. I then get ~3 tokens/s.
-
This already works even with multiple dissimilar GPUs; for example, I'm using an rtx2070 and a p106-100. In case of multiple GPUs with different amounts of VRAM, you may have to fiddle a bit with the -ts parameter to fill the VRAM to the brim on all GPUs, but it works already. Communication between GPUs is not strictly necessary. There is a PR #2470 for enabling "nvlink like" GPU-to-GPU communication, but people are reporting varying results: sometimes it's faster and sometimes slower.
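For illustration, a sketch of combining -ngl with -ts (--tensor-split); the split ratio is an assumption roughly matching an 8 GB rtx2070 plus a 6 GB p106-100, and the model file name and layer count are hypothetical:

./main -m llama-2-70b.Q4_K_M.gguf -ngl 20 -ts 8,6 -p "Hello"

The -ts values are proportions, not gigabytes, so you may need to nudge them until both cards are filled close to the brim.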