-
You can calculate this yourself relatively easily. The main bottleneck is getting the data onto the GPU. Each token requires traversing every layer in the model, so essentially all of the weights are needed. The Q4_K_S Falcon 180B is about 101.4GB. A PCIe Gen 4 x16 slot has a maximum bandwidth of 32GB/sec. So even if the actual calculation happened instantly and you never had to transfer data back to RAM, the best you could get is about 0.32 tokens/sec (one token every ~3.2 seconds). (It might be a tiny bit better than that, since you could keep 5 layers resident on the GPU and only stream in the rest.) This is a rough back-of-the-envelope calculation. Anyway, I guess the TL;DR is: wait for PCIe Gen 6-7. Gen 6 should have a max bandwidth of 128GB/sec.
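Putting the same numbers into a few lines of Python (just the arithmetic above, using theoretical x16 bus bandwidths, ignoring protocol overhead):

```python
# Back-of-the-envelope bound on streamed-weight token rate.
# Figures are the ones quoted above; real-world PCIe throughput is somewhat lower.
model_bytes = 101.4e9      # Q4_K_S Falcon 180B, ~101.4 GB of weights
buses = {
    "PCIe Gen 4 x16": 32e9,    # bytes/sec, theoretical
    "PCIe Gen 6 x16": 128e9,   # bytes/sec, theoretical
}

for name, bw in buses.items():
    sec_per_token = model_bytes / bw   # every weight crosses the bus once per token
    print(f"{name}: {sec_per_token:.2f} s/token ({1.0 / sec_per_token:.2f} tokens/sec)")
# Gen 4: ~3.2 s/token (~0.32 tok/s); Gen 6: ~0.8 s/token (~1.26 tok/s)
```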
-
I have a 9900K with 128GB RAM and an RTX 4070 with 12GB VRAM. llama.cpp runs the Falcon 180B chat model (4-bit quantized, from TheBloke's last upload), but it only allows offloading 6 layers to the GPU, resulting in about 0.3 to 0.5 tokens/sec of throughput while the GPU sits there mostly idle, which is not usable. The feasibility question I have is: would it be possible to dynamically offload every layer of the model onto the GPU for compute by pipelining the streaming of each layer's weights into the GPU with the GPU number-crunching of the layers? If feasible, I think there is potential for around a 10x speedup (5 tok/sec) on my setup if the GPU is kept fully loaded and the CPU is not relied on for any compute, which would make it actually usable.
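For a rough sense of what perfect overlap of upload and compute would buy, here is a toy timing model of the proposed pipelining (a sketch, not llama.cpp code; the 80-layer count and the 2 ms per-layer compute time are assumptions, not measurements):

```python
# Toy model of double-buffered layer streaming: while the GPU computes layer i,
# the weights of layer i+1 are uploaded over PCIe.  With perfect overlap,
# each layer costs max(upload_time, compute_time).
n_layers = 80                        # assumed Falcon 180B transformer block count
bytes_per_layer = 101.4e9 / n_layers # even split of the ~101.4 GB of weights
pcie_bw = 32e9                       # PCIe Gen 4 x16, bytes/sec (theoretical)
compute_per_layer = 0.002            # hypothetical GPU compute time per layer, sec

upload_per_layer = bytes_per_layer / pcie_bw
token_time = sum(max(upload_per_layer, compute_per_layer) for _ in range(n_layers))
print(f"upload/layer: {upload_per_layer * 1e3:.1f} ms, "
      f"pipelined token time: {token_time:.2f} s "
      f"({1.0 / token_time:.2f} tokens/sec)")
# Under these assumed numbers the transfer (~40 ms/layer) dominates the compute
# (~2 ms/layer), so pipelining hides the GPU work but not the PCIe traffic, and
# throughput stays near the bandwidth bound from the reply above.
```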