-
You can calculate this yourself relatively easily. The main bottleneck is getting the data onto the GPU. Each token requires traversing every layer in the model, so essentially all of the weights are needed. The Q4_K_S Falcon 180B is about 101.4GB. A PCIe Gen 4 x16 slot has a maximum bandwidth of 32GB/sec. So even if the actual calculation happened instantly and you never had to transfer data back to RAM, the best you could get is about 0.32 tokens/sec (one token every ~3.2 seconds). (It might be a tiny bit better than that, since you could keep 5 layers resident on the GPU and only stream in the rest.) This is a rough back-of-the-envelope calculation. Anyway, I guess the TL;DR is: wait for PCIe Gen 6-7. Gen 6 should have a max bandwidth of 128GB/sec.
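Putting the same numbers into a few lines of Python (just the arithmetic above, using theoretical x16 bus bandwidths, ignoring protocol overhead):

```python
# Back-of-the-envelope bound on streamed-weight token rate.
# Figures are the ones quoted above; real-world PCIe throughput is somewhat lower.
model_bytes = 101.4e9      # Q4_K_S Falcon 180B, ~101.4 GB of weights
buses = {
    "PCIe Gen 4 x16": 32e9,    # bytes/sec, theoretical
    "PCIe Gen 6 x16": 128e9,   # bytes/sec, theoretical
}

for name, bw in buses.items():
    sec_per_token = model_bytes / bw   # every weight crosses the bus once per token
    print(f"{name}: {sec_per_token:.2f} s/token ({1.0 / sec_per_token:.2f} tokens/sec)")
# Gen 4: ~3.2 s/token (~0.32 tok/s); Gen 6: ~0.8 s/token (~1.26 tok/s)
```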
-
I have a 9900K with 128GB RAM and an RTX 4070 with 12GB VRAM. llama.cpp runs the Falcon 180B chat model (4-bit quantized, from TheBloke's last upload), but it only allows offloading 6 layers to the GPU, resulting in about 0.3 to 0.5 tokens/sec of throughput while the GPU sits there mostly idle, which is not usable. The feasibility question I have is: would it be possible to dynamically offload every layer of the model onto the GPU for compute by pipelining the streaming of each layer's weights into the GPU with the GPU number-crunching of the layers? If feasible, I think there is potential for around a 10x speedup (5 tok/sec) on my setup if the GPU is kept fully loaded and the CPU is not relied on for any compute, which would make it actually usable.
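For a rough sense of what perfect overlap of upload and compute would buy, here is a toy timing model of the proposed pipelining (a sketch, not llama.cpp code; the 80-layer count and the 2 ms per-layer compute time are assumptions, not measurements):

```python
# Toy model of double-buffered layer streaming: while the GPU computes layer i,
# the weights of layer i+1 are uploaded over PCIe.  With perfect overlap,
# each layer costs max(upload_time, compute_time).
n_layers = 80                        # assumed Falcon 180B transformer block count
bytes_per_layer = 101.4e9 / n_layers # even split of the ~101.4 GB of weights
pcie_bw = 32e9                       # PCIe Gen 4 x16, bytes/sec (theoretical)
compute_per_layer = 0.002            # hypothetical GPU compute time per layer, sec

upload_per_layer = bytes_per_layer / pcie_bw
token_time = sum(max(upload_per_layer, compute_per_layer) for _ in range(n_layers))
print(f"upload/layer: {upload_per_layer * 1e3:.1f} ms, "
      f"pipelined token time: {token_time:.2f} s "
      f"({1.0 / token_time:.2f} tokens/sec)")
# Under these assumed numbers the transfer (~40 ms/layer) dominates the compute
# (~2 ms/layer), so pipelining hides the GPU work but not the PCIe traffic, and
# throughput stays near the bandwidth bound from the reply above.
```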