(Falcon 180B) Flexgen throughput strategy w/ everyman desktop combo + Quant's?? #3047
-
Without looking at FlexGen, I can tell you that for prompt processing scenarios the model does not need to be fully loaded in RAM. If the batch size is large enough and the SSD/PCIe link is fast enough, the model can be streamed for processing.
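A quick back-of-envelope sketch of that amortization effect (Python; the model size, SSD bandwidth, and batch size are illustrative assumptions, not figures from this thread):

```python
# Why a big batch makes weight streaming viable for prompt processing:
# the same pass over the weights is shared by every token in the batch.
weights_gb   = 75.0   # assumed total quantized weight data streamed once per pass
ssd_gb_s     = 7.0    # assumed sustained SSD -> RAM read bandwidth
batch_tokens = 4096   # assumed prompt tokens evaluated per pass over the weights

seconds_per_pass  = weights_gb / ssd_gb_s             # ~10.7 s to stream the model once
tokens_per_second = batch_tokens / seconds_per_pass   # streaming cost amortized over the batch
print(f"{seconds_per_pass:.1f} s per pass, ~{tokens_per_second:.0f} prompt tokens/s")
```

The point is that prompt processing can stay far above the per-token generation rate, because the streaming cost is paid once per batch rather than once per token.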
-
You can basically calculate how slow it would be. A Q2_K 70B LLaMA2 model is ~28.6GB - let's just call it 29GB. Scaling up to 180B: 70 * 2.58 = 180.6, so 29GB * 2.58 = 74.82GB. Let's say 75GB, plus a couple GB for stuff like the KV cache. Whatever you can't fit in your RAM + VRAM will have to be streamed from storage. If you want to evaluate everything on the GPU (rather than offloading X layers and running the rest on CPU), it will have to be streamed to the GPU. If your system only has 32GB RAM, then with a 3090 that gives you 32 + 24 = 56GB to work with (probably more like 50GB after the OS and other apps).

To determine how fast you can generate tokens, you can just calculate storage speed + PCIe link speed for the data that has to be streamed. It's almost certainly going to be limited by that, not by actually calculating the result. On the other hand, if your system has 64GB RAM, then 64 + 24 = 88GB and you can probably fit the whole 180B Q2_K model in memory, if you don't mind the X-layers-on-GPU, X-layers-on-CPU approach I mentioned. I'm pretty sure llama.cpp doesn't support streaming everything to the GPU - probably because it would actually be slower than just running those layers on CPU.

This isn't an exact answer, but it should be in the ballpark. So the short answer is that it looks pretty feasible for a machine with 64GB RAM and a 3090. With DDR5 memory and a fast processor, 1 t/s doesn't sound that far-fetched either.

edit: A little more information on why streaming everything to the GPU would be slower: PCIe Gen4 x16 has a theoretical max transfer rate of 32GB/sec, so if you streamed the whole 74.8GB model to the GPU once per token, you could do that at most once every 2.33sec. Since you could leave some layers resident even while streaming, it wouldn't be quite that bad, but it's easy to see how the PCIe link speed alone would make 1 t/s impossible. And that's assuming you hit exactly the theoretical maximum, which probably isn't realistic.
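A minimal sketch of the same arithmetic, so the numbers can be re-run for other hardware (Python; the overhead and SSD bandwidth values are my own assumptions, everything else follows the comment above):

```python
# The comment's feasibility math as a reusable script. All sizes in GB.
model_gb    = 29 * (180.6 / 70)   # ~74.8 GB: 70B Q2_K (~29 GB) scaled to 180B
ram_gb      = 64                  # try 32 for the smaller build
vram_gb     = 24                  # e.g. an RTX 3090
overhead_gb = 6                   # OS, other apps, KV cache (assumed)

budget_gb = ram_gb + vram_gb - overhead_gb
stream_gb = max(0.0, model_gb - budget_gb)   # data that must come from storage per token

pcie_gb_s = 32.0                     # PCIe Gen4 x16 theoretical max
ssd_gb_s  = 7.0                      # fast NVMe read speed (assumed)
link_gb_s = min(pcie_gb_s, ssd_gb_s) # streaming is bounded by the slower link

if stream_gb == 0:
    print("Whole model fits in RAM + VRAM; speed is bounded by compute/memory bandwidth")
else:
    print(f"Must stream {stream_gb:.1f} GB per token -> "
          f"at most {link_gb_s / stream_gb:.2f} t/s from streaming alone")
```

With 64GB RAM the model fits and the streaming limit disappears; with 32GB the script predicts well under 1 t/s from storage bandwidth alone, which matches the comment's conclusion.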
-
I would like to add: if you want to stretch the capabilities of your hardware and you run Linux, either without a desktop environment or by simply stopping the desktop environment service and working directly from the console, you can squeeze out quite a bit of performance. RAM usage drops, but more importantly you get the entire GPU for exclusive access, and idle VRAM usage is literally 0, which is never the case on Windows or on Linux with a desktop environment running.
-
Hi, this project sees 1 t/s for a GPT-3 (ChatGPT) sized model, such as OPT-175B.
project: https://github.com/FMInference/FlexGen.
It supposedly requires 400GB of RAM and uses a single 3090 to run it in f16.
I'm currently wondering if this could run Falcon 180B well at Q2_K (or any quant) on an everyman desktop PC setup: 32-64GB RAM and a single 3090. Could we use a FlexGen strategy for improvements in speed?
Is such a strategy unviable? Does offloading layers have the ability to work like FlexGen?