(Falcon 180B) Flexgen throughput strategy w/ everyman desktop combo + Quant's?? #3047
-
Without looking at FlexGen, I can tell you that for prompt processing scenarios the model does not need to be fully loaded in RAM. If the batch size is large enough and the SSD/PCIe link is fast enough, the model can be streamed for processing.
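A quick back-of-envelope sketch of that amortization effect (Python; the model size, SSD bandwidth, and batch size are illustrative assumptions, not figures from this thread):

```python
# Why a big batch makes weight streaming viable for prompt processing:
# the same pass over the weights is shared by every token in the batch.
weights_gb   = 75.0   # assumed total quantized weight data streamed once per pass
ssd_gb_s     = 7.0    # assumed sustained SSD -> RAM read bandwidth
batch_tokens = 4096   # assumed prompt tokens evaluated per pass over the weights

seconds_per_pass  = weights_gb / ssd_gb_s             # ~10.7 s to stream the model once
tokens_per_second = batch_tokens / seconds_per_pass   # streaming cost amortized over the batch
print(f"{seconds_per_pass:.1f} s per pass, ~{tokens_per_second:.0f} prompt tokens/s")
```

The point is that prompt processing can stay far above the per-token generation rate, because the streaming cost is paid once per batch rather than once per token.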
-
You can basically calculate how slow it would be. A Q2_K 70B LLaMA2 model is ~28.6GB - let's just call it 29GB. Scaling up to 180B: 70 * 2.58 = 180.6, so 29GB * 2.58 = 74.82GB. Let's say 75GB, plus a couple GB for stuff like the KV cache. Whatever you can't fit in your RAM + VRAM will have to be streamed from storage. If you want to evaluate everything on the GPU (rather than offloading X layers and running the rest on CPU), it will have to be streamed to the GPU. If your system only has 32GB RAM, then with a 3090 that gives you 32 + 24 = 56GB to work with (probably more like 50GB after the OS and other apps).

To determine how fast you can generate tokens, you can just calculate storage speed + PCIe link speed for the data that has to be streamed. It's almost certainly going to be limited by that, not by actually calculating the result. On the other hand, if your system has 64GB RAM, then 64 + 24 = 88GB and you can probably fit the whole 180B Q2_K model in memory, if you don't mind the X-layers-on-GPU, X-layers-on-CPU approach I mentioned. I'm pretty sure llama.cpp doesn't support streaming everything to the GPU - probably because it would actually be slower than just running those layers on CPU.

This isn't an exact answer, but it should be in the ballpark. So the short answer is that it looks pretty feasible for a machine with 64GB RAM and a 3090. With DDR5 memory and a fast processor, 1 t/s doesn't sound that far-fetched either.

edit: A little more information on why streaming everything to the GPU would be slower: PCIe Gen4 x16 has a theoretical max transfer rate of 32GB/sec, so if you streamed the whole 74.8GB model to the GPU once per token, you could do that at most once every 2.33sec. Since you could leave some layers resident even while streaming, it wouldn't be quite that bad, but it's easy to see how the PCIe link speed alone would make 1 t/s impossible. And that's assuming you hit exactly the theoretical maximum, which probably isn't realistic.
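A minimal sketch of the same arithmetic, so the numbers can be re-run for other hardware (Python; the overhead and SSD bandwidth values are my own assumptions, everything else follows the comment above):

```python
# The comment's feasibility math as a reusable script. All sizes in GB.
model_gb    = 29 * (180.6 / 70)   # ~74.8 GB: 70B Q2_K (~29 GB) scaled to 180B
ram_gb      = 64                  # try 32 for the smaller build
vram_gb     = 24                  # e.g. an RTX 3090
overhead_gb = 6                   # OS, other apps, KV cache (assumed)

budget_gb = ram_gb + vram_gb - overhead_gb
stream_gb = max(0.0, model_gb - budget_gb)   # data that must come from storage per token

pcie_gb_s = 32.0                     # PCIe Gen4 x16 theoretical max
ssd_gb_s  = 7.0                      # fast NVMe read speed (assumed)
link_gb_s = min(pcie_gb_s, ssd_gb_s) # streaming is bounded by the slower link

if stream_gb == 0:
    print("Whole model fits in RAM + VRAM; speed is bounded by compute/memory bandwidth")
else:
    print(f"Must stream {stream_gb:.1f} GB per token -> "
          f"at most {link_gb_s / stream_gb:.2f} t/s from streaming alone")
```

With 64GB RAM the model fits and the streaming limit disappears; with 32GB the script predicts well under 1 t/s from storage bandwidth alone, which matches the comment's conclusion.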
-
I would like to add: if you want to stretch the capabilities of your hardware and you run Linux, either without a desktop environment or by simply stopping the desktop environment service and working directly from the console, you can squeeze out quite a bit of performance. RAM usage drops, but more importantly you get the entire GPU for exclusive access, and idle VRAM usage is literally 0, which is never the case on Windows or on Linux with a desktop environment running.
-
Hi, this project sees 1 t/s for a GPT-3 (ChatGPT) sized model, such as OPT-175B.
project: https://github.com/FMInference/FlexGen.
It supposedly requires 400GB of RAM and uses a single 3090 to run it in f16.
I'm currently wondering if this could run Falcon 180B well at Q2_K (or any quant) on an everyman desktop PC setup: 32-64GB RAM and a single 3090. Could we use a FlexGen strategy for improvements in speed?
Is such a strategy unviable? Does offloading layers have the ability to work like FlexGen?