Hardware specs for GGUF 7B/13B/30B parameter models #3847
-
Hardware requirements for 7B quantized models are very modest. For how much memory you need, look at the model file sizes for a rough estimate, then add a few GB on top; long context sizes will use more. Example (using one of the best available 7B models): https://huggingface.co/TheBloke/dolphin-2.1-mistral-7B-GGUF/tree/main

I tried a Q4_K_M model on my tiny Raspberry Pi 4 with 8GB RAM and got 0.75 tokens/sec. Full GPU offloading on an AMD Radeon RX 6600 (cheap, ~$200 USD) GPU with 8GB VRAM: 33 tokens/sec. CPU only on a Ryzen 5700G (~$175 USD): 11 tokens/sec. That's a previous-gen CPU, so it only uses DDR4 RAM. What I have is fairly fast (CMK64GX4M2D3600C18 - Corsair DDR4-3600 overclocked to 4000); it costs about ~$120 USD, and with 64GB RAM you can run up to 70B models (I get about 0.8 tokens/sec). Having even a fairly weak GPU is helpful, even if you can't offload much, since it really speeds up processing long prompts.

Anyway, the requirements for 5 tokens/sec on 7B models are very modest. If you want to be able to run larger models like 70B, even slowly, then RAM is probably the main factor. You can probably build an adequate system for around $600.
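To make the "file size plus a few GB" rule of thumb concrete, here is a minimal sketch; the KV-cache and overhead figures are rough assumptions for illustration, not exact llama.cpp accounting:

```python
# Rough RAM estimate for running a GGUF model, following the
# "file size plus a few GB" rule of thumb above.
# The KV-cache and overhead numbers are assumptions, not exact llama.cpp figures.

def estimate_ram_gb(gguf_file_gb: float, ctx_tokens: int = 4096,
                    kv_gb_per_4k_ctx: float = 0.5,
                    runtime_overhead_gb: float = 1.0) -> float:
    """Weights (~= GGUF file size) + KV cache (grows with context) + runtime overhead."""
    kv_cache_gb = kv_gb_per_4k_ctx * (ctx_tokens / 4096)
    return gguf_file_gb + kv_cache_gb + runtime_overhead_gb

# Example: dolphin-2.1-mistral-7B Q4_K_M is roughly a 4.4 GB file.
print(f"~{estimate_ram_gb(4.4, ctx_tokens=8192):.1f} GB RAM")  # ~6.4 GB
```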
-
I see no support for 1-bit quantization (which is not yet mainstream) in the GGUF format, but I see it claimed that the format promises "no more breaking changes": https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md

I see Microsoft's 1-bit BitNet approach as the future. It promises 4 times larger models than 4-bit quantization (or 3 times larger than 3-bit, 16 times larger than 16-bit, etc.) for the same memory budget, or alternatively 1/4th the memory, if I understand correctly. E.g. potentially 140B models on 32 GB RAM. Since it's just bits, not much hardware support is needed; maybe not even 16-bit float support (or it can be emulated) or anything GPU-specific, i.e. mostly bit-level instructions CPUs are already very good at. It's up to 284 times more energy-efficient for some operations, or at least up to 86 times more (vs. 16-bit floats), depending on which metric in Table 1 is the relevant one. While the paper doesn't mention CPUs (or GPUs), I find it very likely this will help for running on CPUs as well.

While I think 1-bit is not supported in the GGUF file format (nor likely in other formats), it seems it could be extended for that, so this doesn't apply to the models here just yet; I still thought I should bring it up as a heads-up. I know there's a standardized format for compressed neural-network weights, but has it been outdated for some time already by e.g. GGUF? I want to know what the most used format is currently, and which is best for the future. With 1-bit networks I don't really see additional compression being possible, except maybe run-length encoding, but I doubt even that.
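The memory arithmetic behind the "140B on 32 GB" figure is just bits per weight; a minimal sketch (ignoring KV cache and runtime overhead, which add a few more GB):

```python
# Weight memory as a function of parameter count and bits per weight.
# Ignores KV cache, activations, and runtime overhead.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # GB, since 8 bits = 1 byte

for bits in (16, 4, 1):
    print(f"140B at {bits}-bit: {weight_memory_gb(140, bits):.1f} GB")
# 16-bit: 280.0 GB, 4-bit: 70.0 GB, 1-bit: 17.5 GB -> 1-bit weights fit in 32 GB RAM
```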
-
Hi, I am thinking of trying to find the most optimal build, by cost of purchase plus power consumption, to run a 7B GGUF model (Mistral 7B, etc.) at 4-5 tokens/s. I would like to ask what sort of CPU, RAM, etc. I should look at. I would appreciate it if someone could explain which configuration llama.cpp is supposed to work best with. For example, I have heard that RAM frequency and single-thread performance matter more than other things.
Please describe in detail which hardware specs are relevant, by how much, and what their optimal configuration is, plus some popular tested examples of CPU models, RAM speeds, etc.
Thanks.
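One rough way to reason about this: CPU-only generation is usually memory-bandwidth-bound, so tokens/s is roughly capped at usable RAM bandwidth divided by the quantized model size. A sketch, where the bandwidth and efficiency figures are assumptions rather than measurements:

```python
# Rough upper bound on CPU-only generation speed, assuming token generation
# is memory-bandwidth-bound (each token streams all quantized weights once).
# The efficiency factor and example numbers are assumptions, not measurements.

def max_tokens_per_sec(model_gb: float, ram_gbps: float, efficiency: float = 0.7) -> float:
    return ram_gbps * efficiency / model_gb

# Dual-channel DDR4-3600 is ~57.6 GB/s theoretical peak; a 7B Q4_K_M file is ~4.4 GB.
print(f"~{max_tokens_per_sec(4.4, 57.6):.0f} tokens/s upper bound")  # ~9
```

By this estimate, any recent dual-channel DDR4 desktop should clear 4-5 tokens/s on a 4-bit 7B model, which matches the numbers reported above.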