Hardware specs for GGUF 7B/13B/30B parameter models #3847
-
Hardware requirements for 7B quantized models are very modest. For how much memory you need, look at the model file sizes for a rough estimate, then add a few GB on top; long context sizes will use more. Example (using one of the best available 7B models): https://huggingface.co/TheBloke/dolphin-2.1-mistral-7B-GGUF/tree/main

I tried a Q4_K_M model on my tiny Raspberry Pi 4 with 8GB RAM and got 0.75 tokens/sec. Full GPU offloading on an AMD Radeon RX 6600 (cheap, ~$200 USD) GPU with 8GB VRAM: 33 tokens/sec. CPU only on a Ryzen 5700G (~$175 USD): 11 tokens/sec. That's a previous-gen CPU, so it only uses DDR4 RAM. What I have is fairly fast (CMK64GX4M2D3600C18 - Corsair DDR4-3600 overclocked to 4000); it costs about ~$120 USD, and with 64GB RAM you can run up to 70B models (I get about 0.8 tokens/sec). Having even a fairly weak GPU is helpful, even if you can't offload much, since it really speeds up processing long prompts.

Anyway, the requirements for 5 tokens/sec on 7B models are very modest. If you want to be able to run larger models like 70B, even slowly, then RAM is probably the main factor. You can probably build an adequate system for around $600.
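To make the "file size plus a few GB" rule of thumb concrete, here is a minimal sketch; the KV-cache and overhead figures are rough assumptions for illustration, not exact llama.cpp accounting:

```python
# Rough RAM estimate for running a GGUF model, following the
# "file size plus a few GB" rule of thumb above.
# The KV-cache and overhead numbers are assumptions, not exact llama.cpp figures.

def estimate_ram_gb(gguf_file_gb: float, ctx_tokens: int = 4096,
                    kv_gb_per_4k_ctx: float = 0.5,
                    runtime_overhead_gb: float = 1.0) -> float:
    """Weights (~= GGUF file size) + KV cache (grows with context) + runtime overhead."""
    kv_cache_gb = kv_gb_per_4k_ctx * (ctx_tokens / 4096)
    return gguf_file_gb + kv_cache_gb + runtime_overhead_gb

# Example: dolphin-2.1-mistral-7B Q4_K_M is roughly a 4.4 GB file.
print(f"~{estimate_ram_gb(4.4, ctx_tokens=8192):.1f} GB RAM")  # ~6.4 GB
```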
-
I see no support for 1-bit quantization (which is not yet mainstream) in the GGUF format, but I see it claimed that the format promises "no more breaking changes": https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md

I see Microsoft's 1-bit BitNet approach as the future. It promises 4 times larger models than 4-bit quantization (or 3 times larger than 3-bit, 16 times larger than 16-bit, etc.) for the same memory budget, or alternatively 1/4th the memory, if I understand correctly. E.g. potentially 140B models on 32 GB RAM. Since it's just bits, not much hardware support is needed; maybe not even 16-bit float support (or it can be emulated) or anything GPU-specific, i.e. mostly bit-level instructions CPUs are already very good at. It's up to 284 times more energy-efficient for some operations, or at least up to 86 times more (vs. 16-bit floats), depending on which metric in Table 1 is the relevant one. While the paper doesn't mention CPUs (or GPUs), I find it very likely this will help for running on CPUs as well.

While I think 1-bit is not supported in the GGUF file format (nor likely in other formats), it seems it could be extended for that, so this doesn't apply to the models here just yet; I still thought I should bring it up as a heads-up. I know there's a standardized format for compressed neural-network weights, but has it been outdated for some time already by e.g. GGUF? I want to know what the most used format is currently, and which is best for the future. With 1-bit networks I don't really see additional compression being possible, except maybe run-length encoding, but I doubt even that.
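The memory arithmetic behind the "140B on 32 GB" figure is just bits per weight; a minimal sketch (ignoring KV cache and runtime overhead, which add a few more GB):

```python
# Weight memory as a function of parameter count and bits per weight.
# Ignores KV cache, activations, and runtime overhead.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # GB, since 8 bits = 1 byte

for bits in (16, 4, 1):
    print(f"140B at {bits}-bit: {weight_memory_gb(140, bits):.1f} GB")
# 16-bit: 280.0 GB, 4-bit: 70.0 GB, 1-bit: 17.5 GB -> 1-bit weights fit in 32 GB RAM
```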
-
Hi, I am thinking of trying to find the most optimal build, by cost of purchase plus power consumption, to run a 7B GGUF model (Mistral 7B, etc.) at 4-5 tokens/s. I would like to ask what sort of CPU, RAM, etc. I should look at. I would appreciate it if someone could explain which configuration llama.cpp is supposed to work best with. For example, I have heard that RAM frequency and single-thread performance matter more than other things.
Please describe in detail which hardware specs are relevant, by how much, and what their optimal configuration is, plus some popular tested examples of CPU models, RAM speeds, etc.
Thanks.
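One rough way to reason about this: CPU-only generation is usually memory-bandwidth-bound, so tokens/s is roughly capped at usable RAM bandwidth divided by the quantized model size. A sketch, where the bandwidth and efficiency figures are assumptions rather than measurements:

```python
# Rough upper bound on CPU-only generation speed, assuming token generation
# is memory-bandwidth-bound (each token streams all quantized weights once).
# The efficiency factor and example numbers are assumptions, not measurements.

def max_tokens_per_sec(model_gb: float, ram_gbps: float, efficiency: float = 0.7) -> float:
    return ram_gbps * efficiency / model_gb

# Dual-channel DDR4-3600 is ~57.6 GB/s theoretical peak; a 7B Q4_K_M file is ~4.4 GB.
print(f"~{max_tokens_per_sec(4.4, 57.6):.0f} tokens/s upper bound")  # ~9
```

By this estimate, any recent dual-channel DDR4 desktop should clear 4-5 tokens/s on a 4-bit 7B model, which matches the numbers reported above.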