1.58-bit BitNets - a new opportunity for llamafile? #313
Replies: 4 comments 10 replies
-
Is there any news?
-
I have now submitted #552, which adds support for these ternary models. In terms of recommending a really good model: the ternary models released so far are just toys, and I haven't done much experimentation, so it is hard to make a recommendation. My guess is that it is best to go with the largest TriLM model. It has 4B parameters, but with #552 it quantizes to 1.31 GiB and has very decent inference speed, so it can be a viable option even for low-end devices.
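As a rough sanity check on those numbers (a sketch only; the "4B" parameter count is approximate), 1.31 GiB for ~4 billion parameters works out to about 2.8 bits per weight, which is plausible for a ~2-bit ternary encoding with a few tensors kept at higher precision:

```python
# Back-of-envelope bits-per-weight from the quoted figures.
params = 4e9                       # largest TriLM model, approximate parameter count
size_bytes = 1.31 * 1024**3        # quoted quantized size: 1.31 GiB
bits_per_weight = size_bytes * 8 / params
print(f"{bits_per_weight:.2f} bits per weight")   # ~2.81
```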
-
I was (and still am) skeptical, for a reason. Here is the performance quoted in the T-MAC repository for the 3B Bitnet-1.58b model on an M2-Ultra (I have copy/pasted the graph from the T-MAC repository here for convenience).

I don't have an M2-Ultra, but I do have an M2-Max laptop (so basically half of an M2-Ultra). Here is what I get using #552: very similar performance to T-MAC for 1-3 threads, but then, instead of saturating at ~60-65 tokens/second as they do, we a) get 99 t/s at 8 threads (50+% faster than T-MAC), and b) performance does not look at all like it is saturating the way it does with T-MAC, so I wouldn't be surprised if we got 150 t/s on an M2-Ultra with 16 threads (2.5X T-MAC). T-MAC saturates because the threads start fighting for the available bandwidth to load values from the lookup table(s).
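To illustrate why a lookup-table approach can become bandwidth-bound, here is a minimal Python sketch of LUT-based ternary dot products. This is my own toy construction, not T-MAC's or #552's actual kernel layout: I assume groups of 4 ternary weights packed into one byte (2 bits each), and for every group of 4 activations a 256-entry table of partial sums is precomputed so the inner loop is just table lookups. In a real kernel the tables are built once per activation block and reused across many weight rows, and at high thread counts those table loads compete for memory bandwidth, which matches the saturation described above.

```python
import numpy as np

def pack_ternary(w):
    """Pack groups of 4 ternary weights {-1, 0, +1} into one byte (2 bits each)."""
    w = np.asarray(w).reshape(-1, 4)
    codes = (w + 1).astype(np.uint8)          # map {-1, 0, +1} -> {0, 1, 2}
    return (codes[:, 0] | (codes[:, 1] << 2) |
            (codes[:, 2] << 4) | (codes[:, 3] << 6))

def build_lut(x_group):
    """For one group of 4 activations, precompute the partial dot product
    for every possible packed weight byte (256 entries, 81 of them valid)."""
    lut = np.zeros(256, dtype=np.float32)
    for byte in range(256):
        acc = 0.0
        for j in range(4):
            code = (byte >> (2 * j)) & 3      # 2-bit field -> weight code
            if code < 3:
                acc += (code - 1) * x_group[j]
        lut[byte] = acc
    return lut

def lut_dot(packed_w, x):
    """Dot product of a ternary weight row with activations x via table lookups."""
    x = np.asarray(x, dtype=np.float32).reshape(-1, 4)
    return sum(build_lut(xg)[pw] for pw, xg in zip(packed_w, x))

# Quick check against a direct dot product.
rng = np.random.default_rng(0)
w = rng.integers(-1, 2, size=32)
x = rng.standard_normal(32).astype(np.float32)
assert np.isclose(lut_dot(pack_ternary(w), x), float(np.dot(w, x)), atol=1e-4)
```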
-
Any support for a 1.58-bit Falcon-based model?
-
BitNets are the most exciting thing happening for LLMs right now. @jart - llamafile can become the BitNet leader if you get in early! The big advantages are:
Check out these resources: