Replies: 4 comments 18 replies
-
Can you provide a TL;DR of how this works?
-
@ggerganov
Overall, SparseGPT induces only a minor increase in perplexity post-pruning (less than or equal to 1.5), indicating that it can achieve significant weight reduction without sacrificing much accuracy.
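For reference, a minimal sketch of how such a perplexity comparison could be run with Hugging Face transformers (the checkpoint paths and the `wikitext_test` string are placeholders, not what the paper used):

```python
# Minimal sketch only -- checkpoint paths and the evaluation text are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_path, text, max_len=2048, device="cuda"):
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16
    ).to(device).eval()
    ids = tok(text, return_tensors="pt").input_ids.to(device)
    total_nll, n_tokens = 0.0, 0
    with torch.no_grad():
        for i in range(0, ids.size(1), max_len):
            chunk = ids[:, i : i + max_len]
            if chunk.size(1) < 2:
                continue
            # passing labels=input_ids makes the model return the mean next-token NLL
            loss = model(chunk, labels=chunk).loss
            total_nll += loss.item() * (chunk.size(1) - 1)
            n_tokens += chunk.size(1) - 1
    return math.exp(total_nll / n_tokens)

# ppl_dense  = perplexity("path/to/llama-7b-hf", wikitext_test)
# ppl_pruned = perplexity("path/to/sparsegpt-pruned-llama-7b", wikitext_test)
```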
-
SparseGPT performs unstructured zeroing of weights, so sparse-aware kernels or model compilation are needed to actually realize the gains. Structured pruning is available at https://github.com/horseee/LLaMA-Pruning, whose results llama.cpp could likely use more directly. I'm not sure whether that implementation or the integration at https://github.com/VainF/Torch-Pruning is better, or whether they are identical.
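To illustrate the distinction (a toy magnitude-based sketch in plain PyTorch, not the actual SparseGPT or Torch-Pruning code): unstructured pruning zeroes individual weights but keeps the tensor shape, while structured pruning removes whole rows/channels so the dense matmuls shrink.

```python
# Toy illustration only -- magnitude-based, not the SparseGPT algorithm.
import torch

W = torch.randn(4096, 4096)  # one dense weight matrix, e.g. an attention projection

# Unstructured pruning: zero the 50% smallest-magnitude entries.
# Shape is unchanged, so a dense runtime does the same work unless its
# kernels (or a compiler) exploit the zero pattern.
thresh = W.abs().flatten().kthvalue(W.numel() // 2).values
W_unstructured = torch.where(W.abs() > thresh, W, torch.zeros_like(W))
print(W_unstructured.shape)  # torch.Size([4096, 4096])

# Structured pruning: drop the 50% of output rows with the smallest L2 norm.
# The tensor physically shrinks, so dense matmuls (llama.cpp style) speed up directly.
keep = W.norm(dim=1).argsort(descending=True)[: W.size(0) // 2]
W_structured = W[keep.sort().values]
print(W_structured.shape)  # torch.Size([2048, 4096])
```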
-
Hello,
https://github.com/AlpinDale/sparsegpt-for-LLaMA
https://arxiv.org/abs/2301.00774
"We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models."
It looks like someone has implemented SparseGPT for the LLaMA models. If I understand correctly, that means we can roughly halve the effective size of the LLaMA models without significant loss of accuracy.
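One caveat, if I understand it right: 50% unstructured sparsity means half the weights become zero, but a dense fp16 checkpoint only shrinks if those zeros are actually stored in a sparse or compressed format. A rough sanity check on a pruned checkpoint might look like this (the path is a placeholder):

```python
# Rough sanity check only -- the checkpoint path is a placeholder.
import torch

state = torch.load("path/to/sparsegpt-pruned-llama-7b.pth", map_location="cpu")
total = zeros = 0
for name, tensor in state.items():
    if torch.is_floating_point(tensor):
        total += tensor.numel()
        zeros += (tensor == 0).sum().item()
print(f"overall weight sparsity: {zeros / total:.2%}")
```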
I'd like to know what you think about it, and whether you plan to test its perplexity vs. a "normal" unpruned LLaMA model.
PS: In less than a month, 65B Llama will work on the super nintendo 😄