Replies: 4 comments 18 replies
-
Can you provide a TL;DR of how this works?
-
@ggerganov
Overall, SparseGPT induces only a minor increase in perplexity post-pruning (less than or equal to 1.5), indicating that it can achieve significant weight reduction without sacrificing much accuracy.
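For reference, a minimal sketch of how such a perplexity comparison could be run with Hugging Face transformers (the checkpoint paths and the `wikitext_test` string are placeholders, not what the paper used):

```python
# Minimal sketch only -- checkpoint paths and the evaluation text are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_path, text, max_len=2048, device="cuda"):
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16
    ).to(device).eval()
    ids = tok(text, return_tensors="pt").input_ids.to(device)
    total_nll, n_tokens = 0.0, 0
    with torch.no_grad():
        for i in range(0, ids.size(1), max_len):
            chunk = ids[:, i : i + max_len]
            if chunk.size(1) < 2:
                continue
            # passing labels=input_ids makes the model return the mean next-token NLL
            loss = model(chunk, labels=chunk).loss
            total_nll += loss.item() * (chunk.size(1) - 1)
            n_tokens += chunk.size(1) - 1
    return math.exp(total_nll / n_tokens)

# ppl_dense  = perplexity("path/to/llama-7b-hf", wikitext_test)
# ppl_pruned = perplexity("path/to/sparsegpt-pruned-llama-7b", wikitext_test)
```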
-
SparseGPT performs unstructured zeroing of weights, so sparse-aware kernels or model compilation are needed to actually realize the gains. Structured pruning is available at https://github.com/horseee/LLaMA-Pruning, whose results llama.cpp could likely use more directly. I'm not sure whether that implementation or the integration at https://github.com/VainF/Torch-Pruning is better, or whether they are identical.
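To illustrate the distinction (a toy magnitude-based sketch in plain PyTorch, not the actual SparseGPT or Torch-Pruning code): unstructured pruning zeroes individual weights but keeps the tensor shape, while structured pruning removes whole rows/channels so the dense matmuls shrink.

```python
# Toy illustration only -- magnitude-based, not the SparseGPT algorithm.
import torch

W = torch.randn(4096, 4096)  # one dense weight matrix, e.g. an attention projection

# Unstructured pruning: zero the 50% smallest-magnitude entries.
# Shape is unchanged, so a dense runtime does the same work unless its
# kernels (or a compiler) exploit the zero pattern.
thresh = W.abs().flatten().kthvalue(W.numel() // 2).values
W_unstructured = torch.where(W.abs() > thresh, W, torch.zeros_like(W))
print(W_unstructured.shape)  # torch.Size([4096, 4096])

# Structured pruning: drop the 50% of output rows with the smallest L2 norm.
# The tensor physically shrinks, so dense matmuls (llama.cpp style) speed up directly.
keep = W.norm(dim=1).argsort(descending=True)[: W.size(0) // 2]
W_structured = W[keep.sort().values]
print(W_structured.shape)  # torch.Size([2048, 4096])
```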
-
Hello,
https://github.com/AlpinDale/sparsegpt-for-LLaMA
https://arxiv.org/abs/2301.00774
"We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models."
It looks like someone has implemented SparseGPT for the LLaMA models. If I understand correctly, that means we can roughly halve the effective size of the LLaMA models without significant loss of accuracy.
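One caveat, if I understand it right: 50% unstructured sparsity means half the weights become zero, but a dense fp16 checkpoint only shrinks if those zeros are actually stored in a sparse or compressed format. A rough sanity check on a pruned checkpoint might look like this (the path is a placeholder):

```python
# Rough sanity check only -- the checkpoint path is a placeholder.
import torch

state = torch.load("path/to/sparsegpt-pruned-llama-7b.pth", map_location="cpu")
total = zeros = 0
for name, tensor in state.items():
    if torch.is_floating_point(tensor):
        total += tensor.numel()
        zeros += (tensor == 0).sum().item()
print(f"overall weight sparsity: {zeros / total:.2%}")
```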
I'd like to know what you think about it, and whether you plan to test its perplexity vs. a "normal" unpruned LLaMA model.
PS: In less than a month, 65B Llama will work on the super nintendo 😄