-
How about SqueezeLLM? Also, is llama.cpp still the fastest compared with exllama on GPU (quantized)?
-
This seems super weird. I'm not sure what he's trying to do by comparing perplexity alone without accounting for file size, performance, etc. It seems like it's mostly between 4-bit-ish quantizations, but it doesn't actually say that.

Also, he didn't run perplexity against the same corpus as other perplexity measurements: it was run against a ".txt input file containing some technical blog posts and papers that I collected. It is a lot smaller and faster to evaluate than wikitext, but I find that it correlates perfectly with bigger evaluations." The good old source: trust me bro.

"The perplexity of llama-65b in llama.cpp is indeed lower than for llama-30b in all other backends." - You can take out the "other" there, right? The perplexity of llama-65b in llama.cpp will indeed be lower than the perplexity of llama-30b in llama.cpp. If there weren't an advantage to a model more than twice as large, why would we bother to use it?
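For reference, perplexity is just the exponentiated average negative log-likelihood over the evaluation tokens, so the absolute number is tied to whichever corpus it was computed on:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(x_i \mid x_{<i}\right)\right)$$

That's why numbers measured on a private .txt of blog posts can't be compared directly against the usual wikitext figures, even if the rankings happen to correlate.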
-
@oobabooga is comparing the perplexity of different inference engines, and of course llama.cpp with K-quants seems to be in the lead.
https://oobabooga.github.io/blog/posts/perplexities/