Replies: 6 comments 2 replies
-
For a 13B LLaMA model quantized with …
-
Sidenote: the model you chose is pretty old; use …
-
@longregen - re benchmarking etc.: it seems that all inference testing (perplexity, soon HellaSwag scoring, and so on) and latency measurement is currently done manually by devs on their individual machines. Regression bugs would be no surprise, and hallucinations or faulty generations would be easy for C++ devs testing their own code to miss. IMO a better solution would be to automate this.
GG put the call out for "Azure CI" (which isn't what is actually required; the CI/CD piece is working fine now, and "MLOps benchmarking and inference quality testing" is what is needed). Likely immersed in the C++ code, he didn't respond to comms regarding the MLOps side of things, and then decided to DIY it in bash. See 6460067.

I've spent a good bit of time investigating the short- to medium-term MLOps needs going forward and have done two code spikes: a cloud-scale medium-term plan in node.js, llama-cpp-ci-bench, and a quick-fix Python tool, scorecard. Neither has gotten much interest. A fit-for-purpose MLOps setup requires more than a free-tier amount of cloud compute, and is more than a few evenings' work (more like a few weeks), so this is not a "just put in a PR" type of thing.

That said, as an open-source SOTA project, perhaps the culture and focus is more about pushing the envelope: proving out new territory and implementing the latest arXiv SOTA papers in C++. Cutting edge, caveat emptor. Yet the world is desperately wanting dependable and performant inference engines to build upon.
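For illustration, here is a minimal sketch of the kind of regression gate such a setup could start from: run the repo's perplexity tool against a fixed test file and fail the job if the result drifts past a stored baseline. The binary path, model file, baseline value, and output-parsing patterns are assumptions to adjust for a local build, not anything the project ships.

```python
#!/usr/bin/env python3
"""Minimal perplexity regression gate (sketch, not an official tool)."""
import re
import subprocess
import sys

PERPLEXITY_BIN = "./perplexity"                 # assumed path to the built tool
MODEL = "models/13B/ggml-model-q4_0.bin"        # placeholder model file
TEST_FILE = "wikitext-2-raw/wiki.test.raw"      # placeholder eval text
BASELINE_PPL = 5.90                             # example known-good value
TOLERANCE = 0.02                                # allow 2% drift before failing


def parse_perplexity(text: str) -> float:
    # Format assumption: newer builds print "Final estimate: PPL = x.xxxx",
    # older ones print running per-chunk values like "[655]5.9208," -- try both.
    m = re.search(r"Final estimate: PPL = ([\d.]+)", text)
    if m:
        return float(m.group(1))
    chunks = re.findall(r"\[\d+\]([\d.]+)", text)
    if chunks:
        return float(chunks[-1])
    raise RuntimeError("could not parse a perplexity value from the output")


def main() -> int:
    proc = subprocess.run(
        [PERPLEXITY_BIN, "-m", MODEL, "-f", TEST_FILE],
        capture_output=True, text=True, check=True,
    )
    ppl = parse_perplexity(proc.stdout + proc.stderr)
    limit = BASELINE_PPL * (1.0 + TOLERANCE)
    print(f"perplexity = {ppl:.4f} (baseline {BASELINE_PPL:.4f}, limit {limit:.4f})")
    return 0 if ppl <= limit else 1  # nonzero exit fails the CI job


if __name__ == "__main__":
    sys.exit(main())
```

Pinning the model file, eval data, and build flags is what makes the numbers comparable across runs; the same pattern would extend to HellaSwag scores or latency percentiles.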
-
If anyone is interested, these are my run times on ~beefy consumer hardware with current master:
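For anyone wanting to collect comparable numbers, here is a rough sketch that scripts the main example binary and scrapes its timing summary. The binary path, model files, flag values, and the exact wording of the timing lines are assumptions; the summary format has changed across versions, so the regex may need adjusting.

```python
"""Rough per-model generation-speed sweep (sketch)."""
import re
import subprocess

MAIN_BIN = "./main"                              # assumed path to the built binary
MODELS = [                                       # placeholder model files
    "models/7B/ggml-model-q4_K_M.bin",
    "models/13B/ggml-model-q4_K_M.bin",
]
PROMPT = "Building a website can be done in 10 simple steps:"

for model in MODELS:
    proc = subprocess.run(
        [MAIN_BIN, "-m", model, "-p", PROMPT, "-n", "128", "-ngl", "99"],
        capture_output=True, text=True,
    )
    text = proc.stdout + proc.stderr
    # Format assumption: the generation line of the timing summary looks like
    # "llama_print_timings:  eval time = ... (  X ms per token, ...)".
    m = re.search(r"^llama_print_timings:\s+eval time.*?\(\s*([\d.]+) ms per token",
                  text, re.MULTILINE)
    speed = f"{1000.0 / float(m.group(1)):.1f} tok/s" if m else "n/a (adjust the regex)"
    print(f"{model}: {speed} generation")
```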
-
Ryzen 7950X3D, RTX 3090 (Win 11)
models: Q4_K_M
-
Any benchmark should be done at max context, as llama.cpp suffers severe performance degradation once the max context is hit. Edit: the degradation is not in generation speed, but in prompt processing speed.
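A minimal sketch of one way to check this, assuming the llama-cpp-python bindings and a GPU-enabled build: time the prefill of a short prompt against one close to the context limit and compare prompt tokens per second. The model path, context size, and token counts are placeholders.

```python
"""Sketch: prefill speed at short vs. near-full context."""
import time

from llama_cpp import Llama  # assumes the llama-cpp-python bindings are installed

MODEL = "models/13B/ggml-model-q4_K_M.bin"       # placeholder model file
N_CTX = 4096

llm = Llama(model_path=MODEL, n_ctx=N_CTX, n_gpu_layers=99, verbose=False)


def prefill_speed(n_prompt_tokens: int) -> float:
    """Prompt tokens processed per second for a prompt of roughly this length."""
    # Build a long prompt by repeating a short filler chunk of tokens.
    filler = llm.tokenize(b" benchmark", add_bos=False)
    tokens = (filler * (n_prompt_tokens // len(filler) + 1))[:n_prompt_tokens]
    prompt = llm.detokenize(tokens).decode("utf-8", errors="ignore")
    llm.reset()
    start = time.perf_counter()
    llm(prompt, max_tokens=1)                    # generation kept to a single token
    return n_prompt_tokens / (time.perf_counter() - start)


for n in (256, N_CTX - 64):                      # short prompt vs. near the limit
    print(f"{n:5d} prompt tokens: ~{prefill_speed(n):.1f} tok/s prefill")
```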
-
Dear all,
While comparing TheBloke/Wizard-Vicuna-13B-GPTQ with TheBloke/Wizard-Vicuna-13B-GGML, I get about the same generation times for GPTQ (4-bit, 128 group size, no act order) and GGML (q4_K_M). In both cases I'm pushing everything I can to the GPU; with a 4090 (24 GB of VRAM), that's between 50 and 100 tokens per second (GPTQ has a much more variable inference speed; GGML is pretty steady at ~82 tokens per second). A rough way to reproduce the GGML-side number is sketched at the end of this comment.
Is this a realistic comparison? If so, congratulations! I have watched the project grow over time, and it gives me much more peace of mind to be able to run these models without all the Python overhead and impossible-to-install libraries.
Thanks!
PS: I'm interested in seeing regular benchmarks being created, as discussed in #2038, but it doesn't look like that is happening any time soon?
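The sketch below shows how the GGML-side figure above could be reproduced, assuming the llama-cpp-python bindings, a CUDA-enabled build, and a local q4_K_M model file; the path and prompt are placeholders, and the printed number is whatever the hardware produces, not a reference result.

```python
"""Sketch: generation tokens-per-second with full GPU offload."""
import time

from llama_cpp import Llama  # assumes the llama-cpp-python bindings are installed

MODEL = "models/Wizard-Vicuna-13B.ggmlv3.q4_K_M.bin"   # placeholder model file

llm = Llama(model_path=MODEL, n_ctx=2048, n_gpu_layers=99, verbose=False)

prompt = "Explain the difference between GPTQ and GGML quantization."

start = time.perf_counter()
out = llm(prompt, max_tokens=256, temperature=0.0)
elapsed = time.perf_counter() - start

# The OpenAI-style response dict reports how many tokens were actually generated.
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f} s -> {n_tokens / elapsed:.1f} tok/s")
```

Note that the elapsed time includes prompt processing, so for short prompts this slightly understates pure generation speed.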