Replies: 6 comments 2 replies
-
For a 13B LLaMA model quantized with …
-
Sidenote: the model you chose is pretty old; use …
-
@longregen - re benchmarking etc.: it seems that all inference testing (perplexity, soon HellaSwag scoring, and so on) and latency measurement is currently done manually by devs on their individual machines. Regression bugs would be no surprise, and hallucinations or faulty generations would be easy for C++ devs testing their own code to miss. IMO a better solution would be to automate this.
GG put the call out for "Azure CI" (which isn't what is actually required; the CI/CD piece is working fine now, and "MLOps benchmarking and inference quality testing" is what is needed). Likely immersed in the C++ code, he didn't respond to comms regarding the MLOps side of things, and then decided to DIY it in bash. See 6460067.

I've spent a good bit of time investigating the short- to medium-term MLOps needs going forward and have done two code spikes: a cloud-scale medium-term plan in node.js, llama-cpp-ci-bench, and a quick-fix Python tool, scorecard. Neither has gotten much interest. A fit-for-purpose MLOps setup requires more than a free-tier amount of cloud compute, and is more than a few evenings' work (more like a few weeks), so this is not a "just put in a PR" type of thing.

That said, as an open-source SOTA project, perhaps the culture and focus is more about pushing the envelope: proving out new territory and implementing the latest arXiv SOTA papers in C++. Cutting edge, caveat emptor. Yet the world is desperately wanting dependable and performant inference engines to build upon.
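For illustration, here is a minimal sketch of the kind of regression gate such a setup could start from: run the repo's perplexity tool against a fixed test file and fail the job if the result drifts past a stored baseline. The binary path, model file, baseline value, and output-parsing patterns are assumptions to adjust for a local build, not anything the project ships.

```python
#!/usr/bin/env python3
"""Minimal perplexity regression gate (sketch, not an official tool)."""
import re
import subprocess
import sys

PERPLEXITY_BIN = "./perplexity"                 # assumed path to the built tool
MODEL = "models/13B/ggml-model-q4_0.bin"        # placeholder model file
TEST_FILE = "wikitext-2-raw/wiki.test.raw"      # placeholder eval text
BASELINE_PPL = 5.90                             # example known-good value
TOLERANCE = 0.02                                # allow 2% drift before failing


def parse_perplexity(text: str) -> float:
    # Format assumption: newer builds print "Final estimate: PPL = x.xxxx",
    # older ones print running per-chunk values like "[655]5.9208," -- try both.
    m = re.search(r"Final estimate: PPL = ([\d.]+)", text)
    if m:
        return float(m.group(1))
    chunks = re.findall(r"\[\d+\]([\d.]+)", text)
    if chunks:
        return float(chunks[-1])
    raise RuntimeError("could not parse a perplexity value from the output")


def main() -> int:
    proc = subprocess.run(
        [PERPLEXITY_BIN, "-m", MODEL, "-f", TEST_FILE],
        capture_output=True, text=True, check=True,
    )
    ppl = parse_perplexity(proc.stdout + proc.stderr)
    limit = BASELINE_PPL * (1.0 + TOLERANCE)
    print(f"perplexity = {ppl:.4f} (baseline {BASELINE_PPL:.4f}, limit {limit:.4f})")
    return 0 if ppl <= limit else 1  # nonzero exit fails the CI job


if __name__ == "__main__":
    sys.exit(main())
```

Pinning the model file, eval data, and build flags is what makes the numbers comparable across runs; the same pattern would extend to HellaSwag scores or latency percentiles.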
-
If anyone is interested, these are my run times on ~beefy consumer hardware with current master:
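For anyone wanting to collect comparable numbers, here is a rough sketch that scripts the main example binary and scrapes its timing summary. The binary path, model files, flag values, and the exact wording of the timing lines are assumptions; the summary format has changed across versions, so the regex may need adjusting.

```python
"""Rough per-model generation-speed sweep (sketch)."""
import re
import subprocess

MAIN_BIN = "./main"                              # assumed path to the built binary
MODELS = [                                       # placeholder model files
    "models/7B/ggml-model-q4_K_M.bin",
    "models/13B/ggml-model-q4_K_M.bin",
]
PROMPT = "Building a website can be done in 10 simple steps:"

for model in MODELS:
    proc = subprocess.run(
        [MAIN_BIN, "-m", model, "-p", PROMPT, "-n", "128", "-ngl", "99"],
        capture_output=True, text=True,
    )
    text = proc.stdout + proc.stderr
    # Format assumption: the generation line of the timing summary looks like
    # "llama_print_timings:  eval time = ... (  X ms per token, ...)".
    m = re.search(r"^llama_print_timings:\s+eval time.*?\(\s*([\d.]+) ms per token",
                  text, re.MULTILINE)
    speed = f"{1000.0 / float(m.group(1)):.1f} tok/s" if m else "n/a (adjust the regex)"
    print(f"{model}: {speed} generation")
```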
-
Ryzen 7950X3D, RTX 3090 (Win 11)
models: Q4_K_M
-
Any benchmark should be done at max context, as llama.cpp suffers severe performance degradation once the max context is hit. Edit: the degradation is not in generation speed, but in prompt processing speed.
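A minimal sketch of one way to check this, assuming the llama-cpp-python bindings and a GPU-enabled build: time the prefill of a short prompt against one close to the context limit and compare prompt tokens per second. The model path, context size, and token counts are placeholders.

```python
"""Sketch: prefill speed at short vs. near-full context."""
import time

from llama_cpp import Llama  # assumes the llama-cpp-python bindings are installed

MODEL = "models/13B/ggml-model-q4_K_M.bin"       # placeholder model file
N_CTX = 4096

llm = Llama(model_path=MODEL, n_ctx=N_CTX, n_gpu_layers=99, verbose=False)


def prefill_speed(n_prompt_tokens: int) -> float:
    """Prompt tokens processed per second for a prompt of roughly this length."""
    # Build a long prompt by repeating a short filler chunk of tokens.
    filler = llm.tokenize(b" benchmark", add_bos=False)
    tokens = (filler * (n_prompt_tokens // len(filler) + 1))[:n_prompt_tokens]
    prompt = llm.detokenize(tokens).decode("utf-8", errors="ignore")
    llm.reset()
    start = time.perf_counter()
    llm(prompt, max_tokens=1)                    # generation kept to a single token
    return n_prompt_tokens / (time.perf_counter() - start)


for n in (256, N_CTX - 64):                      # short prompt vs. near the limit
    print(f"{n:5d} prompt tokens: ~{prefill_speed(n):.1f} tok/s prefill")
```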
-
Dear all,
While comparing TheBloke/Wizard-Vicuna-13B-GPTQ with TheBloke/Wizard-Vicuna-13B-GGML, I get about the same generation times for GPTQ (4-bit, 128 group size, no act order) and GGML (q4_K_M). In both cases I'm pushing everything I can to the GPU; with a 4090 (24 GB of VRAM), that's between 50 and 100 tokens per second (GPTQ has a much more variable inference speed; GGML is pretty steady at ~82 tokens per second). A rough way to reproduce the GGML-side number is sketched at the end of this comment.
Is this a realistic comparison? If so, congratulations! I have watched the project grow over time, and it gives me much more peace of mind to be able to run these models without all the Python overhead and impossible-to-install libraries.
Thanks!
PS: I'm interested in seeing regular benchmarks being created, as discussed in #2038, but it doesn't look like that is happening any time soon?
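The sketch below shows how the GGML-side figure above could be reproduced, assuming the llama-cpp-python bindings, a CUDA-enabled build, and a local q4_K_M model file; the path and prompt are placeholders, and the printed number is whatever the hardware produces, not a reference result.

```python
"""Sketch: generation tokens-per-second with full GPU offload."""
import time

from llama_cpp import Llama  # assumes the llama-cpp-python bindings are installed

MODEL = "models/Wizard-Vicuna-13B.ggmlv3.q4_K_M.bin"   # placeholder model file

llm = Llama(model_path=MODEL, n_ctx=2048, n_gpu_layers=99, verbose=False)

prompt = "Explain the difference between GPTQ and GGML quantization."

start = time.perf_counter()
out = llm(prompt, max_tokens=256, temperature=0.0)
elapsed = time.perf_counter() - start

# The OpenAI-style response dict reports how many tokens were actually generated.
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f} s -> {n_tokens / elapsed:.1f} tok/s")
```

Note that the elapsed time includes prompt processing, so for short prompts this slightly understates pure generation speed.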