An evaluation framework for GGUF models using llama.cpp.
- Install llama.cpp.
- Clone this repository: `git clone https://github.com/kallewoof/gguf-eval.git`
- Change into the directory: `cd gguf-eval`
- Install the requirements: `pip install -r requirements.txt`
- Get some models.
- Evaluate on all available tasks: `python evaluate.py model1.gguf model2.gguf ...`
- If you are e.g. on Windows and not running WSL, you may need to pass `--disable_ansi`.
- Unless you installed llama.cpp so that it is available from the shell, you need to pass `--llama_path` pointing to the llama.cpp directory: `python evaluate.py --llama_path ../llama.cpp model1.gguf model2.gguf ...`
- If you need to pass arguments to llama.cpp for all models, you can use `--llama_args "\--arg1=x"`.
- If you need to pass arguments to llama.cpp for one specific model in your list only, you can use `--model_args`. The model name is matched as a substring, so `emotron` in the example below matches `nvidia-nemotron-49b.gguf`: `python evaluate.py llama-x.gguf nvidia-nemotron-49b.gguf GLM-4.5-Air.gguf --model_args emotron:"-ts 10/18" --model_args GLM-4.5-Air:"--n-cpu-moe 22 -ts 24/10"`
- You can select or exclude tasks using the `--tasks` argument: `python evaluate.py ... --tasks exclude:mmlu,hellaswag`. A combined example follows this list.
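Putting the setup steps together, a first run from scratch might look like the following. The model filenames here are placeholders, and the `--llama_path` value assumes a llama.cpp checkout sits next to this repository; adjust both for your setup.

```sh
# Clone the framework and install its Python dependencies.
git clone https://github.com/kallewoof/gguf-eval.git
cd gguf-eval
pip install -r requirements.txt

# Evaluate two local GGUF models (placeholder names), pointing at a
# sibling llama.cpp checkout and skipping the mmlu task.
python evaluate.py --llama_path ../llama.cpp \
    --tasks exclude:mmlu \
    my-model-a.gguf my-model-b.gguf
```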
After running `evaluate.py` on some tasks, you can plot the results. They look something like this (for `--overlay` mode):
- You need plotly: `pip install plotly`
- Run `python plot.py model1.gguf model2.gguf ...`
- You can use `--overlay` to display all models overlaid in one graph. The default is to show each one separately in a grid.
- You can normalize the scores using `--normalization`. There are two modes, `cap` and `range`. `cap` normalizes all models so that the best performing model gets a 100% score and the other models score proportionately to that; e.g. if the model scores are 0.1, 0.2, and 0.3, they are displayed as 33%, 67%, and 100% respectively. `range` normalizes so that 0% is the worst performing model and 100% is the best; the previous case would display as 0%, 50%, and 100%. The default is `none`. See the example after this list.
- The default behavior is to generate an HTML file and open it in your browser. You can instead use e.g. `--renderer=png` to output to a PNG file, although the quality of this is not great at the moment.
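For example, to overlay all models in one graph with `cap` normalization (assuming `--normalization` takes its mode as a value in the same `--flag=value` style shown for `--renderer`):

```sh
# Install the plotting dependency once.
pip install plotly

# Overlay all models in a single graph; with cap normalization the
# best model on each task reads as 100%.
python plot.py --overlay --normalization=cap model1.gguf model2.gguf
```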
In overlay mode, each model is prefixed with a number: the sum of that model's scores across all tasks. Models are also sorted by score in both grid and overlay mode.
If you see `error: invalid argument: -kvu`, update your llama.cpp installation.
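If you built llama.cpp from source, one way to update it is to pull and rebuild; the cmake invocation below follows llama.cpp's standard CPU build instructions, so adjust it if you build with a GPU backend.

```sh
# Pull the latest llama.cpp sources and rebuild.
cd ../llama.cpp
git pull
cmake -B build
cmake --build build --config Release
```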