An evaluation framework for GGUF models using llama.cpp.
- Install llama.cpp.
- Clone this repository: `git clone https://github.com/kallewoof/gguf-eval.git`
- Change into the directory: `cd gguf-eval`
- Install the requirements: `pip install -r requirements.txt`
- Get some models.
- Evaluate on all available tasks: `python evaluate.py model1.gguf model2.gguf ...`
- If you are e.g. on Windows and not running WSL, you may need to pass `--disable_ansi`.
- Unless you installed llama.cpp so that it is available from the shell, you need to pass `--llama_path` pointing to the llama.cpp directory: `python evaluate.py --llama_path ../llama.cpp model1.gguf model2.gguf ...`
- If you need to pass arguments to llama.cpp for all models, you can use `--llama_args "\--arg1=x"`.
- If you need to pass arguments to llama.cpp for one specific model in your list only, you can use `--model_args`. The model name is matched as a substring, so `emotron` in the example below matches `nvidia-nemotron-49b.gguf`: `python evaluate.py llama-x.gguf nvidia-nemotron-49b.gguf GLM-4.5-Air.gguf --model_args emotron:"-ts 10/18" --model_args GLM-4.5-Air:"--n-cpu-moe 22 -ts 24/10"`
- You can select or exclude tasks using the `--tasks` argument: `python evaluate.py ... --tasks exclude:mmlu,hellaswag`. A combined example follows this list.
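Putting the setup steps together, a first run from scratch might look like the following. The model filenames here are placeholders, and the `--llama_path` value assumes a llama.cpp checkout sits next to this repository; adjust both for your setup.

```sh
# Clone the framework and install its Python dependencies.
git clone https://github.com/kallewoof/gguf-eval.git
cd gguf-eval
pip install -r requirements.txt

# Evaluate two local GGUF models (placeholder names), pointing at a
# sibling llama.cpp checkout and skipping the mmlu task.
python evaluate.py --llama_path ../llama.cpp \
    --tasks exclude:mmlu \
    my-model-a.gguf my-model-b.gguf
```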
After running `evaluate.py` on some tasks, you can plot the results. They look something like this (for `--overlay` mode):
- You need plotly: `pip install plotly`
- Run `python plot.py model1.gguf model2.gguf ...`
- You can use `--overlay` to display all models overlaid in one graph. The default is to show each one separately in a grid.
- You can normalize the scores using `--normalization`. There are two modes, `cap` and `range`. `cap` normalizes all models so that the best performing model gets a 100% score and the other models score proportionately to that; e.g. if the model scores are 0.1, 0.2, and 0.3, they are displayed as 33%, 67%, and 100% respectively. `range` normalizes so that 0% is the worst performing model and 100% is the best; the previous case would display as 0%, 50%, and 100%. The default is `none`. See the example after this list.
- The default behavior is to generate an HTML file and open it in your browser. You can instead use e.g. `--renderer=png` to output to a PNG file, although the quality of this is not great at the moment.
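For example, to overlay all models in one graph with `cap` normalization (assuming `--normalization` takes its mode as a value in the same `--flag=value` style shown for `--renderer`):

```sh
# Install the plotting dependency once.
pip install plotly

# Overlay all models in a single graph; with cap normalization the
# best model on each task reads as 100%.
python plot.py --overlay --normalization=cap model1.gguf model2.gguf
```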
In overlay mode, each model is prefixed with a number: the sum of that model's scores across all tasks. Models are also sorted by score in both grid and overlay mode.
If you see `error: invalid argument: -kvu`, update your llama.cpp installation.
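If you built llama.cpp from source, one way to update it is to pull and rebuild; the cmake invocation below follows llama.cpp's standard CPU build instructions, so adjust it if you build with a GPU backend.

```sh
# Pull the latest llama.cpp sources and rebuild.
cd ../llama.cpp
git pull
cmake -B build
cmake --build build --config Release
```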