Test and compare different large language models on various tasks.
- Code Generation
- Mathematical Reasoning
- Creative Writing
- Data Analysis
- Logical Reasoning
- Summarisation
- Technical Explanation
- Problem Solving
Use the package manager pip to install following
## Prerequisites
python3 -m venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
python3 -m pip install --upgrade pip
## Set Environment Variables
OPENAI_API_KEY="your_api_key"
ANTHROPIC_API_KEY="your_api_key"
XAI_API_KEY="your_api_key"
DEEPSEEK_TOKEN="your_api_key"
GOOGLE_API_KEY="your_api_key"
MOONSHOT_API_KEY="your_api_key"
OPENROUTER_API_KEY="your_api_key"
MISTRAL_API_KEY="your_api_key"
## View Setup Guide
python3 -m scripts.benchmark setup
## Execute The Benchmark
python3 -m scripts.benchmark
## Deactivate Virtual Environment
deactivate
Generated results on the 15th July 2025.
================================================================================
LLM BENCHMARK SUMMARY
================================================================================
Avg Quality Quality Std Avg Latency Med Latency Avg Cost Total Cost Error Rate
model_name
Anthropic-claude-sonnet-4-20250514 89.376 12.846 15.611 11.989 0.018 0.294 0.00
DeepSeek-deepseek-chat 89.476 10.643 47.782 43.808 0.001 0.014 0.00
Google-gemini-2.5-pro 79.990 25.111 25.630 22.686 0.002 0.024 6.25
Mistral-mistral-medium-2505 87.411 13.699 15.303 12.847 0.005 0.076 0.00
Moonshot-moonshot-v1-8k 89.498 13.530 7.811 7.455 0.001 0.014 0.00
OpenAI-gpt-4.1 89.388 12.633 9.624 7.680 0.009 0.149 0.00
Qwen-qwen3-32b 43.802 46.133 50.752 48.419 0.000 0.000 0.00
xAI-grok-4-0709 91.358 9.587 27.256 27.061 0.025 0.395 0.00
================================================================================
Best Overall Quality: xAI-grok-4-0709
Fastest Response: Moonshot-moonshot-v1-8k
Most Cost-Effective: Qwen-qwen3-32b
📁 All results saved to results directory
🏆 Best Overall Model: xAI-grok-4-0709
📈 Overall Average Quality: 82.5%
💰 Total Cost: $0.9652
⚡ Average Latency: 24.97s
This project is licensed under the Modified MIT License.
@misc{llmbenchmark,
author = {Oketunji, A.F.},
title = {LLM Benchmark},
year = 2025,
version = {0.0.6},
publisher = {Zenodo},
doi = {10.5281/zenodo.15937532},
url = {https://doi.org/10.5281/zenodo.15937532}
}
(c) 2025 Finbarrs Oketunji. All Rights Reserved.