# Ollama Benchmarks

A comprehensive benchmarking suite for evaluating Ollama models on various performance metrics.
## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Project Structure](#project-structure)
- [Usage](#usage)
- [Benchmark Types](#benchmark-types)
- [Analyzing Results](#analyzing-results)
- [Workflow Visualization](#workflow-visualization)
- [Contributing](#contributing)
- [License](#license)
## Overview

Ollama Benchmarks is a toolset for rigorously testing and comparing the performance of different large language models running via Ollama. The suite measures critical metrics including inference speed, memory usage, and parameter efficiency across different prompts and configurations.
## Features

- Measure inference speed (tokens per second)
- Monitor memory consumption (RAM and VRAM)
- Evaluate parameter efficiency
- Test performance with varying context lengths
- Analyze and compare results across models
## Prerequisites

- Ollama installed and configured
- Bash shell environment
- Basic command line utilities (`bc`, `nvidia-smi` for GPU metrics)
- Python 3.x for results analysis
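As a quick sanity check, something like the following confirms the tools listed above are available; `nvidia-smi` is only needed if you want GPU metrics:

```bash
# Verify that the tools used by the benchmark scripts are on PATH.
for cmd in ollama bc python3; do
  command -v "$cmd" >/dev/null 2>&1 || echo "Missing required tool: $cmd"
done
# nvidia-smi is optional; without it, GPU metrics are simply unavailable.
command -v nvidia-smi >/dev/null 2>&1 || echo "nvidia-smi not found (GPU metrics unavailable)"
```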
## Project Structure

```
ollama-benchmarks/
├── benchmark_speed.sh      # Speed benchmarking script
├── benchmark_memory.sh     # Memory usage benchmarking script
├── benchmark_params.sh     # Parameter efficiency benchmarking
├── benchmark_context.sh    # Context length benchmarking
├── run_all_benchmarks.sh   # Script to run all benchmarks sequentially
├── analyze_results.py      # Python script to analyze and visualize results
├── prompts/                # Directory containing test prompts
│   ├── creative.txt        # Creative writing prompts
│   ├── short_qa.txt        # Question-answering prompts
│   └── long_context.txt    # Long context evaluation prompts
├── results/                # Directory where benchmark results are stored
└── logs/                   # Log files directory
```
## Usage

Each benchmark script follows a similar pattern:

```bash
./benchmark_[type].sh [MODEL_NAME] [CONFIG_NAME] [PROMPT_FILE]
```

For example:

```bash
./benchmark_speed.sh llama2 default prompts/short_qa.txt
```

To run all benchmark types for a specific model:

```bash
./run_all_benchmarks.sh [MODEL_NAME] [CONFIG_NAME]
```
## Benchmark Types

### Speed Benchmark

Measures inference speed in tokens per second for each prompt.

```bash
./benchmark_speed.sh [MODEL_NAME] [CONFIG_NAME] [PROMPT_FILE]
```

The script (see the sketch after this list):

- Processes each prompt in the specified file
- Measures generation time and token count
- Calculates tokens per second
- Outputs results to `results/[CONFIG_NAME]_speed_results.csv`
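A minimal sketch of the measurement for a single prompt, assuming the `ollama` CLI is on PATH and approximating token count by whitespace-separated words (the actual script may count tokens differently):

```bash
# Time one generation and derive a rough tokens-per-second figure.
model="llama2"
prompt="Explain what a context window is in one paragraph."

start=$(date +%s.%N)
output=$(ollama run "$model" "$prompt")
end=$(date +%s.%N)

elapsed=$(echo "$end - $start" | bc -l)
tokens=$(echo "$output" | wc -w)                    # crude proxy for token count
tps=$(echo "scale=2; $tokens / $elapsed" | bc -l)
echo "$model,$tokens,$elapsed,$tps"
```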
### Memory Benchmark

Measures CPU utilization, RAM, and VRAM usage during inference.

```bash
./benchmark_memory.sh [MODEL_NAME] [CONFIG_NAME] [PROMPT_FILE]
```

The script (see the sketch after this list):

- Runs the model in the background
- Samples CPU usage and memory consumption
- Detects GPU memory usage if applicable
- Outputs results to `results/[CONFIG_NAME]_memory_results.csv`
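A simplified view of the sampling approach: the generation runs in the background while resident memory and (optionally) GPU memory are polled once per second. The process lookup for the Ollama server is an assumption about the local setup:

```bash
# Launch one generation in the background and sample memory until it finishes.
ollama run llama2 "Summarize the plot of Hamlet." > /dev/null &
client_pid=$!
server_pid=$(pgrep -f "ollama serve" | head -n1)    # the server process holds the model weights
server_pid=${server_pid:-$client_pid}               # fall back to the client if not found

while kill -0 "$client_pid" 2>/dev/null; do
  rss_kb=$(ps -o rss= -p "$server_pid" | tr -d ' ')
  if command -v nvidia-smi >/dev/null 2>&1; then
    vram_mb=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n1)
  else
    vram_mb="N/A"
  fi
  echo "$(date +%s),$rss_kb,$vram_mb"
  sleep 1
done
```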
### Parameter Efficiency Benchmark

Evaluates how efficiently the model uses its parameters across different prompt types.

```bash
./benchmark_params.sh [MODEL_NAME] [CONFIG_NAME] [PROMPT_FILE]
```
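The exact efficiency metric is defined by `benchmark_params.sh`; one common interpretation, shown here purely as an illustration with made-up numbers, is throughput normalized by parameter count:

```bash
# Illustrative only: tokens/sec per billion parameters as a simple efficiency proxy.
tokens_per_second=42.5   # e.g. taken from the speed benchmark output
params_billion=7         # e.g. a 7B model
echo "scale=2; $tokens_per_second / $params_billion" | bc -l   # prints 6.07
```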
### Context Length Benchmark

Tests model performance with varying context window sizes.

```bash
./benchmark_context.sh [MODEL_NAME] [CONFIG_NAME]
```
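Conceptually this means timing the same model at several context sizes. The sketch below does that through Ollama's HTTP API using the standard `num_ctx` option; the prompt and context sizes are illustrative assumptions rather than the script's actual values:

```bash
# Time one non-streaming generation at each context size via the local Ollama API.
model="llama2"
prompt="Describe the history of computing in a few sentences."
for ctx in 512 1024 2048 4096; do
  start=$(date +%s.%N)
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$model\",\"prompt\":\"$prompt\",\"stream\":false,\"options\":{\"num_ctx\":$ctx}}" \
    > /dev/null
  end=$(date +%s.%N)
  echo "$ctx,$(echo "$end - $start" | bc -l)"
done
```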
## Analyzing Results

After running benchmarks, analyze the results using the provided Python script:

```bash
python analyze_results.py [CONFIG_NAME]
```

This will generate visualizations and summary statistics for all benchmarks with the specified configuration name.
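For example, a typical end-to-end run, reusing the example model and configuration name from the Usage section, might look like:

```bash
# Run every benchmark type for one model, then analyze that configuration's results.
./run_all_benchmarks.sh llama2 default
python analyze_results.py default
```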
## Workflow Visualization

```mermaid
graph TD
    A[Select Model] --> B[Choose Benchmark Type]
    B --> C1[Speed Benchmark]
    B --> C2[Memory Benchmark]
    B --> C3[Parameter Benchmark]
    B --> C4[Context Length Benchmark]
    C1 --> D1[Generate CSV Results]
    C2 --> D2[Generate CSV Results]
    C3 --> D3[Generate CSV Results]
    C4 --> D4[Generate CSV Results]
    D1 --> E[Analyze Results]
    D2 --> E
    D3 --> E
    D4 --> E
    E --> F[Generate Visualizations]
    F --> G[Compare Models]
```
```mermaid
sequenceDiagram
    participant User
    participant Benchmark Script
    participant Ollama
    participant Results File
    User->>Benchmark Script: Run with model & prompts
    Benchmark Script->>Ollama: Execute prompt
    Note over Benchmark Script: Start timer
    Ollama->>Benchmark Script: Return generated text
    Note over Benchmark Script: Stop timer
    Benchmark Script->>Benchmark Script: Calculate metrics
    Benchmark Script->>Results File: Write results
    Benchmark Script->>User: Display summary
```
## Contributing

Contributions are welcome! To contribute:
- Fork the repository
- Create a new branch for your feature
- Add your changes
- Submit a pull request
Please ensure your code follows the project's style guidelines and includes appropriate tests.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Bjorn Melin, 2025