🧠 A comprehensive toolkit for benchmarking, optimizing, and deploying local Large Language Models. Includes performance testing tools, optimized configurations for CPU/GPU/hybrid setups, and detailed guides to maximize LLM performance on your hardware.

🚀 Ollama Benchmarks

A comprehensive benchmarking suite for evaluating Ollama models on various performance metrics.

📋 Table of Contents

  • 🔭 Overview
  • ✨ Features
  • 🛠️ Prerequisites
  • 📁 Project Structure
  • 🚀 Usage
  • 📊 Benchmark Types
  • 📈 Analyzing Results
  • 📊 Workflow Visualization
  • 👥 Contributing
  • 📜 License

🔭 Overview

Ollama Benchmarks is a toolset for rigorously testing and comparing the performance of different large language models running via Ollama. The suite measures critical metrics including inference speed, memory usage, and parameter efficiency across different prompts and configurations.

✨ Features

  • 📊 Measure inference speed (tokens per second)
  • 💾 Monitor memory consumption (RAM and VRAM)
  • 📏 Evaluate parameter efficiency
  • 📚 Test performance with varying context lengths
  • 📈 Analyze and compare results across models

πŸ› οΈ Prerequisites

  • Ollama installed and configured
  • Bash shell environment
  • Basic command line utilities (bc, nvidia-smi for GPU metrics)
  • Python 3.x for results analysis
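
A quick way to confirm these tools are available before running any benchmarks (nvidia-smi is only needed if you want GPU metrics):

# Check that the required tools are on PATH.
for tool in ollama bc python3; do
  command -v "$tool" >/dev/null || echo "Missing required tool: $tool"
done
command -v nvidia-smi >/dev/null || echo "nvidia-smi not found; GPU metrics will be unavailable"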

πŸ“ Project Structure

ollama-benchmarks/
├── benchmark_speed.sh      # Speed benchmarking script
├── benchmark_memory.sh     # Memory usage benchmarking script
├── benchmark_params.sh     # Parameter efficiency benchmarking
├── benchmark_context.sh    # Context length benchmarking
├── run_all_benchmarks.sh   # Script to run all benchmarks sequentially
├── analyze_results.py      # Python script to analyze and visualize results
├── prompts/                # Directory containing test prompts
│   ├── creative.txt        # Creative writing prompts
│   ├── short_qa.txt        # Question-answering prompts
│   └── long_context.txt    # Long context evaluation prompts
├── results/                # Directory where benchmark results are stored
└── logs/                   # Log files directory

🚀 Usage

Running Individual Benchmarks

Each benchmark script follows a similar pattern:

./benchmark_[type].sh [MODEL_NAME] [CONFIG_NAME] [PROMPT_FILE]

For example:

./benchmark_speed.sh llama2 default prompts/short_qa.txt

Running All Benchmarks

To run all benchmark types for a specific model:

./run_all_benchmarks.sh [MODEL_NAME] [CONFIG_NAME]
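
For example, to run every benchmark for the llama2 model with a configuration named default:

./run_all_benchmarks.sh llama2 default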

📊 Benchmark Types

Speed Benchmark

Measures inference speed in tokens per second for each prompt.

./benchmark_speed.sh [MODEL_NAME] [CONFIG_NAME] [PROMPT_FILE]

The script (a simplified sketch of its timing loop follows this list):

  • Processes each prompt in the specified file
  • Measures generation time and token count
  • Calculates tokens per second
  • Outputs results to results/[CONFIG_NAME]_speed_results.csv
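
The sketch below is illustrative only, not the actual benchmark_speed.sh. It assumes one prompt per line in the prompt file and approximates the token count with a word count; the real script may measure tokens more precisely.

# Illustrative timing loop (not the real benchmark_speed.sh).
MODEL="$1"; CONFIG="$2"; PROMPT_FILE="$3"
mkdir -p results
while IFS= read -r PROMPT; do
  START=$(date +%s.%N)                        # GNU date with sub-second precision
  OUTPUT=$(ollama run "$MODEL" "$PROMPT")
  END=$(date +%s.%N)
  ELAPSED=$(echo "$END - $START" | bc)
  TOKENS=$(echo "$OUTPUT" | wc -w)            # word count as a rough token proxy
  TPS=$(echo "scale=2; $TOKENS / $ELAPSED" | bc)
  echo "$MODEL,$CONFIG,$TOKENS,$ELAPSED,$TPS" >> "results/${CONFIG}_speed_results.csv"
done < "$PROMPT_FILE"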

Memory Benchmark

Measures CPU utilization, RAM, and VRAM usage during inference.

./benchmark_memory.sh [MODEL_NAME] [CONFIG_NAME] [PROMPT_FILE]

The script (a simplified sketch of this sampling approach follows this list):

  • Runs the model in the background
  • Samples CPU usage and memory consumption
  • Detects GPU memory usage if applicable
  • Outputs results to results/[CONFIG_NAME]_memory_results.csv
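
Again as an illustration only, not the actual benchmark_memory.sh: one generation is launched in the background and standard system tools are polled once per second while it runs. Note that the model itself executes inside the Ollama server process, so the real script may sample that process rather than the client shown here.

# Illustrative memory-sampling loop (not the real benchmark_memory.sh).
MODEL="$1"; CONFIG="$2"; PROMPT_FILE="$3"
mkdir -p results
ollama run "$MODEL" "$(cat "$PROMPT_FILE")" > /dev/null &
PID=$!
while kill -0 "$PID" 2>/dev/null; do
  RAM_MB=$(free -m | awk '/^Mem:/ {print $3}')    # system RAM in use (MB)
  CPU_PCT=$(ps -o %cpu= -p "$PID" | tr -d ' ')    # CPU usage of the client process
  if command -v nvidia-smi >/dev/null; then
    VRAM_MB=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n1)
  else
    VRAM_MB="NA"
  fi
  echo "$MODEL,$CONFIG,$CPU_PCT,$RAM_MB,$VRAM_MB" >> "results/${CONFIG}_memory_results.csv"
  sleep 1
done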

Parameter Benchmark

Evaluates how efficiently the model uses its parameters across different prompt types.

./benchmark_params.sh [MODEL_NAME] [CONFIG_NAME] [PROMPT_FILE]
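
The exact definition of parameter efficiency is left to the script; one plausible formulation (an assumption here, not necessarily what benchmark_params.sh computes) is throughput normalized by model size, e.g. tokens per second per billion parameters:

# Hypothetical numbers: 45.2 tokens/s measured on a 7B-parameter model.
echo "scale=2; 45.2 / 7" | bc    # ≈ 6.45 tokens/s per billion parameters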

Context Length Benchmark

Tests model performance with varying context window sizes.

./benchmark_context.sh [MODEL_NAME] [CONFIG_NAME]
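
As a rough sketch of what a context-length sweep can look like (not the actual benchmark_context.sh), the context window can be varied through the num_ctx option of the Ollama HTTP API. This assumes the Ollama server is running on its default port (11434) and that jq is available, which is not listed in the prerequisites above:

# Illustrative context-window sweep via the Ollama HTTP API.
MODEL="llama2"                                   # example model name
PROMPT="$(cat prompts/long_context.txt)"
for NUM_CTX in 2048 4096 8192; do
  RESPONSE=$(curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$MODEL\", \"prompt\": $(jq -Rs . <<< "$PROMPT"), \"stream\": false, \"options\": {\"num_ctx\": $NUM_CTX}}")
  EVAL_COUNT=$(echo "$RESPONSE" | jq '.eval_count')        # generated tokens
  EVAL_NS=$(echo "$RESPONSE" | jq '.eval_duration')        # generation time in ns
  echo "num_ctx=$NUM_CTX tokens/s=$(echo "scale=2; $EVAL_COUNT / ($EVAL_NS / 1000000000)" | bc)"
done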

📈 Analyzing Results

After running benchmarks, analyze the results using the provided Python script:

python analyze_results.py [CONFIG_NAME]

This will generate visualizations and summary statistics for all benchmarks with the specified configuration name.
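
For example, if the benchmarks above were run with the configuration name default:

ls results/default_*_results.csv   # confirm which result files were produced
python analyze_results.py default  # generate visualizations and summary statistics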

📊 Workflow Visualization

graph TD
    A[Select Model] --> B[Choose Benchmark Type]
    B --> C1[Speed Benchmark]
    B --> C2[Memory Benchmark]
    B --> C3[Parameter Benchmark]
    B --> C4[Context Length Benchmark]
    
    C1 --> D1[Generate CSV Results]
    C2 --> D2[Generate CSV Results]
    C3 --> D3[Generate CSV Results]
    C4 --> D4[Generate CSV Results]
    
    D1 --> E[Analyze Results]
    D2 --> E
    D3 --> E
    D4 --> E
    
    E --> F[Generate Visualizations]
    F --> G[Compare Models]

Sample Benchmark Process

sequenceDiagram
    participant User
    participant Benchmark Script
    participant Ollama
    participant Results File
    
    User->>Benchmark Script: Run with model & prompts
    Benchmark Script->>Ollama: Execute prompt
    Note over Benchmark Script: Start timer
    Ollama->>Benchmark Script: Return generated text
    Note over Benchmark Script: Stop timer
    Benchmark Script->>Benchmark Script: Calculate metrics
    Benchmark Script->>Results File: Write results
    Benchmark Script->>User: Display summary

👥 Contributing

Contributions are welcome! To contribute:

  1. Fork the repository
  2. Create a new branch for your feature
  3. Add your changes
  4. Submit a pull request

Please ensure your code follows the project's style guidelines and includes appropriate tests.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


Created and maintained by Bjorn Melin, 2025
