The Local LLM Comparator is an interactive tool for systematically evaluating and comparing multiple local large language models (LLMs) side by side.
- Objective Comparison: Manually compare responses from different models without bias
- Flexible Evaluation: Test models on any prompt or task
- Comprehensive Ranking: Score and rank models based on pairwise comparisons
- Logging and Analysis: Automatically log comparisons and generate ranking reports
Enter your prompt in the `main` function. The tool then:

- Presents responses from two models at a time
- Lets you choose which response is better (or skip)
- Tracks direct wins between models
- Calculates scores that weigh both win rate and comparison frequency (see the sketch below)
- Generates a comprehensive ranking
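The exact formula lives in the evaluator, but the idea is to reward a high win rate while discounting models that have only been compared a few times. A minimal sketch of that kind of score, assuming a simple smoothing term rather than the tool's actual metric:

```python
def score(wins: int, comparisons: int, smoothing: int = 2) -> float:
    """Illustrative score: win rate weighted by comparison count.

    This is an assumed formula, not the tool's actual metric.
    """
    if comparisons == 0:
        return 0.0
    win_rate = wins / comparisons
    # Discount models that have been compared only a few times.
    confidence = comparisons / (comparisons + smoothing)
    return win_rate * confidence

print(score(wins=8, comparisons=10))  # frequently compared model
print(score(wins=1, comparisons=1))   # single win, heavily discounted
```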
- Ollama (for local model inference)
- One or more models pulled through Ollama
- Required Python packages:

```
ollama
requests
```
```bash
# Clone the repository
git clone https://github.com/yourusername/local-llm-comparator.git
cd local-llm-comparator

# Install dependencies
pip install -r requirements.txt

# Ensure Ollama is installed and models are pulled
ollama pull llama3.2
ollama pull gemma2:2b
# Pull any other models you want to compare
```
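You can confirm that the models are available locally with:

```bash
ollama list
```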
Modify the `_configure_model_clients()` method in the `ModelComparisonEvaluator` class to customize your model list:
```python
def _configure_model_clients(self) -> Dict[str, ModelClient]:
    # Map each model name to a client; a dict comprehension alone
    # is enough, no extra unpacking needed.
    return {
        model: OllamaClient()
        for model in [
            "llama3.2:latest",
            "gemma2:2b",
            # Add or remove models as needed
        ]
    }
```
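`ModelClient` and `OllamaClient` ship with the repository. For context, a client like this could be built on the `ollama` Python package's `chat()` call; the class and method below are a sketch of that idea, not the repository's actual implementation:

```python
import ollama

class OllamaClient:
    """Minimal sketch: get one response from a local Ollama model."""

    def generate(self, model: str, prompt: str) -> str:
        # ollama.chat() sends a chat request to the local Ollama server.
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response["message"]["content"]
```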
In the `main` function, update the prompt and run the script:
```python
import requests

def main():
    # Example: extracting job details from a URL
    url = "your_target_url"
    response = requests.get(url)
    page_content = response.text

    prompt = f"""Your specific task description.

Content to analyze:
{page_content}"""

    evaluator = ModelComparisonEvaluator()
    evaluator.run_comprehensive_comparisons(prompt)
    evaluator.print_rankings()

if __name__ == "__main__":
    main()
```
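Then run the script (the filename here is an assumption; use the repository's actual entry point):

```bash
python main.py
```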
The tool generates two key outputs:
- Comparison Log (`comparison_log.txt`)
  - Tracks which model won each comparison
  - Includes the timestamp of each comparison
- Rankings CSV (`model_rankings_[timestamp].csv`)
  - Contains the final model rankings
  - Includes advanced scoring metrics
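A quick way to inspect the most recent rankings file, assuming only that the timestamp in the filename sorts lexicographically (no column names are assumed):

```python
import csv
import glob

# Pick the most recently written rankings file; relies on the
# filename timestamp sorting lexicographically.
latest = max(glob.glob("model_rankings_*.csv"))

with open(latest, newline="") as f:
    for row in csv.DictReader(f):
        print(row)
```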