Modelus is a modular testing framework built to evaluate and compare foundation language models across structured tasks — with a current focus on AI-powered presentation generation.
Forget vibes-based benchmarking. Modelus runs test cases across multiple providers (OpenAI, Anthropic, Mistral, Google, Meta, etc.), logs every metric, and tells you exactly what worked (and what didn’t).
- 🔁 Multi-Model Support: Plug-and-play with OpenAI, Anthropic, Gemini, Mistral, Meta, Cohere & more
- 📊 Structured Evaluation: Tests content quality, consistency, formatting, creativity, personalization, efficiency, etc.
- 📈 Comprehensive Reports: Generate detailed reports with scores, breakdowns, and visual comparisons
- 💰 API Usage & Cost Tracking: Monitor token usage and estimate expenses per run
- 📦 Extensible Design: Easily add models, evaluation logic, test cases, or new scoring dimensions
Project layout:

```
modelus/
├── clients/              # Provider-specific model wrappers
├── configs/
│   ├── models/           # YAML configs per provider (e.g. openai.yaml)
│   └── test_config.yaml  # Main test plan
├── evaluators/           # Modular metric evaluators
├── reporting/            # Report generation & visualization
├── test_cases/           # JSON-based test definitions
├── utils/                # Logging, costs, file ops
├── client_factory.py     # Client abstraction layer
├── test_executor.py      # Main test orchestrator
├── test_integration.py   # CLI integration runner
└── README.md             # This file
```
Clone the repo and install the dependencies:

```bash
git clone https://github.com/yourusername/modelus
cd modelus
pip install -r requirements.txt
```

Then set your `.env` with the required API keys:

```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
COHERE_API_KEY=...
```
Run the full test suite:

```bash
python -m foundation_model_testing
```
Run a subset of models and test categories:

```bash
python -m foundation_model_testing \
  --models gpt-4o claude-3-opus mistral-large \
  --categories content formatting
```

List supported models or tests:

```bash
python -m foundation_model_testing --list-models
python -m foundation_model_testing --list-tests
```

Use a custom config and export directory:

```bash
python -m foundation_model_testing \
  --config my_config.yaml \
  --output-dir results/my_run
```
All model and test definitions are YAML-driven. For example, `configs/test_config.yaml`:
```yaml
selected_models:
  - gpt-4o
  - claude-3-opus
  - gemini-1.5-pro
  - mistral-large
  - llama-3-70b

categories:
  - content
  - formatting
  - consistency
  - creativity
  - efficiency
  - personalization
  - ppt

metrics:
  primary: "overall_score"
  weights:
    content: 0.25
    formatting: 0.15
    consistency: 0.15
    creativity: 0.15
    efficiency: 0.15
    personalization: 0.10
    ppt: 0.05
```
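For reference, `overall_score` is presumably just the weighted combination of per-category scores using the weights above. A minimal sketch of that aggregation (the framework's actual logic lives in `evaluators/` and `reporting/` and may differ):

```python
# Illustrative weighted-average combination; not the framework's actual code.
weights = {
    "content": 0.25, "formatting": 0.15, "consistency": 0.15,
    "creativity": 0.15, "efficiency": 0.15, "personalization": 0.10, "ppt": 0.05,
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Combine per-category scores (0-1) into a single weighted score."""
    return sum(weights[c] * category_scores.get(c, 0.0) for c in weights)
```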
After each run, you’ll get a full report (Markdown, JSON, CSV) under `results/reports/`.
This includes:
- Overall and per-category scores
- Cost summaries (tokens + estimated $)
- Heatmaps, radar charts, and ranking tables
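If you want to post-process results yourself, the JSON report can be loaded directly. A minimal sketch, assuming one entry per model with an `overall_score` and a cost field; the file name and keys here are illustrative, not the actual report schema:

```python
# Illustrative only: the report file name and keys are assumptions,
# not the framework's documented schema.
import json
from pathlib import Path

report_path = Path("results/reports") / "latest_report.json"  # hypothetical name
report = json.loads(report_path.read_text())

# Quick per-model summary.
for model, result in report.items():
    print(model, result.get("overall_score"), result.get("estimated_cost_usd"))
```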
To add a new model provider:

- Add a config in `configs/models/yourprovider.yaml`
- Create `YourProviderClient` in `clients/`
- Register it in `client_factory.py` (see the sketch after this list)
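A minimal sketch of what a new client might look like, assuming a simple `generate()`-style interface that returns text plus token usage for cost tracking. The class name, method names, endpoint, and env var are all illustrative; check `clients/` and `client_factory.py` for the real interface:

```python
# Hypothetical provider client -- names, endpoint, and return shape are
# illustrative, not the framework's actual interface.
import os
import requests


class YourProviderClient:
    """Minimal wrapper around a hypothetical provider's completion API."""

    def __init__(self, model: str, api_key: str | None = None):
        self.model = model
        self.api_key = api_key or os.environ["YOURPROVIDER_API_KEY"]

    def generate(self, prompt: str, **kwargs) -> dict:
        """Send a prompt and return text plus token usage for cost tracking."""
        response = requests.post(
            "https://api.yourprovider.example/v1/chat",  # placeholder endpoint
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model, "prompt": prompt, **kwargs},
            timeout=60,
        )
        response.raise_for_status()
        data = response.json()
        return {
            "text": data["output"],
            "input_tokens": data["usage"]["input_tokens"],
            "output_tokens": data["usage"]["output_tokens"],
        }
```

Then map the provider/model name to the new class in `client_factory.py` so the test executor can instantiate it from the YAML config.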
To add a new evaluation metric:

- Create a new evaluator class in `evaluators/`
- Inherit from `BaseEvaluator`
- Register it in `evaluator_factory.py` (a sketch follows below)
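A minimal sketch of a custom evaluator, assuming `BaseEvaluator` exposes an `evaluate(test_case, model_output)` hook returning a 0–1 score; the import path, method signature, and `category` attribute are assumptions about the framework's interface:

```python
# Hypothetical evaluator -- the BaseEvaluator import path, method signature,
# and `category` attribute are assumptions.
from evaluators.base_evaluator import BaseEvaluator  # assumed module path


class ConcisenessEvaluator(BaseEvaluator):
    """Scores how close the output length is to a target word count."""

    category = "conciseness"  # used for grouping/weighting in reports (assumed)

    def evaluate(self, test_case: dict, model_output: str) -> float:
        target = test_case.get("target_words", 150)
        actual = len(model_output.split())
        # Score decays linearly as the output drifts from the target length.
        return max(0.0, 1.0 - abs(actual - target) / target)
```

Registering the class in `evaluator_factory.py` (and adding its category and weight to the config) would then let it appear in reports alongside the built-in metrics.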
To add a new test case, drop a JSON file into any folder in `test_cases/` with:
```json
{
  "name": "Test Presentation Bullet Structuring",
  "input": "Create a 5-slide deck about quantum computing",
  "expected_structure": ["intro", "history", "applications", "challenges", "future"],
  "category": "content"
}
```
- Python 3.10+
- Dependencies listed in `requirements.txt` and `environment.yaml`
MIT License — use it, fork it, break it, build it better.
Modelus was designed to be pragmatic and modular — test ideas fast, compare cleanly, and build your own flavors of evaluation.
If you want help integrating new types of tasks (code gen, doc QA, summarization), ping me.