🧪 Modelus

Foundation model testing, but with receipts — structured test suites, metric breakdowns, and API-aware reporting.

Modelus is a modular testing framework built to evaluate and compare foundation language models across structured tasks — with a current focus on AI-powered presentation generation.

Forget vibes-based benchmarking. Modelus runs test cases across multiple providers (OpenAI, Anthropic, Mistral, Google, Meta, etc.), logs every metric, and tells you exactly what worked (and what didn’t).


🚀 Features

  • 🔁 Multi-Model Support: Plug-and-play with OpenAI, Anthropic, Gemini, Mistral, Meta, Cohere & more
  • 📊 Structured Evaluation: Scores content quality, formatting, consistency, creativity, efficiency, personalization, and presentation (ppt) structure
  • 📈 Comprehensive Reports: Generate detailed reports with scores, breakdowns, and visual comparisons
  • 💰 API Usage & Cost Tracking: Monitor token usage and estimate expenses per run
  • 📦 Extensible Design: Easily add models, evaluation logic, test cases, or new scoring dimensions

🗂️ Project Structure

modelus/
├── clients/               # Provider-specific model wrappers
├── configs/
│   ├── models/            # YAML configs per provider (e.g. openai.yaml)
│   └── test_config.yaml   # Main test plan
├── evaluators/            # Modular metric evaluators
├── reporting/             # Report generation & visualization
├── test_cases/            # JSON-based test definitions
├── utils/                 # Logging, costs, file ops
├── client_factory.py      # Client abstraction layer
├── test_executor.py       # Main test orchestrator
├── test_integration.py    # CLI integration runner
└── README.md              # This file

⚙️ Installation

git clone https://github.com/CodexAbhi/Modelus
cd Modelus
pip install -r requirements.txt

Then create a .env file with the required API keys:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
COHERE_API_KEY=...
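
These keys just need to be in the process environment when Modelus runs. If you want to load them yourself, here's a minimal sketch assuming python-dotenv (check requirements.txt; this is not necessarily how Modelus loads them internally):

# Minimal sketch: pull API keys from .env into the process environment.
# Assumes the python-dotenv package; Modelus's own loading code may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
openai_key = os.environ["OPENAI_API_KEY"]  # KeyError if the key is missing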

🧪 Running Tests

▶️ Basic Run

python -m foundation_model_testing

🛠 Advanced Options

Run a subset of models and test categories:

python -m foundation_model_testing \
  --models gpt-4o claude-3-opus mistral-large \
  --categories content formatting

List supported models or tests:

python -m foundation_model_testing --list-models
python -m foundation_model_testing --list-tests

Use a custom config and export directory:

python -m foundation_model_testing \
  --config my_config.yaml \
  --output-dir results/my_run

📝 Configurable Test Plans

All model and test definitions are YAML-driven.

✅ Example test_config.yaml

selected_models:
  - gpt-4o
  - claude-3-opus
  - gemini-1.5-pro
  - mistral-large
  - llama-3-70b

categories:
  - content
  - formatting
  - consistency
  - creativity
  - efficiency
  - personalization
  - ppt

metrics:
  primary: "overall_score"
  weights:
    content: 0.25
    formatting: 0.15
    consistency: 0.15
    creativity: 0.15
    efficiency: 0.15
    personalization: 0.10
    ppt: 0.05
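
Since the weights above sum to 1.0, overall_score works out to a weighted average of the per-category scores. A minimal sketch of that aggregation (the real scoring logic lives in evaluators/; names here are illustrative):

# Illustrative aggregation: weighted average of per-category scores (0–1 each).
# Weights mirror test_config.yaml; the actual evaluator code may differ.
WEIGHTS = {
    "content": 0.25, "formatting": 0.15, "consistency": 0.15,
    "creativity": 0.15, "efficiency": 0.15, "personalization": 0.10,
    "ppt": 0.05,
}

def overall_score(category_scores: dict[str, float]) -> float:
    return sum(WEIGHTS[cat] * score for cat, score in category_scores.items())

print(overall_score({"content": 0.9, "formatting": 0.6, "consistency": 0.8,
                     "creativity": 0.7, "efficiency": 0.8,
                     "personalization": 0.75, "ppt": 0.8}))  # ≈ 0.775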

📊 Reporting & Output

After each run, you'll get a full report (Markdown, JSON, CSV) under:

results/reports/

This includes:

  • Overall and per-category scores
  • Cost summaries (tokens + estimated $), with the arithmetic sketched below
  • Heatmaps, radar charts, and ranking tables
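
The cost estimate is the standard token-based arithmetic. A minimal sketch (prices below are placeholders, not Modelus's pricing table; check your provider's current rates):

# Illustrative per-call cost estimate: token counts times per-token price.
# Prices are placeholder USD values per 1M tokens, not real provider rates.
PRICES = {"example-model": {"input": 2.50, "output": 10.00}}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICES[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

print(estimate_cost("example-model", 1_200, 400))  # 0.007 USD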

🧩 Extending Modelus

➕ Add a New Model Provider

  1. Add a config in configs/models/yourprovider.yaml
  2. Create YourProviderClient in clients/
  3. Register it in client_factory.py (see the sketch below)
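
A minimal sketch of steps 2 and 3. The class shape, method names, and registry here are assumptions; mirror whatever clients/ and client_factory.py actually define:

# clients/yourprovider_client.py: hypothetical shape, not the real interface.
class YourProviderClient:
    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model

    def generate(self, prompt: str) -> str:
        # Call your provider's SDK or HTTP API here and return the text.
        raise NotImplementedError

# client_factory.py: hypothetical dict registry; the real factory may differ.
CLIENT_REGISTRY = {"yourprovider": YourProviderClient}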

➕ Add a New Evaluation Metric

  1. Create a new evaluator class in evaluators/
  2. Inherit from BaseEvaluator
  3. Register it in evaluator_factory.py (see the sketch below)
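
A minimal sketch: BaseEvaluator comes from the project layout above, but the import path, method name, and signature are assumptions, so check the real base class:

# evaluators/brevity_evaluator.py: illustrative; match BaseEvaluator's real API.
from evaluators.base_evaluator import BaseEvaluator  # hypothetical module path

class BrevityEvaluator(BaseEvaluator):
    """Scores responses higher the closer they stay to a target word count."""

    def evaluate(self, response: str, test_case: dict) -> float:
        target = test_case.get("target_words", 200)
        ratio = len(response.split()) / target
        return max(0.0, 1.0 - abs(1.0 - ratio))  # 1.0 at target, tapering to 0.0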

➕ Add New Test Cases

Drop a JSON file into any folder in test_cases/ with:

{
  "name": "Test Presentation Bullet Structuring",
  "input": "Create a 5-slide deck about quantum computing",
  "expected_structure": ["intro", "history", "applications", "challenges", "future"],
  "category": "content"
}
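
Test definitions are discovered from disk at run time. A minimal loader sketch (how Modelus actually discovers files is defined in the orchestrator code and may differ):

# Illustrative discovery: read every JSON test definition under test_cases/.
import json
from pathlib import Path

def load_test_cases(root: str = "test_cases") -> list[dict]:
    return [json.loads(p.read_text(encoding="utf-8"))
            for p in sorted(Path(root).rglob("*.json"))]

for case in load_test_cases():
    print(case["name"], "->", case["category"])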

📦 Requirements

  • Python 3.10+
  • Dependencies listed in requirements.txt and environment.yaml

⚖️ License

MIT License — use it, fork it, break it, build it better.


👋 A Note

Modelus was designed to be pragmatic and modular — test ideas fast, compare cleanly, and build your own flavors of evaluation.

If you want help integrating new types of tasks (code gen, doc QA, summarization), ping me.
