Modelus is a modular testing framework built to evaluate and compare foundation language models across structured tasks — with a current focus on AI-powered presentation generation.
Forget vibes-based benchmarking. Modelus runs test cases across multiple providers (OpenAI, Anthropic, Mistral, Google, Meta, etc.), logs every metric, and tells you exactly what worked (and what didn’t).
- 🔁 Multi-Model Support: Plug-and-play with OpenAI, Anthropic, Gemini, Mistral, Meta, Cohere & more
- 📊 Structured Evaluation: Tests content quality, consistency, formatting, creativity, personalization, efficiency, etc.
- 📈 Comprehensive Reports: Generate detailed reports with scores, breakdowns, and visual comparisons
- 💰 API Usage & Cost Tracking: Monitor token usage and estimate expenses per run
- 📦 Extensible Design: Easily add models, evaluation logic, test cases, or new scoring dimensions
Project layout:

```
modelus/
├── clients/              # Provider-specific model wrappers
├── configs/
│   ├── models/           # YAML configs per provider (e.g. openai.yaml)
│   └── test_config.yaml  # Main test plan
├── evaluators/           # Modular metric evaluators
├── reporting/            # Report generation & visualization
├── test_cases/           # JSON-based test definitions
├── utils/                # Logging, costs, file ops
├── client_factory.py     # Client abstraction layer
├── test_executor.py      # Main test orchestrator
├── test_integration.py   # CLI integration runner
└── README.md             # This file
```
Clone the repo and install the dependencies:

```bash
git clone https://github.com/yourusername/modelus
cd modelus
pip install -r requirements.txt
```

Then set your `.env` with the required API keys:

```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
COHERE_API_KEY=...
```
Run the full test suite:

```bash
python -m foundation_model_testing
```
Run a subset of models and test categories:

```bash
python -m foundation_model_testing \
  --models gpt-4o claude-3-opus mistral-large \
  --categories content formatting
```

List supported models or tests:

```bash
python -m foundation_model_testing --list-models
python -m foundation_model_testing --list-tests
```

Use a custom config and export directory:

```bash
python -m foundation_model_testing \
  --config my_config.yaml \
  --output-dir results/my_run
```
All model and test definitions are YAML-driven. For example, `configs/test_config.yaml`:
```yaml
selected_models:
  - gpt-4o
  - claude-3-opus
  - gemini-1.5-pro
  - mistral-large
  - llama-3-70b

categories:
  - content
  - formatting
  - consistency
  - creativity
  - efficiency
  - personalization
  - ppt

metrics:
  primary: "overall_score"
  weights:
    content: 0.25
    formatting: 0.15
    consistency: 0.15
    creativity: 0.15
    efficiency: 0.15
    personalization: 0.10
    ppt: 0.05
```
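For reference, `overall_score` is presumably just the weighted combination of per-category scores using the weights above. A minimal sketch of that aggregation (the framework's actual logic lives in `evaluators/` and `reporting/` and may differ):

```python
# Illustrative weighted-average combination; not the framework's actual code.
weights = {
    "content": 0.25, "formatting": 0.15, "consistency": 0.15,
    "creativity": 0.15, "efficiency": 0.15, "personalization": 0.10, "ppt": 0.05,
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Combine per-category scores (0-1) into a single weighted score."""
    return sum(weights[c] * category_scores.get(c, 0.0) for c in weights)
```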
After each run, you’ll get a full report (Markdown, JSON, CSV) under `results/reports/`.
This includes:
- Overall and per-category scores
- Cost summaries (tokens + estimated $)
- Heatmaps, radar charts, and ranking tables
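If you want to post-process results yourself, the JSON report can be loaded directly. A minimal sketch, assuming one entry per model with an `overall_score` and a cost field; the file name and keys here are illustrative, not the actual report schema:

```python
# Illustrative only: the report file name and keys are assumptions,
# not the framework's documented schema.
import json
from pathlib import Path

report_path = Path("results/reports") / "latest_report.json"  # hypothetical name
report = json.loads(report_path.read_text())

# Quick per-model summary.
for model, result in report.items():
    print(model, result.get("overall_score"), result.get("estimated_cost_usd"))
```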
To add a new model provider:

- Add a config in `configs/models/yourprovider.yaml`
- Create `YourProviderClient` in `clients/`
- Register it in `client_factory.py` (see the sketch after this list)
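A minimal sketch of what a new client might look like, assuming a simple `generate()`-style interface that returns text plus token usage for cost tracking. The class name, method names, endpoint, and env var are all illustrative; check `clients/` and `client_factory.py` for the real interface:

```python
# Hypothetical provider client -- names, endpoint, and return shape are
# illustrative, not the framework's actual interface.
import os
import requests


class YourProviderClient:
    """Minimal wrapper around a hypothetical provider's completion API."""

    def __init__(self, model: str, api_key: str | None = None):
        self.model = model
        self.api_key = api_key or os.environ["YOURPROVIDER_API_KEY"]

    def generate(self, prompt: str, **kwargs) -> dict:
        """Send a prompt and return text plus token usage for cost tracking."""
        response = requests.post(
            "https://api.yourprovider.example/v1/chat",  # placeholder endpoint
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model, "prompt": prompt, **kwargs},
            timeout=60,
        )
        response.raise_for_status()
        data = response.json()
        return {
            "text": data["output"],
            "input_tokens": data["usage"]["input_tokens"],
            "output_tokens": data["usage"]["output_tokens"],
        }
```

Then map the provider/model name to the new class in `client_factory.py` so the test executor can instantiate it from the YAML config.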
To add a new evaluation metric:

- Create a new evaluator class in `evaluators/`
- Inherit from `BaseEvaluator`
- Register it in `evaluator_factory.py` (a sketch follows below)
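A minimal sketch of a custom evaluator, assuming `BaseEvaluator` exposes an `evaluate(test_case, model_output)` hook returning a 0–1 score; the import path, method signature, and `category` attribute are assumptions about the framework's interface:

```python
# Hypothetical evaluator -- the BaseEvaluator import path, method signature,
# and `category` attribute are assumptions.
from evaluators.base_evaluator import BaseEvaluator  # assumed module path


class ConcisenessEvaluator(BaseEvaluator):
    """Scores how close the output length is to a target word count."""

    category = "conciseness"  # used for grouping/weighting in reports (assumed)

    def evaluate(self, test_case: dict, model_output: str) -> float:
        target = test_case.get("target_words", 150)
        actual = len(model_output.split())
        # Score decays linearly as the output drifts from the target length.
        return max(0.0, 1.0 - abs(actual - target) / target)
```

Registering the class in `evaluator_factory.py` (and adding its category and weight to the config) would then let it appear in reports alongside the built-in metrics.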
To add a new test case, drop a JSON file into any folder in `test_cases/` with:
```json
{
  "name": "Test Presentation Bullet Structuring",
  "input": "Create a 5-slide deck about quantum computing",
  "expected_structure": ["intro", "history", "applications", "challenges", "future"],
  "category": "content"
}
```
- Python 3.10+
- Dependencies listed in `requirements.txt` and `environment.yaml`
MIT License — use it, fork it, break it, build it better.
Modelus was designed to be pragmatic and modular — test ideas fast, compare cleanly, and build your own flavors of evaluation.
If you want help integrating new types of tasks (code gen, doc QA, summarization), ping me.