T2IScoreScore is a framework for evaluating text-to-image model evaluation metrics. The framework provides tools to:

- Apply reference implementations of various classes of text-to-image metrics through a consistent API (see the Python sketch after this list):
  - Correlation-based metrics (CLIPScore)
  - Likelihood-based metrics
  - Visual question-answering metrics
- Run these metrics on the T2IScoreScore dataset of semantic error graphs:
  `ts2 evaluate CLIPScore --device cuda`
- Compute metametrics that characterize how well a T2I metric orders the images at each node along walks of increasing error count:
  `ts2 compute CLIPScore spearman kstest`
- Generate visualizations and reports to analyze metric performance across error types and image sources.
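Beyond the CLI, the metrics can presumably be called directly from Python. The sketch below shows the intended shape of that usage: the `calculate_score()` name comes from the contributing notes at the end of this README, while the constructor arguments and call signature are assumptions rather than confirmed API.

```python
# Hypothetical direct usage of a T2IMetrics metric from Python.
# calculate_score() is named in the contributing notes below; the exact
# constructor arguments and argument order here are assumptions.
from T2IMetrics import CLIPScore

metric = CLIPScore(device="cuda")
score = metric.calculate_score("images/example.png", "a photo of a dog on a beach")
print(score)
```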
Install the package from source:

```
git clone https://github.com/michaelsaxon/T2IScoreScore.git
cd T2IScoreScore
pip install -e .
```
The source tree is organized as follows:

```
src/
├── T2IMetrics/       # Collection of text-to-image evaluation metrics
└── T2IScoreScore/    # Framework for evaluating metrics
    ├── evaluators/   # Metametric implementations (Spearman, KS test, etc.)
    ├── figures/      # Visualization utilities
    └── run/          # CLI entry points
```
The package provides two main commands through the `ts2` CLI.
Run a metric on the T2IScoreScore dataset:

```
# Basic usage
ts2 evaluate CLIPScore

# With options
ts2 evaluate TIFAScore --device cuda --output results/
```
Analyze metric performance using metametrics:

```
# Run all default evaluators (spearman, kstest, delta)
ts2 compute CLIPScore

# Run specific evaluators
ts2 compute TIFAScore spearman kendall
```
Results are saved in the following structure:

```
output/
├── scores/                     # Raw metric scores
│   └── metric_scores.csv
└── metametrics/                # Metametric results
    ├── metric.csv              # Per-example results
    └── metric_averages.csv     # Partition averages
```
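A quick way to inspect these outputs is to load the CSVs with pandas. The file names below follow the tree above literally; in practice they may be named after the metric you ran, and no particular column schema is assumed here.

```python
# Sketch: load T2IScoreScore output CSVs for inspection.
# Paths mirror the output tree above; adjust them if your files are named
# after the metric you evaluated.
import pandas as pd

scores = pd.read_csv("output/scores/metric_scores.csv")           # raw per-image scores
per_example = pd.read_csv("output/metametrics/metric.csv")        # per-example metametric results
averages = pd.read_csv("output/metametrics/metric_averages.csv")  # partition averages

print(scores.head())
print(averages.head())
```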
To add a new metric (see the sketch below):

- Create a new class in `T2IMetrics` that inherits from `BaseMetric`
- Implement the `calculate_score()` method
- Register it in `T2IMetrics/__init__.py`
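A minimal sketch of such a metric class, under stated assumptions: `BaseMetric` and `calculate_score()` come from the steps above, but the import path, constructor, and method signature are guesses at the framework's interface.

```python
# Hypothetical new metric. BaseMetric and calculate_score() are named in the
# steps above; the import path and the (image, prompt) signature are assumptions.
import random

from T2IMetrics import BaseMetric


class RandomScore(BaseMetric):
    """Toy metric that assigns a random score to every image-prompt pair."""

    def calculate_score(self, image, prompt):
        # A real metric would compare the image against the prompt here.
        return random.random()
```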
To add a new metametric (see the sketch below):

- Create a new class in `T2IScoreScore/evaluators` that inherits from `MetricEvaluator`
- Implement the required evaluation methods
- Register it in `evaluators/__init__.py`
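The evaluator interface is only described as "required evaluation methods" above, so the following is purely illustrative: a Kendall-tau evaluator built on scipy, with a hypothetical `evaluate()` method standing in for whatever the real interface expects.

```python
# Illustrative metametric sketch. MetricEvaluator is named in the steps above;
# the evaluate() method and its (scores, error_counts) arguments are
# hypothetical stand-ins for the framework's actual interface.
from scipy.stats import kendalltau

from T2IScoreScore.evaluators import MetricEvaluator


class KendallEvaluator(MetricEvaluator):
    """Rank correlation between metric scores and error counts along a walk."""

    def evaluate(self, scores, error_counts):
        # A well-behaved metric should decrease as errors accumulate, so a
        # strongly negative tau indicates good ordering behavior.
        tau, p_value = kendalltau(scores, error_counts)
        return {"kendall_tau": tau, "p_value": p_value}
```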