Compare and benchmark image-to-text models from OpenAI and AWS Bedrock on the XTD10 dataset—measure accuracy, latency, and cost in one place.
- **Automatic dataset setup**: Downloads and extracts the XTD10 multilingual image corpus.
- **Multi-model captioning**: Generates captions using OpenAI GPT-4o variants and AWS Bedrock Nova Lite/Pro.
- **LLM-based evaluation**: Scores generated captions against ground truth via a judge LLM.
- **Comprehensive metrics**: Aggregates accuracy, latency, and cost; exports results as CSV.
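The judge-LLM step can be sketched roughly as below. The prompt wording, the `build_judge_prompt` and `parse_score` helpers, and the 0–100 scale are illustrative assumptions, not the repository's actual code; the judge call itself is left to your LLM client.

```python
import re


def build_judge_prompt(generated: str, reference: str) -> str:
    """Assemble a prompt asking a judge LLM to grade a caption 0-100."""
    return (
        "You are grading an image caption against a reference.\n"
        f"Reference caption: {reference}\n"
        f"Generated caption: {generated}\n"
        "Reply with a single integer from 0 (unrelated) to 100 (equivalent)."
    )


def parse_score(reply: str) -> float:
    """Extract the first integer from the judge's reply, normalized to 0-1."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"No score found in judge reply: {reply!r}")
    return min(int(match.group()), 100) / 100.0
```

Parsing defensively (first integer, clamped to 100) keeps the pipeline robust when the judge model wraps its score in extra text.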
- Python 3.8+
- OpenAI API key: set `OPENAI_API_KEY`
- AWS credentials with Bedrock access: set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (and `AWS_SESSION_TOKEN` if required)
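For example, the credentials can be exported in your shell before running the evaluator (placeholder values shown; substitute your own):

```shell
export OPENAI_API_KEY="sk-..."          # your OpenAI API key
export AWS_ACCESS_KEY_ID="AKIA..."      # AWS key with Bedrock access
export AWS_SECRET_ACCESS_KEY="..."      # matching AWS secret key
export AWS_SESSION_TOKEN="..."          # only if using temporary credentials
```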
```bash
git clone https://github.com/tavily-ai/image-caption-evaluator.git
cd image-caption-evaluator
pip install -r requirements.txt
python run_evaluation.py
```
The script will:
- Download and extract images (if needed)
- Fetch captions for the chosen language
- Generate and evaluate captions across all models
- Save `results.csv` with per-image metrics
A CSV with columns:

| image_filename | model | similarity_score | latency | cost_usd | … |
|---|---|---|---|---|---|
Use your favorite plotting library to visualize trade-offs.
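As a starting point, the per-model trade-offs can be summarized with pandas before plotting. The tiny synthetic frame below stands in for `results.csv`; its column names follow the schema above, but the numbers are made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for results.csv (made-up numbers, schema as documented).
df = pd.DataFrame(
    {
        "image_filename": ["a.jpg", "b.jpg", "a.jpg", "b.jpg"],
        "model": ["gpt-4o", "gpt-4o", "nova-lite", "nova-lite"],
        "similarity_score": [0.91, 0.88, 0.79, 0.83],
        "latency": [1.4, 1.6, 0.9, 1.1],
        "cost_usd": [0.004, 0.004, 0.001, 0.001],
    }
)

# One row per model: mean quality, mean latency, and total spend.
summary = (
    df.groupby("model")
    .agg(
        mean_similarity=("similarity_score", "mean"),
        mean_latency=("latency", "mean"),
        total_cost=("cost_usd", "sum"),
    )
    .reset_index()
)
print(summary)
```

Swap the synthetic frame for `pd.read_csv("results.csv")` and feed `summary` to your plotting library of choice, e.g. a scatter of `mean_latency` against `mean_similarity`.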
- Fork the repo
- Create a feature branch
- Submit a PR
Ideas welcome:
- Add new LLM providers
- Support batching or async evaluation
- Extend to other vision-language tasks
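One shape the batching/async idea could take is fanning out every (model, image) pair with `asyncio.gather`. The `caption_image` stub below is hypothetical; a real contribution would replace its body with the actual API request.

```python
import asyncio


async def caption_image(model: str, image: str) -> str:
    """Stand-in for a real captioning call; replace with an API request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"{model} caption for {image}"


async def caption_all(models, images):
    """Run every (model, image) pair concurrently; gather preserves order."""
    tasks = [caption_image(m, img) for m in models for img in images]
    return await asyncio.gather(*tasks)


results = asyncio.run(caption_all(["gpt-4o", "nova-lite"], ["a.jpg"]))
print(results)
```

Because the calls are I/O-bound, concurrency like this shrinks wall-clock time roughly to that of the slowest single request per batch.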
Questions or custom integrations? Reach out to Tomer Weiss:
- Email:
- Tomer Weiss - Data Scientist @ Tavily
- Eyal Ben Barouch - Head of Data @ Tavily

Powered by Tavily — The web API built for AI agents