Dataset from the paper: Understanding the World's Museums through Vision-Language Reasoning
MUSEUM-65 is a multi-modal dataset of 65M images paired with 200M question-answer pairs in multiple languages, collected to reflect the cultural diversity of museum collections.
The dataset covers 50M objects with questions in English and 15M objects with questions in other languages (French, Spanish, German, etc.).
The dataset is available on HuggingFace as Museum-65; a minimal loading sketch is given below the batch list.
The dataset contains:
- the first 52 batches = the 1M subset used in the experiments
- the first 473 batches = the 10M subset used in the experiments
- all 1721 batches = every image with English-language information
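The batches can be streamed with the standard `datasets` library. The sketch below is illustrative only: the repository id and record fields are assumptions, so check the dataset page for the exact values.

```python
# Minimal loading sketch, assuming the standard Hugging Face `datasets` layout.
# The repository id below is a placeholder -- use the exact id from the dataset page.
from datasets import load_dataset

museum65 = load_dataset(
    "INSAIT-Institute/Museum-65",  # placeholder repo id (assumption)
    split="train",
    streaming=True,  # avoids downloading all 1721 batches up front
)

# Inspect one record; field names are illustrative, not guaranteed.
first = next(iter(museum65))
print(first.keys())
```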
License: CC BY-NC 4.0 (cc-by-nc-4.0)
We introduce a comprehensive benchmark for MUSEUM-65 that evaluates both general and task-specific capabilities across several metrics. The benchmark provides a standardized framework for consistent comparison of methods on this dataset, with the aim of guiding future research towards effective models and identifying areas for improvement. It covers the following tasks (an illustrative scoring sketch follows the list):
- General VQA
- Category-wise VQA
- Multiple Angles
- Visually Unanswerable Questions
- Multiple Languages
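As an illustration only, the snippet below shows two common VQA measures (exact match and token-level F1) that can serve as a stand-in scorer; the benchmark's actual metrics are defined in the paper.

```python
# Illustrative VQA scoring helpers -- not the benchmark's official metrics,
# which are defined in the paper. Shown here: exact match and token-level F1.
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```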
In our experiments we use two models well known for VQA tasks, LLaVA and BLIP, fine-tuning them on our dataset and following their fine-tuning protocols where possible. For details on how to fine-tune these models, please refer to their respective GitHub repositories; a minimal fine-tuning sketch is given below.
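For orientation, here is a minimal fine-tuning sketch for BLIP on question-answer pairs using the Hugging Face `transformers` implementation. The checkpoint name, single-example step, and learning rate are illustrative assumptions, not the exact protocol used in the paper.

```python
# Minimal BLIP VQA fine-tuning sketch using Hugging Face `transformers`.
# Checkpoint, learning rate, and the single-example step are illustrative only.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()


def training_step(image: Image.Image, question: str, answer: str) -> float:
    """Run one gradient step on a single (image, question, answer) triple."""
    inputs = processor(images=image, text=question, return_tensors="pt")
    labels = processor(text=answer, return_tensors="pt").input_ids
    outputs = model(**inputs, labels=labels)  # returns a language-modeling loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

A real training run would batch examples, mask padding tokens in the labels, and schedule the learning rate; see the official LLaVA and BLIP repositories for the full protocols.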
Project realised with INSAIT - Institute for Computer Science, Artificial Intelligence and Technology, Sofia University St. Kliment Ohridski, Sofia, Bulgaria.
Please cite accordingly:
@misc{balauca2024understandingworldsmuseumsvisionlanguage,
      title={Understanding the World's Museums through Vision-Language Reasoning},
      author={Ada-Astrid Balauca and Sanjana Garai and Stefan Balauca and Rasesh Udayakumar Shetty and Naitik Agrawal and Dhwanil Subhashbhai Shah and Yuqian Fu and Xi Wang and Kristina Toutanova and Danda Pani Paudel and Luc Van Gool},
      year={2024},
      eprint={2412.01370},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.01370},
}