Hongbo Liu1, 3*, Jingwen He2, 3*, Yi Jin1, Dian Zheng3, Yuhao Dong4, Fan Zhang3, Ziqi Huang4, Yinan He3, Yangguang Li3, Weichao Chen1, Yu Qiao3, Wanli Ouyang2, Shengjie Zhao1†, Ziwei Liu4†
(* equal contributions) († corresponding authors)
1 Tongji University
2 The Chinese University of Hong Kong
3 Shanghai Artificial Intelligence Laboratory
4 S-Lab, Nanyang Technological University
- We introduce ShotBench, a comprehensive benchmark for evaluating VLMs’ understanding of cinematic language. It comprises over 3.5k expert-annotated QA pairs derived from images and video clips of over 200 critically acclaimed films (predominantly Oscar-nominated), covering eight distinct cinematography dimensions. This provides a rigorous new standard for assessing fine-grained visual comprehension in film.
- We conducted an extensive evaluation of 24 leading VLMs, including prominent open-source and proprietary models, on ShotBench. Our results reveal a critical performance gap: even the most capable model, GPT-4o, achieves less than 60% average accuracy. This systematically quantifies the current limitations of VLMs in genuine cinematographic comprehension.
- To address the identified limitations and facilitate future research, we constructed ShotQA, the first large-scale multimodal dataset for cinematography understanding, containing approximately 70k high-quality QA pairs. Leveraging ShotQA, we developed ShotVL, a novel VLM trained using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). ShotVL significantly surpasses all tested open-source and proprietary models, establishing a new state-of-the-art on ShotBench.
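For context on the GRPO stage mentioned above, the sketch below restates the generic group-relative advantage from the original GRPO formulation; ShotVL’s specific reward design and training hyperparameters are described in the paper, not here. For each question, a group of $G$ candidate responses is sampled from the current policy and scored with rewards $r_1,\dots,r_G$, and each reward is normalized within its group:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}$$

These group-relative advantages are then optimized with a PPO-style clipped surrogate objective (optionally regularized by a KL penalty toward a reference policy), which removes the need for a separate learned value model.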
- 2025-09-12 🎉🎉 We’re excited to announce a major upgrade to ShotVL-7B-v1.1! Its open-ended question answering capabilities have been significantly enhanced through training on our expanded ShotQA-V1.1 dataset.
- 2025-07-07 Release evaluation code.
- 2025-07-02 Release ShotQA-70k dataset.
- 2025-06-27 Release ShotBench test split.
- 2025-06-27 Release our paper: ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models.
- 2025-06-27 Release ShotVL-7B and ShotVL-3B, currently the SOTA VLMs for cinematography understanding.
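For a quick local sanity check of the released checkpoints (separate from the benchmark evaluation pipeline below), here is a minimal single-image inference sketch. It assumes the weights are hosted on Hugging Face under an ID such as `Vchitect/ShotVL-7B` (a placeholder; substitute the actual repository name) and that ShotVL exposes a Qwen2.5-VL-style chat interface; the image path and question are placeholders as well.

```python
# Minimal single-image inference sketch.
# Assumptions: the model ID and the Qwen2.5-VL-style interface below are
# not confirmed by this README -- adjust them to the released checkpoints.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "Vchitect/ShotVL-7B"  # hypothetical Hugging Face repo ID

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single-image cinematography question (placeholder path and prompt).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/frame.jpg"},
        {"type": "text", "text": "What shot size is used in this frame?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```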
conda create -n shotbench python=3.10
conda activate shotbench
pip install -r requirements.txt
mkdir -p evaluation/data && cd evaluation/data
huggingface-cli download --repo-type dataset Vchitect/ShotBench --local-dir ShotBench
cd ShotBench
tar -xvf images.tar
tar -xvf videos.tar
cd ../../../
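As an optional sanity check after extraction, the short sketch below counts the unpacked media files. The `images/` and `videos/` directory names are assumptions based on the archive names; adjust the paths if the actual layout differs.

```python
# Optional sanity check: count the extracted ShotBench media files.
# The images/ and videos/ subdirectory names are assumptions based on
# the tar file names -- adjust if the unpacked layout differs.
from pathlib import Path

root = Path("evaluation/data/ShotBench")
for sub in ("images", "videos"):
    folder = root / sub
    count = sum(1 for p in folder.rglob("*") if p.is_file()) if folder.exists() else 0
    print(f"{sub}: {count} files")
```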
Evaluate ShotVL-3B with 4 GPUs:
accelerate launch --num_processes 4 evaluation/shotvl/evaluate.py --model ShotVL-3B --reasoning --output-dir eval_results
Evaluate ShotVL-7B with 4 GPUs:
accelerate launch --num_processes 4 evaluation/shotvl/evaluate.py --model ShotVL-7B --reasoning --output-dir eval_results
OPENAI_API_KEY=YOUR_OPENAI_APIKEY python evaluation/calculate_scores.py --prediction_path OUTPUT_FILE_PATH
Our ShotVL models establish a new state of the art on ShotBench.
- Release training code.
- Release evaluation code.
- Release ShotQA-70k dataset.
- Release ShotBench test set.
- Release ShotVL models.
@misc{liu2025shotbench,
title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models},
author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
year={2025},
eprint={2506.21356},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.21356},
}