Hongbo Liu1, 3*, Jingwen He2, 3*, Yi Jin1, Dian Zheng3, Yuhao Dong4, Fan Zhang3, Ziqi Huang4, Yinan He3, Yangguang Li3, Weichao Chen1, Yu Qiao3, Wanli Ouyang2, Shengjie Zhao1†, Ziwei Liu4†
(* equal contributions) († corresponding authors)
1 Tongji University
2 The Chinese University of Hong Kong
3 Shanghai Artificial Intelligence Laboratory
4 S-Lab, Nanyang Technological University
- We introduce ShotBench, a comprehensive benchmark for evaluating VLMs’ understanding of cinematic language. It comprises over 3.5k expert-annotated QA pairs derived from images and video clips of over 200 critically acclaimed films (predominantly Oscar-nominated), covering eight distinct cinematography dimensions. This provides a rigorous new standard for assessing fine-grained visual comprehension in film.
- We conducted an extensive evaluation of 24 leading VLMs, including prominent open-source and proprietary models, on ShotBench. Our results reveal a critical performance gap: even the most capable model, GPT-4o, achieves less than 60% average accuracy. This systematically quantifies the current limitations of VLMs in genuine cinematographic comprehension.
- To address the identified limitations and facilitate future research, we constructed ShotQA, the first large-scale multimodal dataset for cinematography understanding, containing approximately 70k high-quality QA pairs. Leveraging ShotQA, we developed ShotVL, a novel VLM trained using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). ShotVL significantly surpasses all tested open-source and proprietary models, establishing a new state-of-the-art on ShotBench.
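For context on the GRPO stage mentioned above, the sketch below restates the generic group-relative advantage from the original GRPO formulation; ShotVL’s specific reward design and training hyperparameters are described in the paper, not here. For each question, a group of $G$ candidate responses is sampled from the current policy and scored with rewards $r_1,\dots,r_G$, and each reward is normalized within its group:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}$$

These group-relative advantages are then optimized with a PPO-style clipped surrogate objective (optionally regularized by a KL penalty toward a reference policy), which removes the need for a separate learned value model.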
- 2025-09-12 🎉🎉 We’re excited to announce a major upgrade to ShotVL-7B-v1.1! Its open-ended question answering capabilities have been significantly enhanced through training on our expanded ShotQA-V1.1 dataset.
- 2025-07-07 Release evaluation code.
- 2025-07-02 Release ShotQA-70k dataset.
- 2025-06-27 Release ShotBench test split.
- 2025-06-27 Release our paper: ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models.
- 2025-06-27 Release ShotVL-7B and ShotVL-3B, currently the SOTA VLMs for cinematography understanding.
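For a quick local sanity check of the released checkpoints (separate from the benchmark evaluation pipeline below), here is a minimal single-image inference sketch. It assumes the weights are hosted on Hugging Face under an ID such as `Vchitect/ShotVL-7B` (a placeholder; substitute the actual repository name) and that ShotVL exposes a Qwen2.5-VL-style chat interface; the image path and question are placeholders as well.

```python
# Minimal single-image inference sketch.
# Assumptions: the model ID and the Qwen2.5-VL-style interface below are
# not confirmed by this README -- adjust them to the released checkpoints.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "Vchitect/ShotVL-7B"  # hypothetical Hugging Face repo ID

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single-image cinematography question (placeholder path and prompt).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/frame.jpg"},
        {"type": "text", "text": "What shot size is used in this frame?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```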
conda create -n shotbench python=3.10
conda activate shotbench
pip install -r requirements.txt
mkdir -p evaluation/data && cd evaluation/data
huggingface-cli download --repo-type dataset Vchitect/ShotBench --local-dir ShotBench
cd ShotBench
tar -xvf images.tar
tar -xvf videos.tar
cd ../../../
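As an optional sanity check after extraction, the short sketch below counts the unpacked media files. The `images/` and `videos/` directory names are assumptions based on the archive names; adjust the paths if the actual layout differs.

```python
# Optional sanity check: count the extracted ShotBench media files.
# The images/ and videos/ subdirectory names are assumptions based on
# the tar file names -- adjust if the unpacked layout differs.
from pathlib import Path

root = Path("evaluation/data/ShotBench")
for sub in ("images", "videos"):
    folder = root / sub
    count = sum(1 for p in folder.rglob("*") if p.is_file()) if folder.exists() else 0
    print(f"{sub}: {count} files")
```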
Evaluate ShotVL-3B with 4 GPUs:
accelerate launch --num_processes 4 evaluation/shotvl/evaluate.py --model ShotVL-3B --reasoning --output-dir eval_results
Evaluate ShotVL-7B with 4 GPUs:
accelerate launch --num_processes 4 evaluation/shotvl/evaluate.py --model ShotVL-7B --reasoning --output-dir eval_results
OPENAI_API_KEY=YOUR_OPENAI_APIKEY python evaluation/calculate_scores.py --prediction_path OUTPUT_FILE_PATH
Our ShotVL models establish a new state of the art on ShotBench.
- Release training code.
- Release evaluation code.
- Release ShotQA-70k dataset.
- Release ShotBench test set.
- Release ShotVL models.
@misc{liu2025shotbench,
title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models},
author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
year={2025},
eprint={2506.21356},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.21356},
}