VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos


MBZUAI, University of California Merced, Google Research, Australian National University, Linköping University

Paper | Website | Dataset | 🏅 Leaderboard (Reasoning) | 🏅 Leaderboard (Direct) | 📊 Eval (LMMs-Eval)


📣 Announcement

The official evaluation for VideoMathQA is supported in the lmms-eval framework. Please use this GitHub repository to create or track any issues you encounter with VideoMathQA so that we can assist you.
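
For reference, a typical lmms-eval run looks roughly like the sketch below. The task identifier, model backend, and checkpoint shown here are illustrative assumptions rather than the confirmed names for the official VideoMathQA tasks; consult the lmms-eval task and model registries for the exact identifiers.

  # Minimal sketch of an lmms-eval run; task and model names are assumptions,
  # not the confirmed identifiers for VideoMathQA.
  python -m lmms_eval \
      --model llava_onevision \
      --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov \
      --tasks videomathqa_mcq \
      --batch_size 1 \
      --log_samples \
      --output_path ./logs/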


💡 VideoMathQA

VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from three modalities (visuals, audio, and text) across time. The benchmark tackles the needle-in-a-multimodal-haystack problem, where key information is sparse and spread across different modalities and moments in the video.

Highlight Figure
The foundation of our benchmark is the needle-in-a-multimodal-haystack challenge, capturing the core difficulty of cross-modal reasoning across time from visual, textual, and audio streams. Built on this, VideoMathQA categorizes each question along four key dimensions: reasoning type, mathematical concept, video duration, and difficulty.


🔥 Highlights

  • Multimodal Reasoning Benchmark: VideoMathQA introduces a challenging needle-in-a-multimodal-haystack setup where models must reason across visuals, text, and audio. Key information is sparsely distributed across modalities and time, requiring strong fine-grained visual understanding, multimodal integration, and reasoning.

  • Three Types of Reasoning: Questions are categorized into: Problem Focused, where the question is explicitly stated and solvable via direct observation and reasoning from the video; Concept Transfer, where a demonstrated method or principle is adapted to a newly posed problem; Deep Instructional Comprehension, which requires understanding long-form instructional content, interpreting partially worked-out steps, and completing the solution.

  • Diverse Evaluation Dimensions: Each question is evaluated along four axes that capture diversity in content, length, complexity, and reasoning depth: mathematical concept, spanning 10 domains such as geometry, statistics, arithmetic, and charts; video duration, ranging from 10-second clips to hour-long videos, categorized as short, medium, or long; difficulty level; and reasoning type.

  • High-Quality Human Annotations: The benchmark includes 420 expert-curated questions, each with five answer choices, a correct answer, and detailed chain-of-thought (CoT) steps. Over 2,945 reasoning steps have been manually written, reflecting 920+ hours of expert annotation effort with rigorous quality control. A hypothetical sketch of a single annotated entry follows this list.

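As a purely hypothetical illustration, the Python sketch below mirrors the fields named in the highlights above (question, five answer choices, correct answer, CoT steps, and the four evaluation axes); the field names, values, and layout of the released dataset may differ.

  # Hypothetical sketch of a single VideoMathQA entry, based only on the fields
  # described above; the released dataset's actual schema may use different names.
  example_entry = {
      "video_id": "example_video_001",            # source educational video
      "question": "What is the area of the triangle drawn on the whiteboard?",
      "options": ["12", "18", "24", "30", "36"],  # five answer choices
      "answer": "24",                             # correct choice
      "reasoning_steps": [                        # expert-written CoT steps
          "Read the base (8) and height (6) written on the whiteboard.",
          "Apply area = 1/2 * base * height.",
          "Compute 1/2 * 8 * 6 = 24.",
      ],
      "reasoning_type": "Problem Focused",        # one of the three reasoning types
      "concept": "geometry",                      # one of 10 mathematical concepts
      "duration_category": "short",               # short / medium / long
      "difficulty": "easy",
  }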

📊 Overview and Analysis of VideoMathQA

🔍 Examples from the Benchmark

We present example questions from VideoMathQA illustrating the three reasoning types: Problem Focused, Concept Transfer, and Deep Comprehension. Each example includes the evolving dynamics of the video, a complex text prompt, five multiple-choice options, expert-annotated step-by-step reasoning for the given problem, and the final correct answer, as shown in the figure below.

Figure 1


📈 Overview of VideoMathQA

We illustrate an overview of the VideoMathQA benchmark through: a) the distribution of questions and model performance across ten mathematical concepts, which highlights a significant gap in current multimodal models' ability to perform mathematical reasoning over videos; b) the distribution of video durations, spanning short clips of 10 seconds to long videos of up to 1 hour; and c) our three-stage annotation pipeline, performed by expert science graduates who annotate detailed step-by-step reasoning trails, with strict quality assessment at each stage.

Figure 2


🎞️ Effect of Video Length, Subtitles, and Frame Count on Multimodal Reasoning

We analyze the performance of models on VideoMathQA under different settings: a) across video duration categories, b) with and without subtitles, and c) with varying numbers of input frames. We observe that models perform best on medium-length videos, and that overall accuracy improves with the inclusion of subtitles and more frames during evaluation.

Figure 3


⚠️ Understanding Model Limitations in VideoMathQA Reasoning

We conduct an in-depth analysis of model limitations in VideoMathQA. a) We compare vision-blind, image-only, and video models, showing the necessity of video-level understanding for success. b) We show the distribution of questions across three difficulty levels and varying reasoning depths, highlighting the correlation between difficulty and model performance. c) We analyze CoT-based error types, revealing that most errors stem from misinterpreting the question or missing crucial multimodal cues.

Figure 4


📜 Citation

  @article{rasheed2025videomathqa,
    title={VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos},
    author={Rasheed, Hanoona and Shaker, Abdelrahman and Tang, Anqi and Maaz, Muhammad and Yang, Ming-Hsuan and Khan, Salman and Khan, Fahad S.},
    journal={arXiv preprint arXiv:2506.05349},
    year={2025}
  }

🙏 Acknowledgement

We thank LMMs-Lab for their open-source contributions, particularly LMMs-Eval, which we used to evaluate models and which serves as the official toolkit for evaluating on our benchmark.

