The-Martyr/Awesome-Multimodal-Reasoning

Awesome-Multimodal-Reasoning

This is a repository for organizing papers related to multimodal reasoning in Multimodal Large Language Models (image, video, and audio).

As the perception capabilities (visual and audio) and reasoning capabilities (often powered by reinforcement learning) of multimodal large language models (MLLMs/LVLMs/LSLMs) continue to develop, researchers have high hopes for the multimodal reasoning abilities of these models.

This repo also collects papers on visual generation (image and video generation) with RL/CoT.

⭐ If you find this list useful, please star it!

Paper List (Updating...)

Survey

(8 May 2025) Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models arXiv

(30 Apr 2025) Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models arXiv

(4 Apr 2025) Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning arXiv

(18 Mar 2025) Aligning Multimodal LLM with Human Preference: A Survey arXiv

(16 Mar 2025) Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey arXiv

API

https://yuewen.cn/chats/new

Image Reasoning

(30 Jul 2025) MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention arXiv

(28 Jul 2025) Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback arXiv

(24 Jul 2025) MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning arXiv

(24 Jul 2025) SafeWork-R1: Coevolving Safety and Intelligence under the AI-45 Law arXiv

(22 Jul 2025) C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning arXiv

(22 Jul 2025) Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning arXiv

(11 Jul 2025) M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning arXiv

(3 Jul 2025) Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation arXiv

(1 Jul 2025) GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning arXiv

(20 Jun 2025) GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning arXiv

(16 Jun 2025) Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning arXiv

(11 Jun 2025) ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs arXiv

(5 Jun 2025) Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning arXiv

(5 Jun 2025) Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos arXiv

(5 Jun 2025) MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning arXiv

(16 May 2025) Visual Planning: Let's Think Only with Images arXiv

(15 May 2025) MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning arXiv

(13 May 2025) OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning arXiv

(12 May 2025) Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning arXiv

(8 May 2025) Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging arXiv

(8 May 2025) SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models arXiv

(6 May 2025) X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains arXiv

(6 May 2025) Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning arXiv

(6 May 2025) ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant arXiv

(5 May 2025) R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning arXiv

(28 Apr 2025) SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning arXiv

(25 Apr 2025) Fast-Slow Thinking for Large Vision-Language Model Reasoning arXiv

(25 Apr 2025) Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization arXiv

(25 Apr 2025) Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning arXiv

(21 Apr 2025) A Call for New Recipes to Enhance Spatial Reasoning in MLLMs arXiv

(20 Apr 2025) Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension arXiv

(12 Apr 2025) VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search arXiv

(10 Apr 2025) VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model arXiv

(10 Apr 2025) SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement arXiv

(10 Apr 2025) Perception-R1: Pioneering Perception Policy with Reinforcement Learning arXiv

(10 Apr 2025) Kimi-VL Technical Report arXiv

(8 Apr 2025) On the Suitability of Reinforcement Fine-Tuning to Visual Tasks arXiv

(8 Apr 2025) Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought arXiv

(1 Apr 2025) Improved Visual-Spatial Reasoning via R1-Zero-Like Training arXiv

(17 Mar 2025) R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization arXiv

(13 Mar 2025) VisualPRM: An Effective Process Reward Model for Multimodal Reasoning arXiv

(9 Mar 2025) Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models arXiv

(7 Mar 2025) R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning arXiv

(7 Mar 2025) Unified Reward Model for Multimodal Understanding and Generation arXiv

(7 Mar 2025) R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model arXiv

(3 Mar 2025) Visual-RFT: Visual Reinforcement Fine-Tuning arXiv

(4 Feb 2025) Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking arXiv

(13 Jan 2025) Imagine while Reasoning in Space: Multimodal Visualization-of-Thought arXiv

(10 Jan 2025) LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs arXiv

(9 Jan 2025) Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark arXiv

(3 Jan 2025) Virgo: A Preliminary Exploration on Reproducing o1-like MLLM arXiv

(30 Dec 2024) Slow Perception: Let's Perceive Geometric Figures Step-by-step arXiv

(19 Dec 2024) Progressive Multimodal Reasoning via Active Retrieval arXiv

(29 Nov 2024) Interleaved-Modal Chain-of-Thought arXiv

(15 Nov 2024) Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination arXiv

(15 Nov 2024) LLaVA-CoT: Let Vision Language Models Reason Step-by-Step arXiv

(30 Oct 2024) Vision-Language Models Can Self-Improve Reasoning via Reflection arXiv

(23 Oct 2024) R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models arXiv

(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning arXiv

(11 Oct 2024) M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought arXiv

(6 Oct 2024) MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration arXiv

(4 Oct 2024) Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning arXiv

(29 Sep 2024) CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought arXiv

(13 Jun 2024) Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models arXiv

(28 Dec 2023) Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos arXiv

(14 Dec 2023) Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models arXiv

(27 Nov 2023) Compositional Chain-of-Thought Prompting for Large Multimodal Models arXiv

(15 Nov 2023) The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task arXiv

(3 May 2023) Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings arXiv

(16 Apr 2023) Chain of Thought Prompt Tuning in Vision Language Models arXiv

(2 Feb 2023) Multimodal Chain-of-Thought Reasoning in Language Models arXiv

Video

(12 Jun 2025) CogStream: Context-guided Streaming Video Question Answering arXiv

(6 Jun 2025) VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning arXiv

(27 Mar 2025) Video-R1: Reinforcing Video Reasoning in MLLMs arXiv

(17 Feb 2025) video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model arXiv

(10 Feb 2025) CoS: Chain-of-Shot Prompting for Long Video Understanding arXiv

(8 Jan 2025) Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs arXiv

(3 Dec 2024) VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation arXiv

(2 Dec 2024) Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation arXiv

(29 Nov 2024) STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training arXiv

(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning arXiv

(12 Oct 2024) Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning arXiv

(8 Oct 2024) Temporal Reasoning Transfer from Text to Video arXiv

(27 Sep 2024) Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks arXiv

(28 Aug 2024) Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation arXiv

(24 May 2024) Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models arXiv

(7 May 2024) Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition arXiv code

Audio

(22 Jul 2025) Step-Audio 2 Technical Report arXiv

(14 Mar 2025) Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering arXiv

Image/Video Generation

(28 Jul 2025) Multimodal LLMs as Customized Reward Models for Text-to-Image Generation arXiv

(20 Jun 2025) RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought arXiv

(17 Jun 2025) SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks arXiv

(16 May 2025) Towards Self-Improvement of Diffusion Models via Group Preference Optimization arXiv

(16 May 2025) Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models arXiv

(15 May 2025) Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models arXiv

(12 May 2025) DanceGRPO: Unleashing GRPO on Visual Generation arXiv

(8 May 2025) Flow-GRPO: Training Flow Matching Models via Online RL arXiv

(1 May 2025) T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT arXiv

(22 Apr 2025) From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning arXiv

(22 Apr 2025) Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning arXiv

(26 Mar 2025) MMGen: Unified Multi-modal Image Generation and Understanding in One Go arXiv

(13 Mar 2025) GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing arXiv

(3 Mar 2025) MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation arXiv

(23 Jan 2025) Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step arXiv

Bench/Dataset

(22 Jul 2025) ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering arXiv

(22 Jul 2025) Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning arXiv

(12 Jun 2025) VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos arXiv

(12 Jun 2025) MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning arXiv

(6 Jun 2025) PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts arXiv

(5 Jun 2025) VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos arXiv

(5 Jun 2025) MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark arXiv

(15 May 2025) StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation arXiv

(13 May 2025) VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models arXiv

(1 May 2025) MINERVA: Evaluating Complex Video Reasoning arXiv

(30 Apr 2025) GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling arXiv

(21 Apr 2025) IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs arXiv

(21 Apr 2025) VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models arXiv

(17 Apr 2025) Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark arXiv

(16 Apr 2025) FLIP Reasoning Challenge arXiv

(14 Apr 2025) VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge arXiv

(8 Apr 2025) ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering arXiv

(8 Apr 2025) V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models arXiv

(8 Apr 2025) MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models arXiv

(4 Apr 2025) Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme arXiv

(15 Feb 2025) SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding arXiv

(14 Feb 2025) MM-RLHF: The Next Step Forward in Multimodal LLM Alignment arXiv

(13 Feb 2025) MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency arXiv

(18 Dec 2024) Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces arXiv

(22 Nov 2024) VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection arXiv code

(18 Oct 2024) MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps arXiv

(7 Jul 2024) VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool arXiv

(20 Jun 2024) MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding arXiv

(12 Jun 2024) LVBench: An Extreme Long Video Understanding Benchmark arXiv

(24 Apr 2024) Cantor: Inspiring Multimodal Chain-of-Thought of MLLM arXiv

(16 Apr 2024) OpenEQA: Embodied Question Answering in the Era of Foundation Models arXiv

(17 Aug 2023) EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding arXiv

(23 May 2023) Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought arXiv

(18 May 2021) NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions arXiv

Latent

(12 Feb 2025) Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning arXiv

(7 Feb 2025) Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach arXiv

(9 Dec 2024) Training Large Language Models to Reason in a Continuous Latent Space arXiv

Open Source Projects

https://github.com/Hui-design/Open-LLaVA-Video-R1

https://github.com/SkyworkAI/Skywork-R1V

https://huggingface.co/papers/2503.05379

https://github.com/Osilly/Vision-R1

https://github.com/ModalMinds/MM-EUREKA

https://github.com/OpenRLHF/OpenRLHF-M

https://github.com/Fancy-MLLM/R1-Onevision

https://github.com/om-ai-lab/VLM-R1

https://github.com/EvolvingLMMs-Lab/open-r1-multimodal

https://github.com/Deep-Agent/R1-V

https://github.com/TideDra/lmm-r1

https://github.com/tulerfeng/Video-R1

https://github.com/Wang-Xiaodong1899/Open-R1-Video