This is a repository for organizing papers related to Multimodal Reasoning in Multimodal Large Language Models (Image, Video).
As the visual (and audio) capabilities and the RL-powered reasoning capabilities of multimodal large language models (MLLMs/LVLMs/LSLMs) develop, researchers have high hopes for the multimodal reasoning abilities of these models.
This repo also collects papers on visual generation (image and video generation) with RL/CoT.
(8 May 2025) Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
(30 Apr 2025) Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models
(4 Apr 2025) Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning
(18 Mar 2025) Aligning Multimodal LLM with Human Preference: A Survey
(16 Mar 2025) Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
(30 Jul 2025) MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention
(28 Jul 2025) Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback
(24 Jul 2025) MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning
(24 Jul 2025) SafeWork-R1: Coevolving Safety and Intelligence under the AI-45 Law
(22 Jul 2025) C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
(22 Jul 2025) Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning
(11 Jul 2025) M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
(3 Jul 2025) Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
(1 Jul 2025) GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
(20 Jun 2025) GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
(16 Jun 2025) Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning
(11 Jun 2025) ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
(5 Jun 2025) Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning
(5 Jun 2025) Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos
(5 Jun 2025) MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
(16 May 2025) Visual Planning: Let's Think Only with Images
(15 May 2025) MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
(13 May 2025) OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
(12 May 2025) Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
(8 May 2025) Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
(8 May 2025) SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models
(6 May 2025) X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
(6 May 2025) Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
(6 May 2025) ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant
(5 May 2025) R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
(28 Apr 2025) SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
(25 Apr 2025) Fast-Slow Thinking for Large Vision-Language Model Reasoning
(25 Apr 2025) Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
(25 Apr 2025) Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
(21 Apr 2025) A Call for New Recipes to Enhance Spatial Reasoning in MLLMs
(20 Apr 2025) Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension
(12 Apr 2025) VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
(10 Apr 2025) VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
(10 Apr 2025) SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
(10 Apr 2025) Perception-R1: Pioneering Perception Policy with Reinforcement Learning
(10 Apr 2025) Kimi-VL Technical Report
(8 Apr 2025) On the Suitability of Reinforcement Fine-Tuning to Visual Tasks
(8 Apr 2025) Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
(1 Apr 2025) Improved Visual-Spatial Reasoning via R1-Zero-Like Training
(17 Mar 2025) R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
(13 Mar 2025) VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
(9 Mar 2025) Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
(7 Mar 2025) R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
(7 Mar 2025) Unified Reward Model for Multimodal Understanding and Generation
(7 Mar 2025) R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
(3 Mar 2025) Visual-RFT: Visual Reinforcement Fine-Tuning
(4 Feb 2025) Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
(3 Jan 2025) Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
(13 Jan 2025) Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
(10 Jan 2025) LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
(9 Jan 2025) Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
(30 Dec 2024) Slow Perception: Let's Perceive Geometric Figures Step-by-step
(19 Dec 2024) Progressive Multimodal Reasoning via Active Retrieval
(29 Nov 2024) Interleaved-Modal Chain-of-Thought
(15 Nov 2024) Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination
(15 Nov 2024) LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
(30 Oct 2024) Vision-Language Models Can Self-Improve Reasoning via Reflection
(23 Oct 2024) R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models
(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning
(11 Oct 2024) M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought
(6 Oct 2024) MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration
(4 Oct 2024) Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
(29 Sep 2024) CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought
(13 Jun 2024) Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
(28 Dec 2023) Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos
(14 Dec 2023) Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
(27 Nov 2023) Compositional Chain-of-Thought Prompting for Large Multimodal Models
(15 Nov 2023) The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task
(3 May 2023) Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings
(16 Apr 2023) Chain of Thought Prompt Tuning in Vision Language Models
(2 Feb 2023) Multimodal Chain-of-Thought Reasoning in Language Models
(12 Jun 2025) CogStream: Context-guided Streaming Video Question Answering
(6 Jun 2025) VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning
(27 Mar 2025) Video-R1: Reinforcing Video Reasoning in MLLMs
(17 Feb 2025) video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
(10 Feb 2025) CoS: Chain-of-Shot Prompting for Long Video Understanding
(8 Jan 2025) Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
(3 Dec 2024) VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
(2 Dec 2024) Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
(29 Nov 2024) STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning
(12 Oct 2024) Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning
(27 Sep 2024) Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
(28 Aug 2024) Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation
(24 May 2024) Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models
(7 May 2024) Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
(8 Oct 2024) Temporal Reasoning Transfer from Text to Video
(22 Jul 2025) Step-Audio 2 Technical Report
(14 Mar 2025) Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
(28 Jul 2025) Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
(20 Jun 2025) RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought
(17 Jun 2025) SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
(16 May 2025) Towards Self-Improvement of Diffusion Models via Group Preference Optimization
(16 May 2025) Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models
(15 May 2025) Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
(12 May 2025) DanceGRPO: Unleashing GRPO on Visual Generation
(8 May 2025) Flow-GRPO: Training Flow Matching Models via Online RL
(1 May 2025) T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
(22 Apr 2025) From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
(22 Apr 2025) Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning
(26 Mar 2025) MMGen: Unified Multi-modal Image Generation and Understanding in One Go
(13 Mar 2025) GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
(3 Mar 2025) MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation
(23 Jan 2025) Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
(22 Jul 2025) ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
(22 Jul 2025) Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
(12 Jun 2025) VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
(12 Jun 2025) MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
(6 Jun 2025) PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
(5 Jun 2025) VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
(5 Jun 2025) MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
(15 May 2025) StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation
(13 May 2025) VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models
(1 May 2025) MINERVA: Evaluating Complex Video Reasoning
(30 Apr 2025) GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling
(21 Apr 2025) IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
(21 Apr 2025) VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
(17 Apr 2025) Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark
(16 Apr 2025) FLIP Reasoning Challenge
(14 Apr 2025) VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge
(8 Apr 2025) ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering
(8 Apr 2025) V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
(8 Apr 2025) MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
(4 Apr 2025) Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
(15 Feb 2025) SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
(14 Feb 2025) MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
(13 Feb 2025) MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
(18 Dec 2024) Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
(22 Nov 2024) VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
(18 Oct 2024) MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps
(7 Jul 2024) VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool
(20 Jun 2024) MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
(12 Jun 2024) LVBench: An Extreme Long Video Understanding Benchmark
(24 Apr 2024) Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
(16 Apr 2024) OpenEQA: Embodied Question Answering in the Era of Foundation Models
(17 Aug 2023) EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
(23 May 2023) Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
(18 May 2021) NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
(12 Feb 2025) Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning
(7 Feb 2025) Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
(9 Dec 2024) Training Large Language Models to Reason in a Continuous Latent Space
https://github.com/Hui-design/Open-LLaVA-Video-R1
https://github.com/SkyworkAI/Skywork-R1V
https://huggingface.co/papers/2503.05379
https://github.com/Osilly/Vision-R1
https://github.com/ModalMinds/MM-EUREKA
https://github.com/OpenRLHF/OpenRLHF-M
https://github.com/Fancy-MLLM/R1-Onevision
https://github.com/om-ai-lab/VLM-R1
https://github.com/EvolvingLMMs-Lab/open-r1-multimodal
https://github.com/Deep-Agent/R1-V
https://github.com/TideDra/lmm-r1