👏 Welcome to the Awesome-Reasoning-MLLM repository! This repository is a curated collection of influential papers, code, datasets, benchmarks, and resources on Reasoning in Multi-Modal Large Language Models (MLLMs) and Vision-Language Models (VLMs).
Feel free to ⭐ star and fork this repository to keep up with the latest advancements and contribute to the community.
- [LMM-R1, 2503] LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities [Paper📑] [Code🔧]
- [R1-VL, 2503] R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [Paper📑] [Code🔧]
- [Vision-R1, 2503] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [Paper📑] [Code🔧]
- [VisualThinker-R1-Zero, 2503] R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper📑] [Code🔧]
- [MM-Eureka, 2503] MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [Paper📑] [Code🔧]
- [Visual-RFT, 2503] Visual-RFT: Visual Reinforcement Fine-Tuning [Paper📑] [Code🔧]
- [VLM-R1] VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Code🔧]
- [R1-V, 2502] R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 [Code🔧]
- [AStar, 2502] Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking [Paper📑]
- [MM-Verify, 2502] MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification [Paper📑] [Code🔧]
- [LlamaV-o1, 2501] LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [Paper📑] [Code🔧]
- [Mulberry, 2412] Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [Paper📑] [Code🔧]
- [Insight-V, 2411] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [Paper📑] [Code🔧]
- [LLaVA-CoT, 2411] LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [Paper📑] [Code🔧]
- [LLaVA-Reasoner, 2410] Improve Vision Language Model Chain-of-thought Reasoning [Paper📑] [Code🔧]
- [Visual-CoT, 2403] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [Paper📑] [Code🔧]