GML-FMGroup/Awesome-MLLM-Reasoning

Awesome-MLLM-Reasoning

We have witnessed the tremendous potential of pure reinforcement learning (RL) in enhancing LLM reasoning capabilities, and a growing body of research is now extending this potential to the multimodal domain. This repository is continuously updated with the latest papers on how reinforcement learning techniques can improve reasoning performance in multimodal tasks (e.g., visual question answering and cross-modal reasoning).

🚀 Here, you’ll be at the forefront of "RL + MLLM Reasoning" research! Whether you're a researcher, engineer, or student, this repository will help you quickly master how RL empowers complex reasoning and decision-making in MLLMs.

Watch & Star 🌟—let’s explore the future of multimodal intelligence together!

Papers 📄

  1. STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs. [code] Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, Wenbing Huang. Preprint'25

  2. Visual Agentic Reinforcement Fine-Tuning. [code] Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang. Preprint'25

  3. Visual Planning: Let's Think Only with Images. [code] Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić. Preprint'25

  4. GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning. Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi. Preprint'25

  5. OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning. [code] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng. Preprint'25

  6. DanceGRPO: Unleashing GRPO on Visual Generation. [project] [code] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo. Preprint'25

  7. Flow-GRPO: Training Flow Matching Models via Online RL. [code] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang. Preprint'25

  8. X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains. [code] Qianchu Liu, Sheng Zhang, Guanghui Qin, Timothy Ossowski, Yu Gu, Ying Jin, Sid Kiblawi, Sam Preston, Mu Wei, Paul Vozila, Tristan Naumann, Hoifung Poon. Preprint'25

  9. T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT. [code] Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li. Preprint'25

  10. Fast-Slow Thinking for Large Vision-Language Model Reasoning. Wenyi Xiao, Leilei Gan, Weilong Dai, Wanggui He, Ziwei Huang, Haoyuan Li, Fangxun Shu, Zhelun Yu, Peng Zhang, Hao Jiang, Fei Wu. Preprint'25

  11. SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models. [project] [code] [dataset] Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, Cihang Xie. Preprint'25

  12. Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning. [code] Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou. Preprint'25

  13. Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding. [code] Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu. Preprint'25

  14. Compile Scene Graphs with Reinforcement Learning. [code] Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, Chang Wen Chen. Preprint'25

  15. NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation. [code] Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, Michael Qizhe Shieh. Preprint'25

  16. Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning. [project] [code] Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, Wenwu Zhu. Preprint'25

  17. SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement. [code] Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang. Preprint'25

  18. Perception-R1: Pioneering Perception Policy with Reinforcement Learning. [code] En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, Wenbing Tao. Preprint'25

  19. VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning. [project] [code] [model] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen. Preprint'25

  20. Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning. [code] Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Kaipeng Zhang. Preprint'25

  21. VisRL: Intention-Driven Visual Perception via Reinforced Reasoning. [code] Zhangquan Chen, Xufang Luo, Dongsheng Li. Preprint'25

  22. Improved Visual-Spatial Reasoning via R1-Zero-Like Training. [code] Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, Zhijie Deng. Preprint'25

  23. Q-Insight: Understanding Image Quality via Visual Reinforcement Learning. [code] Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, Jian Zhang. Preprint'25

  24. UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning. [code] Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, Hongsheng Li. Preprint'25

  25. Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning. Ram Ramrakhya, Matthew Chang, Xavier Puig, Ruta Desai, Zsolt Kira, Roozbeh Mottaghi. Preprint'25

  26. Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1. Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu. Preprint'25

  27. Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models. Zhaoxin Li, Zhang Xi-Jia, Batuhan Altundas, Letian Chen, Rohan Paleja, Matthew Gombolay. Preprint'25

  28. GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing. Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li. Preprint'25

  29. Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme. [code] Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu. Preprint'25

  30. LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs. [project] [code] [model] Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan. Preprint'25

  31. LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL. [code] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, Xu Yang. Preprint'25

  32. R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model. [code] [model] Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh. Preprint'25

  33. R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning. [code] [model] Jiaxing Zhao, Xihan Wei, Liefeng Bo. Preprint'25

  34. Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement. [code] [model] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia. Preprint'25

  35. Visual-RFT: Visual Reinforcement Fine-Tuning. [code] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang. Preprint'25

  36. MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning. Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert. Preprint'25

  37. Virgo: A Preliminary Exploration on Reproducing o1-like MLLM. [code] Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen. Preprint'25

  38. Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step. [code] [model] Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng. Preprint'25

  39. R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization. [code] [model] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, Wei Chen. Preprint'25

  40. Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. [code] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, Shaohui Lin. Preprint'25

  41. Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models. [code] [model] Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu. Preprint'24

  42. LLaVA-CoT: Let Vision Language Models Reason Step-by-Step. [code] [model] Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, Li Yuan. Preprint'24

  43. Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models. [project] [code] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna. Preprint'24

  44. Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search. [code] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, Dacheng Tao. Preprint'24

  45. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization. [project] [code] [model] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai. Preprint'24

Benchmark & Evaluation🤗

  1. LMGAME-BENCH: How Good are LLMs at Playing Games? Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang. Preprint'25

  2. GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling. Siqi Li, Yufan Shen, Xiangnan Chen, Jiayi Chen, Hengwei Ju, Haodong Duan, Song Mao, Hongbin Zhou, Bo Zhang, Pinlong Cai, Licheng Wen, Botian Shi, Yong Liu, Xinyu Cai, Yu Qiao. Preprint'25

  3. MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science. Erle Zhu, Yadi Liu, Zhe Zhang, Xujun Li, Jin Zhou, Xinjie Yu, Minlie Huang, Hongning Wang. Preprint'25

  4. VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models. [code] Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu. Preprint'25

  5. Are Large Vision Language Models Good Game Players? [code] Xinyu Wang, Bohan Zhuang, Qi Wu. Preprint'25

  6. VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models. [project] [code] [dataset] Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu. Preprint'25

  7. EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges. [project] Clinton J. Wang, Dean Lee, Cristina Menghini, Johannes Mols, Jack Doughty, Adam Khoja, Jayson Lynch, Sean Hendryx, Summer Yue, Dan Hendrycks. Preprint'25

  8. NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models. Pranshu Pandya, Vatsal Gupta, Agney S Talwarr, Tushar Kataria, Dan Roth, Vivek Gupta. Preprint'25

  9. Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models. [code] [dataset] Ilias Stogiannidis, Steven McDonagh, Sotirios A. Tsaftaris. Preprint'25

  10. Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark. [project] [code] [dataset] Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng. Preprint'25

  11. VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge. [project] [code] [dataset] Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, Xiang Yue. Preprint'25

  12. V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models. [code] Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang. Preprint'25

  13. CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation. [code] [dataset] Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang. Preprint'25

  14. Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency. Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao. Preprint'25

  16. Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation. [project] [code] Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, Minhyuk Sung. Preprint'25

  17. Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark. [code] Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang. Preprint'25

  18. ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering. [code] Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty. Preprint'25

  19. GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning. Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, Bo Zheng. Preprint'25

  20. VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models. [project] [code] [dataset] Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, Filippos Kokkinos. Preprint'25

  21. VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning. [project] Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao. Preprint'25

  22. MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models. [dataset] Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang. Preprint'25

  23. LIVEVQA: Live Visual Knowledge Seeking. [dataset] Mingyang Fu, Yuyang Peng, Benlin Liu, Yao Wan, Dongping Chen. CVPR 2025

  24. MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts. [project] [code] Peijie Wang, Zhongzhi Li, Dekang Ran, Fei Yin, Chenglin Liu. CVPR 2025

  25. MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models. [project] [code] [dataset] Huanqia Cai, Yijun Yang, Winston Hu. Preprint'25

  26. MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency. [project] [code] [dataset] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li. Preprint'25

  27. ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models. [project] [code] [dataset] Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, David I. Atkinson, Aaditya Baranwal, Alexandru Coca, Mikah Dang, Sebastian Dziadzio, Jakob D. Kunz, Kaiqu Liang, Alexander Lo, Brian Pulfer, Steven Walton, Charig Yang, Kai Han, Samuel Albanie. Preprint'25

  28. MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning. [code] [model] [dataset] Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao. Preprint'25

  29. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. [project] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao. ICLR 2024

  30. MathScape: Evaluating MLLMs in Multimodal Math Scenarios through a Hierarchical Benchmark. [code] Minxuan Zhou, Hao Liang, Tianpeng Li, Zhiyu Wu, Mingan Lin, Linzhuang Sun, Yaqi Zhou, Yan Zhang, Xiaoqin Huang, Yicong Chen, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou. Preprint'24

  31. Template-Driven LLM-Paraphrased Framework for Tabular Math Word Problem Generation. [project] [code] [dataset] Xiaoqiang Kang, Zimu Wang, Xiaobo Jin, Wei Wang, Kaizhu Huang, Qiufeng Wang. Preprint'24

  32. How Far Are We from Intelligent Visual Deductive Reasoning? Yizhe Zhang, He Bai, Ruixiang Zhang, Jiatao Gu, Shuangfei Zhai, Josh Susskind, Navdeep Jaitly. Preprint'24

  33. Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset. [project] [code] [dataset] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, Hongsheng Li. Preprint'24

  34. VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning. Jingkun Ma, Runzhe Zhan, Derek F. Wong, Yang Li, Di Sun, Hou Pong Chan, Lidia S. Chao. Preprint'24

  35. MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [project] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, Hongsheng Li. ECCV 2024

  36. BLINK: Multimodal Large Language Models Can See but Not Perceive. [project] [code] [dataset] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna. ECCV 2024

  37. Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models. [code] Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, Roy Ka-Wei Lee. Preprint'24

  38. HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks. [code] Fengji Zhang, Linquan Wu, Huiyu Bai, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, Jacky Keung. Preprint'24

  39. FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts. Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, Dan Roth. Preprint'24

  40. GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training. [code] Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang, Conghui He, Botian Shi, Tao Chen, Junchi Yan, Bo Zhang. Preprint'24

  41. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. [code] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan. Preprint'23

  42. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. [code] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, Enamul Hoque. Preprint'22

  43. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. [code] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan. NeurIPS 2022

  44. GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning. [code] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, Liang Lin. Preprint'21

  45. FigureQA: An Annotated Figure Dataset for Visual Reasoning. Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Akos Kadar, Adam Trischler, Yoshua Bengio. Preprint'17

Survey📖

  1. Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models. [code] Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang. Preprint'25
  2. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey. [code] Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, William Wang, Ziwei Liu, Jiebo Luo, Hao Fei. Preprint'25
  3. Mind with Eyes: from Language Reasoning to Multimodal Reasoning. Zhiyu Lin, Yifei Gao, Xian Zhao, Yunfan Yang, Jitao Sang. Preprint'25
  4. A Survey on Multimodal Benchmarks: In the Era of Large AI Models. Lin Li, Guikun Chen, Hanrong Shi, Jun Xiao, Long Chen. Preprint'24

Contributing⭐

This is an active repository and your contributions are always welcome!
