Topic | Description |
---|---|
LVLM Model | Large multimodal models / Foundation Model |
Multimodal Benchmark & Dataset | π Interesting Multimodal Benchmark and Dataset |
LVLM Agent | Agent & Application of LVLM |
LVLM Hallucination | Benchmark & Methods for Hallucination |
Title | Venue/Date | Note | Code | Picture |
---|---|---|---|---|
MMMU: A Massive Multi-discipline Multimodal |
CVPR 2024 | 11K Multimodal Questions Reasoning Benchmark | project | |
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought |
ACL 2024 | Multimodal COT: Multi-step visual modal reasoning | project | ![]() |
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor |
MM 2024 | Multimodal Correction | Github | ![]() |
Right this way: Can VLMs Guide Us to See More to Answer Questions? |
NeurIPS 2024 | For visually impaired people | Github | ![]() |
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models |
NeurIPS 2024 | Multimodal Refinement 100K data | Project | ![]() |
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model |
EMNLP 2024 | Abstract Image Reasoning Benchmark | Project | ![]() |
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning |
AAAI 2025 | Math Reasoning & Weak2Strong Data |
Project | ![]() |
Multimodal Situational Safety |
ICLR 2025 | Multimodal Safety Benchmark | Project | ![]() |
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos |
ICLR 2025 | MMMU in Video QA | Project | ![]() |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? |
ICLR 2025 | High Resolution Image | Project | ![]() |
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining |
2025-01 | Educational Video to Textbook | Project | ![]() |
Holistic Evaluation for Interleaved Text-and-Image Generation |
EMNLP 2024 | Interleaved Text-Image Generation Benchmark | Project | ![]() |
A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation |
2024-11 | Interleaved T-I Generation More Scenarios Judge Model |
Project | ![]() |
An Enhanced MultiModal ReAsoning Benchmark |
2025-01 | Multimodal COT | Project | ![]() |
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models |
ICLR 2025 | Multimodal COT Interleaved Generation |
Project | ![]() |
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding |
ICLR 2025 | Physical Wold Understanding | Project | ![]() |
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark |
ICLR 2025 | MMMU Audio Version | Project | ![]() |
VLMaterial: Procedural Material Generation with Large Vision-Language Models |
ICLR 2025 | Material -> Python Code | Project | ![]() |
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models |
CVPR 2025 | Vision-language generative reward | Project | ![]() |
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning |
CVPR 2025 | Self-Critique | Project | ![]() |
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation |
CVPR 2025 | Unified MCQ for VQA Dataset | Project | ![]() |
Towards Universal Soccer Video Understanding |
CVPR 2025 | Soccer MLLM Event Classification Commentary Generation Foul Recognition |
Project | ![]() |
CVPR 2025 | Video CoT Benchmark Fine-Grained |
Project | ![]() |
|
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts |
CVPR 2025 | Multi-Images Math |
Project | ![]() |
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models |
CVPR 2025 | Long CoT | Project | ![]() |
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models |
2025-03 | VisualPRM | Project | ![]() |
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning |
CVPR 2025 | Critic for MM Reasoning | Project | ![]() |
TheoremExplainAgent: Towards Multimodal Explanations for LLM |
ACL 2025 | Visualization Reasoning | Project | ![]() |
Title | Venue/Date | Note | Code | Picture |
---|---|---|---|---|
Evaluating Object Hallucination in Large Vision-Language Models |
EMNLP 2023 | Simple Object Hallunicattion Evaluation - POPE | Github | ![]() |
Evaluation and Analysis of Hallucination in Large Vision-Language Models |
2023-10 | Hallunicattion Evaluation - HaELM | Github | ![]() |
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning |
2023-06 | GPT4-Assisted Visual Instruction Evaluation (GAVIE) & LRV-Instruction | Github | |
Woodpecker: Hallucination Correction for Multimodal Large Language Models |
2023-10 | First work to correct hallucinations in LVLMs | Github | ![]() |
Can We Edit Multimodal Large Language Models? |
EMNLP 2023 | Knowledge Editing Benchmark | Github | ![]() |
Grounding Visual Illusions in Language:Do Vision-Language Models Perceive Illusions Like Humans? |
EMNLP 2023 | Similar to human illusion? | Github | ![]() |