😎 Awesome-LVLMs

Related Collection

Our Paper Reading List

Topic	Description
LVLM Model	Large multimodal models / Foundation Model
Multimodal Benchmark & Dataset	😍 Interesting Multimodal Benchmark and Dataset
LVLM Agent	Agent & Application of LVLM
LVLM Hallucination	Benchmark & Methods for Hallucination

🏗️ LVLM Models

Title	Venue/Date	Note	Code
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	NeurIPS 2023	InstructBLIP	Github
Visual Instruction Tuning	NeurIPS 2023	LLaVA	GitHub
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	2023-04	mPLUG	Github
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	2023-04	MiniGPT-4	Github
TextBind: Multi-turn Interleaved Multimodal Instruction-following	2023-09	TextBind	Github
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing	2023-09	BLIP-Diffusion	Github
NExT-GPT: Any-to-Any Multimodal LLM	2023-09	NExT-GPT	Github
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions	ICLR 2024	Multi-image Reasoning	Github
Ferret: Refer and Ground Anything Anywhere at Any Granularity	ICLR 2024	Grounding	Github
LLaVA-OneVision: Easy Visual Task Transfer	Technical Report 2024-07	LLaVA-OV: Blog with details	Project
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution	Technical Report 2024-10	Qwen2-VL: Dynamic resolution & Multi-images & Video	Github
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding	Technical Report 2024-12	DeepSeek-VL2: MoE; Tiny: 1B, Small: 3B, DeepSeek-VL2: 5B	Github
DeepSeek-V3 Technical Report	Technical Report 2024-12	🧠 671B MoE parameters 🚀 37B activated 📚 14.8T tokens Blog	Project
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search	2024-12	Monte Carlo Tree Search MLLM	Project
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	ICLR 2025	Long Image Sequence	Project
Temporal Reasoning Transfer from Text to Video	ICLR 2025	Temporal Video	Project
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos	2025-01	Learning from video without an LLM	Project
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding	2025-01	Video LLaMA Series	Project
Qwen2.5-VL Technical Report	2025-02	Qwen2.5-VL	Project
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering	CVPR 2025	Self-RAG & Multimodal	Project
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant	CVPR 2025	Universal Retrieval	Project
Visual-RFT: Visual Reinforcement Fine-Tuning	2025-03	RL MLLM	Project
Seed1.5-VL Technical Report	2025-04	MLLM RL	Homepage
MiMo-VL Technical Report	2025-05	MLLM	Project

📆 Multimodal Benchmark & Dataset

Title	Venue/Date	Note	Code
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI	CVPR 2024	11K Multimodal Questions Reasoning Benchmark	Project
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought	ACL 2024	Multimodal CoT: Multi-step visual modal reasoning	Project
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor	MM 2024	Multimodal Correction	Github
Right this way: Can VLMs Guide Us to See More to Answer Questions?	NeurIPS 2024	For visually impaired people	Github
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models	NeurIPS 2024	Multimodal Refinement 100K data	Project
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model	EMNLP 2024	Abstract Image Reasoning Benchmark	Project
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning	AAAI 2025	Math Reasoning & Weak2Strong Data	Project
Multimodal Situational Safety	ICLR 2025	Multimodal Safety Benchmark	Project
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos	ICLR 2025	MMMU in Video QA	Project
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?	ICLR 2025	High Resolution Image	Project
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining	2025-01	Educational Video to Textbook	Project
Holistic Evaluation for Interleaved Text-and-Image Generation	EMNLP 2024	Interleaved Text-Image Generation Benchmark	Project
A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation	2024-11	Interleaved text-image generation; more scenarios; judge model	Project
An Enhanced MultiModal ReAsoning Benchmark	2025-01	Multimodal COT	Project
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models	ICLR 2025	Multimodal COT Interleaved Generation	Project
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding	ICLR 2025	Physical World Understanding	Project
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark	ICLR 2025	MMMU Audio Version	Project
VLMaterial: Procedural Material Generation with Large Vision-Language Models	ICLR 2025	Material -> Python Code	Project
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models	CVPR 2025	Vision-language generative reward	Project
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning	CVPR 2025	Self-Critique	Project
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation	CVPR 2025	Unified MCQ for VQA Dataset	Project
Towards Universal Soccer Video Understanding	CVPR 2025	Soccer MLLM: event classification, commentary generation, foul recognition	Project
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection	CVPR 2025	Video CoT Benchmark Fine-Grained	Project
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts	CVPR 2025	Multi-Images Math	Project
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models	CVPR 2025	Long CoT	Project
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning	2025-03	VisualPRM	Project
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning	CVPR 2025	Critic for MM Reasoning	Project
TheoremExplainAgent: Towards Multimodal Explanations for LLM	ACL 2025	Visualization Reasoning	Project

🎛️ LVLM Agent

Title	Venue/Date	Note	Code
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action	2023-03	MM-REACT	Github
Visual Programming: Compositional visual reasoning without training	CVPR 2023 Best Paper	VISPROG (Similar to ViperGPT)	Github
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face	2023-03	HuggingGPT	Github
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models	2023-04	Chameleon	Github
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models	2023-05	IdealGPT	Github
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn	2023-06	AssistGPT	Github
A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning	ACM MM 2024	Multi-Agent Debate	Github
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models	NeurIPS 2024	Draw to facilitate reasoning	Project
Visual Agentic AI for Spatial Reasoning with a Dynamic API	CVPR 2025	New API Visual Program	Project
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning	2025-03	Agent LoRA	Project
Magma: A Foundation Model for Multimodal AI Agents	CVPR 2025	Multimodal Agent	Project
Olympus: A Universal Task Router for Computer Vision Tasks	CVPR 2025	Universal Task Router	Project

🤕 LVLM Hallucination

Title	Venue/Date	Note	Code
Evaluating Object Hallucination in Large Vision-Language Models	EMNLP 2023	Simple Object Hallucination Evaluation - POPE	Github
Evaluation and Analysis of Hallucination in Large Vision-Language Models	2023-10	Hallucination Evaluation - HaELM	Github
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning	2023-06	GPT4-Assisted Visual Instruction Evaluation (GAVIE) & LRV-Instruction	Github
Woodpecker: Hallucination Correction for Multimodal Large Language Models	2023-10	First work to correct hallucinations in LVLMs	Github
Can We Edit Multimodal Large Language Models?	EMNLP 2023	Knowledge Editing Benchmark	Github
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?	EMNLP 2023	Do VLMs perceive illusions the way humans do?	Github

Gary-code/Awesome-LVLM-paper