This repository collects and categorizes papers on Multimodal Retrieval-Augmented Generation (RAG), organized according to our survey paper: Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. Given the rapid growth of this field, we will continuously update both the paper and this repository so that it serves as a resource for researchers working on future projects.
- June 2, 2025: An enhanced version of our paper is now available on arXiv! This update adds newly published related papers and covers new topics such as agentic interaction and audio-centric retrieval.
- May 15, 2025: Our paper has been accepted to Findings of ACL 2025.
- April 18, 2025: Our project website is now live.
- February 17, 2025: We released the first survey on Multimodal Retrieval-Augmented Generation. Feel free to cite it, contribute, or open a pull request to add recent related papers!
- General Pipeline
- Taxonomy of Recent Advances and Enhancements
- Taxonomy of Application Domains
- Abstract
- Overview of Popular Datasets
- Papers
- RAG-related Surveys
- Retrieval Strategies Advances
- Fusion Mechanisms
- Augmentation Techniques
- Generation Techniques
- Training Strategies and Loss Functions
- Robustness and Noise Management
- Tasks Addressed by Multimodal RAGs
- Evaluation Metrics
- Citations
- Contact
Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG.
This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches in detail, and explore the diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases.
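To make the general pipeline concrete, below is a minimal, library-agnostic sketch of the retrieve-then-generate loop that most multimodal RAG systems follow. The `embed_text`, `embed_image`, and `generate` callables are hypothetical placeholders for your own encoders (e.g., a CLIP-style model) and generator (e.g., a vision-language LLM); this illustrates the pattern, not the method of any specific paper surveyed here.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(query_vec, corpus, k=3):
    # corpus: list of (document_text, document_vector) pairs indexed offline.
    scored = [(doc, cosine_sim(query_vec, vec)) for doc, vec in corpus]
    return [doc for doc, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:k]]

def multimodal_rag_answer(question, image, corpus, embed_text, embed_image, generate):
    # Fuse the text and image queries by simple vector averaging
    # (one of many fusion options discussed in the survey).
    query_vec = (embed_text(question) + embed_image(image)) / 2.0
    context = retrieve(query_vec, corpus)
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"
    # The generator conditions on both the retrieved context and the image.
    return generate(prompt, image)
```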
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
LAION-400M | 400M image–text pairs; used for pre-training multimodal models. | Image, Text | LAION-400M |
Conceptual-Captions (CC) | 15M image–caption pairs; multilingual English–German image descriptions. | Image, Text | Conceptual Captions |
CIRR | 36,554 triplets from 21,552 images; focuses on natural image relationships. | Image, Text | CIRR |
MS-COCO | 330K images with captions; used for caption-to-image and image-to-caption generation. | Image, Text | MS-COCO |
Flickr30K | 31K images annotated with five English captions per image. | Image, Text | Flickr30K |
Multi30K | 30K German captions from native speakers and human-translated captions. | Image, Text | Multi30K |
NoCaps | For zero-shot image captioning evaluation; 15K images. | Image, Text | NoCaps |
Laion-5B | 5B image–text pairs used as external memory for retrieval. | Image, Text | LAION-5B |
COCO-CN | 20,341 images for cross-lingual tagging and captioning with Chinese sentences. | Image, Text | COCO-CN |
CIRCO | 1,020 queries with an average of 4.53 ground truths per query; for composed image retrieval. | Image, Text | CIRCO |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
BDD-X | 77 hours of driving videos with expert textual explanations; for explainable driving behavior. | Video, Text | BDD-X |
YouCook2 | 2,000 cooking videos with aligned descriptions; focused on video–text tasks. | Video, Text | YouCook2 |
ActivityNet | 20,000 videos with multiple captions; used for video understanding and captioning. | Video, Text | ActivityNet |
SoccerNet | Videos and metadata for 550 soccer games; includes transcribed commentary and key event annotations. | Video, Text | SoccerNet |
MSR-VTT | 10,000 videos with 20 captions each; a large video description dataset. | Video, Text | MSR-VTT |
MSVD | 1,970 videos with approximately 40 captions per video. | Video, Text | MSVD |
LSMDC | 118,081 video–text pairs from 202 movies; a movie description dataset. | Video, Text | LSMDC |
DiDemo | 10,000 videos with four concatenated captions per video; with temporal localization of events. | Video, Text | DiDemo |
Breakfast | 1,712 videos of breakfast preparation; one of the largest fully annotated video datasets. | Video, Text | Breakfast |
COIN | 11,827 instructional YouTube videos across 180 tasks; for comprehensive instructional video analysis. | Video, Text | COIN |
MSRVTT-QA | Video question answering benchmark built on MSR-VTT videos and captions. | Video, Text | MSRVTT-QA |
MSVD-QA | 1,970 video clips with approximately 50.5K QA pairs; video QA dataset. | Video, Text | MSVD-QA |
ActivityNet-QA | 58,000 human-annotated QA pairs on 5,800 videos; benchmark for video QA models. | Video, Text | ActivityNet-QA |
EpicKitchens-100 | 700 videos (100 hours of cooking activities) for online action prediction; egocentric vision dataset. | Video, Text | EPIC-KITCHENS-100 |
Ego4D | 4.3M video–text pairs for egocentric videos; massive-scale egocentric video dataset. | Video, Text | Ego4D |
HowTo100M | 136M video clips with captions from 1.2M YouTube videos; for learning text–video embeddings. | Video, Text | HowTo100M |
CharadesEgo | 68,536 activity instances from ego–exo videos; used for evaluation. | Video, Text | Charades-Ego |
ActivityNet Captions | 20K videos with 3.7 temporally localized sentences per video; dense-captioning events in videos. | Video, Text | ActivityNet Captions |
VATEX | 34,991 videos, each with multiple captions; a multilingual video-and-language dataset. | Video, Text | VATEX |
Charades | 9,848 video clips with textual descriptions; a multimodal research dataset. | Video, Text | Charades |
WebVid | 10M video–text pairs (refined to WebVid-Refined-1M). | Video, Text | WebVid |
Youku-mPLUG | Chinese dataset with 10M video–text pairs (refined to Youku-Refined-1M). | Video, Text | Youku-mPLUG |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
LibriSpeech | 1,000 hours of read English speech with corresponding text; ASR corpus based on audiobooks. | Audio, Text | LibriSpeech |
SpeechBrown | 55K paired speech-text samples; 15 categories covering diverse topics from religion to fiction. | Audio, Text | SpeechBrown |
AudioCap | 46K audio clips paired with human-written text captions. | Audio, Text | AudioCaps |
AudioSet | 2M human-labeled sound clips from YouTube across diverse audio event classes (e.g., music or environmental). | Audio | AudioSet |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
MIMIC-CXR | 125,417 labeled chest X-rays with reports; widely used for medical imaging research. | Image, Text | MIMIC-CXR |
CheXpert | 224,316 chest radiographs of 65,240 patients; focused on medical analysis. | Image, Text | CheXpert |
MIMIC-III | Health-related data from over 40K patients; includes clinical notes and structured data. | Text | MIMIC-III |
IU-Xray | 7,470 pairs of chest X-rays and corresponding diagnostic reports. | Image, Text | IU-Xray |
PubLayNet | 100,000 training and 2,160 test samples derived from PubMed Central articles; used for document layout analysis. | Image, Text | PubLayNet |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
Fashion-IQ | 77,684 images across three categories; evaluated with Recall@10 and Recall@50 metrics. | Image, Text | Fashion-IQ |
FashionGen | 260.5K image–text pairs of fashion images and item descriptions. | Image, Text | FashionGen |
VITON-HD | 83K images for virtual try-on; high-resolution clothing items dataset. | Image, Text | VITON-HD |
Fashionpedia | 48,000 fashion images annotated with segmentation masks and fine-grained attributes. | Image, Text | Fashionpedia |
DeepFashion | Approximately 800K diverse fashion images for pseudo triplet generation. | Image, Text | DeepFashion |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
VQA | 400K QA pairs with images for visual question-answering tasks. | Image, Text | VQA |
PAQ | 65M text-based QA pairs; a large-scale dataset for open-domain QA tasks. | Text | PAQ |
ELI5 | 270K complex questions augmented with web pages and images; designed for long-form QA tasks. | Text | ELI5 |
OK-VQA | 14K questions requiring external knowledge for visual question answering tasks. | Image, Text | OK-VQA |
WebQA | 46K queries requiring reasoning across text and images; multimodal QA dataset. | Text, Image | WebQA |
Infoseek | Fine-grained visual knowledge retrieval using a Wikipedia-based knowledge base (~6M passages). | Image, Text | Infoseek |
ClueWeb22 | 10 billion web pages organized into subsets; a large-scale web corpus for retrieval tasks. | Text | ClueWeb22 |
MOCHEG | 15,601 claims annotated with truthfulness labels and accompanied by textual and image evidence. | Text, Image | MOCHEG |
VQA v2 | 1.1M questions (augmented with VG-QA questions) for fine-tuning VQA models. | Image, Text | VQA v2 |
A-OKVQA | Benchmark for visual question answering using world knowledge; around 25K questions. | Image, Text | A-OKVQA |
XL-HeadTags | 415K news headline-article pairs spanning 20 languages across six diverse language families. | Text | XL-HeadTags |
SEED-Bench | 19K multiple-choice questions with accurate human annotations across 12 evaluation dimensions. | Text | SEED-Bench |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
ImageNet | 14M labeled images across thousands of categories; used as a benchmark in computer vision research. | Image | ImageNet |
Oxford Flowers102 | Dataset of flowers with 102 categories for fine-grained image classification tasks. | Image | Oxford Flowers102 |
Stanford Cars | Images of different car models (five examples per model); used for fine-grained categorization tasks. | Image | Stanford Cars |
GeoDE | 61,940 images from 40 classes across six world regions; emphasizes geographic diversity in object recognition. | Image | GeoDE |
- RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing
- Retrieval-Augmented Generation for Large Language Models: A Survey
- Retrieval-Augmented Generation for Natural Language Processing: A Survey
- Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
- Graph Retrieval-Augmented Generation for Large Language Models: A Survey
- Graph Retrieval-Augmented Generation: A Survey
- Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make Your LLMs Use External Data More Wisely
- Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
- A Survey on Retrieval-Augmented Text Generation for Large Language Models
- Searching for Best Practices in Retrieval-Augmented Generation
- Old IR Methods Meet RAG
- A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models
- Benchmarking Large Language Models in Retrieval-Augmented Generation
- A Survey on Retrieval-Augmented Text Generation
- Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
- ADQ: Adaptive Dataset Quantization
- Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search
- DeeperImpact: Optimizing Sparse Learned Index Structures
- MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality
- BanditMIPS: Faster Maximum Inner Product Search in High Dimensions
- Revisiting Neural Retrieval on Accelerators
- Query-Aware Quantization for Maximum Inner Product Search
- RA-CM3: Retrieval-Augmented Multimodal Language Modeling
- FARGO: Fast Maximum Inner Product Search via Global Multi-Probing
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
- TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
- ScaNN: Accelerating large-scale inference with anisotropic vector quantization
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
- Mi-RAG: Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering
- VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
- MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin
- Ovis: Structural Embedding Alignment for Multimodal Large Language Model
- UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
- UniVL-DR: Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning
- FLAVA: A Foundational Language And Vision Alignment Model
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- M2RAG: Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines
- OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
- CRAG: Corrective Retrieval Augmented Generation
- PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers
- XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags
- RAFT: Adapting Language Model to Domain Specific RAG
- BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
- GTE: Towards General Text Embeddings with Multi-stage Contrastive Learning
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator
- Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
- VISA: Retrieval Augmented Generation with Visual Source Attribution
- eCLIP: Improving Medical Multi-modal Contrastive Learning with Expert Annotations
- EchoSight: Advancing Visual-Language Models with Wiki Knowledge
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
- VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
- Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
- VideoRAG: Retrieval-Augmented Generation over Video Corpus
- VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
- Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
- DrVideo: Document Retrieval Based Long Video Understanding
- OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval
- iRAG: Advancing RAG for Videos with an Incremental Approach
- CTCH: Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
- CA-CLAP: Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
- RECAP: Retrieval-Augmented Audio Captioning
- SpeechRAG: Speech Retrieval-Augmented Generation without Automatic Speech Recognition
- WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
- SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation
- Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation
- DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
- P2PCAP: Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning
- LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation
- Contextual ASR with Retrieval Augmented Large Language Model
- SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding
- ColPali: Efficient Document Retrieval with Vision Language Models
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
- DSE: Unifying Multimodal Retrieval via Document Screenshot Embedding
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
- CREAM: Coarse-to-Fine Retrieval and Multi-modal Efficient Tuning for Document VQA
- ColQwen2: Enhancing Vision-Language Model's Perception of the World at Any Resolution
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
- DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding
- Robust Multi Model RAG Pipeline For Documents Containing Text, Table & Images
- ViTLP: Visually Guided Generative Text-Layout Pre-training for Document Intelligence
- Hybrid RAG-empowered Multi-modal LLM for Secure Data Management in Internet of Medical Things: A Diffusion-based Contract Approach
- MSIER: How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
- Re-ranking the Context for Multimodal Retrieval Augmented Generation
- RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance
- mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
- OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
- RAGTrans: Retrieval-Augmented Hypergraph for Multimodal Social Media Popularity Prediction
- LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval
- EgoInstructor: Retrieval-Augmented Egocentric Video Captioning
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
- MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering
- MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation
- RAFT: Adapting Language Model to Domain Specific RAG
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
- MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
- Large Language Models Know What is Key Visual Entity: An LLM-assisted Multimodal Retrieval for VQA
- VISA: Retrieval Augmented Generation with Visual Source Attribution
- Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator
- AlzheimerRAG: Multimodal Retrieval Augmented Generation for PubMed articles
- EMERGE: Integrating RAG for Improved Multimodal EHR Predictive Modeling
- Retrieval-Augmented Hypergraph for Multimodal Social Media Popularity Prediction
- M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- Retrieval-Augmented Egocentric Video Captioning
- MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
- Hybrid RAG-Empowered Multi-Modal LLM for Secure Data Management in Internet of Medical Things: A Diffusion-Based Contract Approach
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
- Self-adaptive Multimodal Retrieval-Augmented Generation
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity - CVPR 2024
- Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval
- PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
- Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
- Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering
- EMERGE: Enhancing Multimodal Electronic Health Records Predictive Modeling with Retrieval-Augmented Generation
- Img2Loc: Revisiting Image Geolocalization Using Multi-Modality Foundation Models and Image-Based Retrieval-Augmented Generation
- Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
- mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
- RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
- OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
- Self-adaptive Multimodal Retrieval-Augmented Generation
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- Enhancing Multi-modal Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? (MSIER)
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- Retrieval-Augmented Multimodal Language Modeling (RA-CM3)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
- RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
- Self-adaptive Multimodal Retrieval-Augmented Generation
- LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
- Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning
- MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
- mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL)
- SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
- InstructBLIP: towards general-purpose vision-language models with instruction tuning
- MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering
- VISA: Retrieval Augmented Generation with Visual Source Attribution
- OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
- Improving Medical Multi-modal Contrastive Learning with Expert Annotations
- EchoSight: Advancing Visual-Language Models with Wiki Knowledge
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- HACL: Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
- AlzheimerRAG: Multimodal Retrieval Augmented Generation for PubMed articles
- Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal, Multilingual Retrieval
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
- MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning
- RAGTrans: Retrieval-Augmented Hypergraph for Multimodal Social Media Popularity Prediction
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL)
- RA-CM3: Retrieval-Augmented Multimodal Language Modeling
- Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
- Self-adaptive Multimodal Retrieval-Augmented Generation
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-Augmented Generation via Knowledge-Enhanced Reranking and Noise-Injected Training (RagVL)
- RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
- M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
- RAMM: Retrieval-Augmented Biomedical Visual Question Answering with Multi-modal Pre-training
- Retrieval-Augmented Multimodal Language Modeling
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
- Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
- Hybrid RAG-Empowered Multi-Modal LLM for Secure Data Management in Internet of Medical Things: A Diffusion-Based Contract Approach
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- AsthmaBot: Multi-modal, Multi-Lingual Retrieval Augmented Generation For Asthma Patient Support
- REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models
- Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning
- DocPrompting: Generating Code by Retrieving the Docs
- RACE: Retrieval-Augmented Commit Message Generation
- Retrieval Augmented Code Generation and Summarization
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- Multi-modal Retrieval Augmented Generation for Product Query
- LLM4DESIGN: An Automated Multi-Modal System for Architectural and Environmental Design
- SoccerRAG: Multimodal Soccer Information Retrieval via Natural Queries
- Predicting Micro-video Popularity via Multi-modal Retrieval Augmentation
- ENWAR: A RAG-empowered Multi-Modal LLM Framework for Wireless Environment Perception
- Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications
- Img2Loc: Revisiting Image Geolocalization Using Multi-Modality Foundation Models and Image-Based Retrieval-Augmented Generation
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- Recall@K, Precision@K, F1 Score, and MRR (a short computation sketch follows the paper list below):
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
- EMERGE: Integrating RAG for Improved Multimodal EHR Predictive Modeling
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- Large Language Models Know What is Key Visual Entity: An LLM-assisted Multimodal Retrieval for VQA
- Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering
- RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
- Self-adaptive Multimodal Retrieval-Augmented Generation
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL)
- M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval
- Retrieval-Augmented Egocentric Video Captioning
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
- REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
- VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
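A minimal computation sketch for these retrieval metrics, assuming each query comes with a ranked list of retrieved item IDs and a set of relevant IDs (the example data below is made up):

```python
def precision_recall_at_k(retrieved, relevant, k):
    # retrieved: ranked list of item IDs; relevant: set of ground-truth IDs.
    top_k = retrieved[:k]
    hits = len(set(top_k) & relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def mean_reciprocal_rank(all_retrieved, all_relevant):
    # Average of 1/rank of the first relevant item per query (0 if none found).
    reciprocal_ranks = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rank = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Toy example: ground truth {"img_7"}, system ranking ["img_3", "img_7", "img_1"].
p, r = precision_recall_at_k(["img_3", "img_7", "img_1"], {"img_7"}, k=2)
print(p, r, f1_score(p, r))                                              # 0.5 1.0 0.667
print(mean_reciprocal_rank([["img_3", "img_7", "img_1"]], [{"img_7"}]))  # 0.5
```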
- min(+P, Se): the minimum of precision (+P) and sensitivity (Se), providing a balanced measure of model performance (a small example follows the list below).
- EMERGE: Integrating RAG for Improved Multimodal EHR Predictive Modeling
- REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models
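A small illustration of min(+P, Se) using made-up confusion-matrix counts:

```python
def min_p_se(tp, fp, fn):
    # +P (precision) and Se (sensitivity/recall); the score is their minimum,
    # so a model only scores well when both are high.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return min(precision, sensitivity)

print(min_p_se(tp=80, fp=20, fn=40))  # precision 0.80, sensitivity ~0.67 -> 0.67
```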
- Fluency (FL):
- Accuracy:
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- MRAG-BENCH: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
- Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation
- mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
- How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-Augmented Generation via Knowledge-Enhanced Reranking and Noise-Injected Training (RagVL)
- Advanced Embedding Techniques in Multimodal Retrieval Augmented Generation: A Comprehensive Study on Cross Modal AI Applications
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Retrieval Meets Reasoning: Even High-School Textbook Knowledge Benefits Multimodal Reasoning
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
- Fréchet Inception Distance (FID), CLIP Score, Kernel Inception Distance (KID), and Inception Score (IS) (a sketch for FID and CLIPScore follows the list below):
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
- Retrieval-Augmented Multimodal Language Modeling (RA-CM3)
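An illustrative computation of FID from precomputed Inception activations and of CLIPScore from precomputed CLIP embeddings; the feature extraction itself (Inception network, CLIP encoders) is assumed to happen elsewhere:

```python
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    # real_feats, gen_feats: (N, D) arrays of Inception activations.
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma1 = np.cov(real_feats, rowvar=False)
    sigma2 = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def clip_score(image_emb, text_emb, w=100.0):
    # CLIPScore: w * max(cosine(image, text), 0) on L2-normalized embeddings.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(w * max(np.dot(image_emb, text_emb), 0.0))
```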
- Consensus-Based Image Description Evaluation (CIDEr):
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Retrieval-Augmented Egocentric Video Captioning
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- MSIER: How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
- Retrieval-Augmented Multimodal Language Modeling (RA-CM3)
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
- SPICE:
- SPIDEr:
- Fréchet Audio Distance (FAD), Overall Quality (OVL), and Text Relevance (REL):
- BLEU, METEOR, and ROUGE-L (a scoring sketch follows the list below):
- Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- AsthmaBot: Multi-modal, Multi-Lingual Retrieval Augmented Generation For Asthma Patient Support
- Advanced Embedding Techniques in Multimodal Retrieval Augmented Generation: A Comprehensive Study on Cross Modal AI Applications
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Retrieval-Augmented Egocentric Video Captioning
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
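A short scoring sketch using the nltk and rouge-score packages (smoothing choices and API details vary across versions, so treat this as illustrative):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the chest x-ray shows no acute abnormality"
candidate = "chest x-ray shows no acute abnormality"

# Sentence-level BLEU on tokenized text, with smoothing for short sequences.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
# ROUGE-L F-measure (longest common subsequence overlap).
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(
    reference, candidate
)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L: {rouge_l:.3f}")
```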
- Exact Match (EM) (a normalization sketch follows the list below):
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
- Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering
- Large Language Models Know What is Key Visual Entity: An LLM-assisted Multimodal Retrieval for VQA
- Self-adaptive Multimodal Retrieval-Augmented Generation
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL)
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
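A typical Exact Match implementation normalizes strings (lowercasing, stripping punctuation and articles, collapsing whitespace) before comparing; normalization details vary by benchmark, so the helper below is only illustrative:

```python
import re
import string

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    # 1.0 if the normalized prediction matches any normalized reference answer.
    return float(any(normalize(prediction) == normalize(gt) for gt in ground_truths))

print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # 1.0
```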
- BERTScore (see the usage sketch below):
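A minimal usage sketch with the bert-score package (model selection and baseline rescaling options affect the absolute numbers):

```python
from bert_score import score

candidates = ["the model retrieves relevant images"]
references = ["the system retrieves relevant pictures"]

# Returns per-example precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```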
- Spearman's Rank Correlation (SRC):
- Average Retrieval Time per Query:
- FLOPs (Floating Point Operations):
- Response Time:
- Execution Time:
- Average Retrieval Number (ARN):
- Clinical Relevance (CR):
- Geodesic Distance:
This README is a work in progress and will be completed soon. Stay tuned for more updates!
If you find our paper or repository useful, please cite the paper:
@misc{abootorabi2025askmodalitycomprehensivesurvey,
title={Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation},
author={Mohammad Mahdi Abootorabi and Amirhosein Zobeiri and Mahdi Dehghani and Mohammadali Mohammadkhani and Bardia Mohammadi and Omid Ghahroodi and Mahdieh Soleymani Baghshah and Ehsaneddin Asgari},
year={2025},
eprint={2502.08826},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.08826},
}
If you have questions, please send an email to mahdi.abootorabi2@gmail.com.