This repo is the official implementation of Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning, accepted at AAAI 2025.
Paper Link: https://arxiv.org/pdf/2501.01120
Important
We have completed the description of the missing-modality setting used in our work and added a "News and Updates" section.
- 2025.8.12 🔨 We release the source code of the prompt-based baselines, including MAPs and MSPs, in this repo for easier follow-up and reproduction.
- 2025.5.15 🔥 The extension of RAGPT, dubbed REDEEM, is accepted at KDD 2025 (Round 2). The code for this extension will be released at https://github.com/Jian-Lang/REDEEM.
- 2024.12.10 🔥 Our work, RAGPT, which tackles incomplete multimodal learning in pre-trained multimodal Transformers, is accepted at AAAI 2025.
Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT’s robustness.
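To make the three-stage idea above concrete, here is a minimal, self-contained toy sketch written only for this README. It uses random numpy features, nearest-neighbor averaging, and pooled context as stand-ins; it is an illustration of the workflow, not the learned modules actually implemented in RAGPT.

```python
# Toy illustration of the three-stage idea (NOT the actual RAGPT implementation):
# (I) retrieve similar instances within each observed modality,
# (II) impute the missing modality from the retrieved neighbors, and
# (III) build an instance-specific (dynamic) prompt from the retrieved context.
import numpy as np

rng = np.random.default_rng(0)
db_text = rng.normal(size=(100, 16))    # text features of the retrieval base
db_image = rng.normal(size=(100, 16))   # image features of the retrieval base

def retrieve(query, db, k=5):
    """Within-modality retrieval: indices of the top-k cosine-similar neighbors."""
    sims = db @ query / (np.linalg.norm(db, axis=1) * np.linalg.norm(query) + 1e-8)
    return np.argsort(-sims)[:k]

# An instance whose image modality is missing: only its text feature is observed.
query_text = rng.normal(size=16)
idx = retrieve(query_text, db_text)             # (I) retrieval through the text channel
imputed_image = db_image[idx].mean(axis=0)      # (II) recover missing content from neighbors
dynamic_prompt = np.concatenate(                # (III) context-aware, instance-specific prompt
    [db_text[idx].mean(axis=0), db_image[idx].mean(axis=0)]
)
print(imputed_image.shape, dynamic_prompt.shape)
```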

We assume the training set is fully available and define the missing rate as the proportion of modality-incomplete instances.
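For illustration only, the snippet below shows one hypothetical way to simulate such a missing rate; it is not the repo's implementation (see the data scripts for the actual protocol) and it assumes a "missing both" style split in which half of the incomplete instances lose text and half lose image.

```python
# Hypothetical sketch of simulating a missing rate `eta` (NOT the repo's implementation).
import numpy as np

def simulate_missing(num_instances, eta, seed=0):
    """Mark a fraction `eta` of instances as modality-incomplete,
    split evenly between missing-text and missing-image."""
    rng = np.random.default_rng(seed)
    flags = np.array(["complete"] * num_instances, dtype=object)
    missing = rng.choice(num_instances, size=int(eta * num_instances), replace=False)
    half = len(missing) // 2
    flags[missing[:half]] = "missing_text"
    flags[missing[half:]] = "missing_image"
    return flags

print(simulate_missing(10, eta=0.7))
```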
First, clone this repo:
git clone https://github.com/Jian-Lang/RAGPT.git
cd RAGPT
First, create a new conda env for RAGPT:
conda create -n RAGPT python=3.9
Next, activate this env and install the dependencies from requirements.txt:
conda activate RAGPT
pip install -r requirements.txt
First, download the dataset from this link: https://archive.org/download/mmimdb/mmimdb.tar.gz
Then, place the raw images in the folder dataset/mmimdb/image and put the JSON files in the folder dataset/mmimdb/meta_data.
First, download the dataset from this link: https://www.kaggle.com/datasets/parthplc/facebook-hateful-meme-dataset
Then, place the raw images in the folder dataset/hatememes/image and put the JSON files in the folder dataset/hatememes/metadata.
Next, replace test.json in metadata with test_seen.json downloaded from this link: https://www.kaggle.com/datasets/williamberrios/hateful-memes, because the test.json from the previous link has no label information for evaluation. (Do not change the other files; only replace test.json with test_seen.json.)
First, download the dataset from this link: https://www.kaggle.com/datasets/gianmarco96/upmcfood101
Then, place the raw images in the folder dataset/food101/image and put the CSV files in the folder dataset/food101/meta_data.
Note: The image folder follows the structure dataset/xxx/image/xxx.jpg|png|jpeg
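As a quick sanity check (not part of the repo), the following snippet verifies that the folders described above exist with the expected names:

```python
# Optional sanity check (not part of the repo): confirm the dataset layout described above.
from pathlib import Path

expected = {
    "mmimdb":    ["image", "meta_data"],
    "hatememes": ["image", "metadata"],
    "food101":   ["image", "meta_data"],
}
for name, subdirs in expected.items():
    for sub in subdirs:
        path = Path("dataset") / name / sub
        print(f"{path}: {'ok' if path.exists() else 'MISSING'}")
```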
Run the following script to init the dataset:
sh src/scripts/init_data.sh
Run the following script to train our model and evaluate the results:
sh src/scripts/eval.sh
All the parameters have the same meanings as described in our paper, and you can configure them in src/config/config.yaml or via the command line.
Run the following script to train the baseline models and evaluate the results:
sh src/scripts/eval_baseline.sh
If you find the code useful for your research, please give us a star ⭐⭐⭐ and consider citing:
@inproceedings{lang2025retrievalaugmented,
author = {Lang, Jian and Cheng, Zhangtao and Zhong, Ting and Zhou, Fan},
booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
year = {2025},
pages = {18035--18043},
doi = {10.1609/aaai.v39i17.33984},
title = {Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning},
}