[CVPR 2025] CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, Liang He.
Multimodal large language models (MLLMs) have garnered widespread attention from researchers due to their remarkable understanding and generation capabilities in visual-language tasks (e.g., visual question answering). However, the rapid pace of knowledge updates in the real world makes offline training of MLLMs costly, and when faced with non-stationary data streams, MLLMs suffer from catastrophic forgetting during learning. In this paper, we propose an MLLM-based dual-momentum Mixture-of-Experts (CL-MoE) framework for continual visual question answering (VQA). We integrate MLLMs with continual learning to exploit the rich commonsense knowledge in LLMs. We introduce a Dual-Router MoE (RMoE) strategy that selects global and local experts using task-level and instance-level routers, so that weights are robustly assigned to the experts most appropriate for the task. We then design a dynamic Momentum MoE (MMoE) that updates expert parameters dynamically based on the relationships between experts and tasks/instances, allowing the model to absorb new knowledge while retaining existing knowledge. Extensive experimental results show that our method achieves state-of-the-art performance on 10 VQA tasks, demonstrating the effectiveness of our approach.
- Clone this repository and navigate to the CL-MoE folder
```Shell
git clone https://github.com/ECNU-ICALK/CL-MoE.git
cd CL-MoE
```
- Install Package
```Shell
conda create -n clmoe python=3.10 -y
conda activate clmoe
pip install --upgrade pip
pip install -e .
```
- Install additional packages for training cases
```Shell
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
This repo is based on LLaVA. If you run into a problem, you may find a solution in the LLaVA issues.
Please download the images from the COCO 2014 dataset, including train2014 and val2014.
Please download the instruction data from CL4VQA.
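As a reference, here is one way to fetch the COCO 2014 images from the official COCO download links; the destination directory below is only an assumed example and should match whatever image root your training scripts expect.
```Shell
# Fetch COCO 2014 images (official COCO download links).
# The destination directory is an example only; point your scripts at wherever you unpack these.
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
unzip train2014.zip -d ./playground/data/coco2014
unzip val2014.zip -d ./playground/data/coco2014
```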
First, download the pretrained projectors from the LLaVA Model Zoo. Set `pretrain_mm_mlp_adapter` to the projector path. You can also modify the DeepSpeed config if you need different DeepSpeed settings.
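As a rough illustration, the sketch below shows where these two settings typically appear in a LLaVA-style training command; the projector path shown is an assumption, and the full argument list should be taken from the provided training scripts rather than from this excerpt.
```Shell
# Hypothetical excerpt; the real, complete commands are in scripts/CLMoE/Train.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b-pretrain/mm_projector.bin
    # ...plus the remaining model/data arguments from the provided scripts
```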
We provide the scripts for our training order in `scripts/CLMoE/Train`. Note that the `output_dir` of the previous script serves as the `previous_task_model_path` of the next training run, as sketched below. You can then tune on these datasets in your own order.
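A minimal sketch of chaining two consecutive tasks; the script names below are placeholders (use the actual scripts in `scripts/CLMoE/Train`), and the checkpoint path is only an example of how the previous task's output feeds the next task.
```Shell
# Hypothetical chaining of two tasks; substitute the real script names from scripts/CLMoE/Train.
# Inside train_task1.sh: output_dir=./checkpoints/task1
bash scripts/CLMoE/Train/train_task1.sh
# Inside train_task2.sh: previous_task_model_path=./checkpoints/task1  (i.e., task 1's output_dir)
bash scripts/CLMoE/Train/train_task2.sh
```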
We have prepared scripts to evaluate the trained model in `scripts/CLMoE/Eval`.
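For example (the script name below is a placeholder; use the actual scripts under `scripts/CLMoE/Eval`):
```Shell
# Placeholder invocation; replace with the actual evaluation script name.
bash scripts/CLMoE/Eval/eval_task.sh
```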
```bibtex
@article{huai2025cl,
  title={CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering},
  author={Huai, Tianyu and Zhou, Jie and Wu, Xingjiao and Chen, Qin and Bai, Qingchun and Zhou, Ze and He, Liang},
  journal={arXiv preprint arXiv:2503.00413},
  year={2025}
}
```
LLaVA: the codebase we build upon, and our base model LLaVA-1.5-7B with its amazing vision-language capabilities!
If you have any questions about CL-MoE, please open an issue and leave us a comment.