This repository is an official implementation of CSCL: "Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation" (CVPR 2025).
- [2025/6/9] The camera-ready version is released.
- [2025/6/9] Code and weights are released.
- [2025/2/27] CSCL is accepted to CVPR 2025 🎉🎉.
conda create -n CSCL python=3.8
conda activate CSCL
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/cu111/torch_stable.html
pip install -r code/MultiModal-DeepFake-main/requirements.txt
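To quickly confirm the environment is working, here is a minimal sanity check (a convenience sketch, not part of the original codebase):

```python
# check_env.py -- sanity check for the installed environment (convenience sketch)
import torch
import torchvision

print("torch:", torch.__version__)               # expected: 1.9.0+cu111
print("torchvision:", torchvision.__version__)   # expected: 0.10.0+cu111
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```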
Here are the pre-trained models:
Download meter_clip16_224_roberta_pretrain.ckpt: link
Download ViT-B-16.pt: link
Download roberta-base: link
Download Datasets: link
The folder structure (a quick layout check is sketched after the tree):
./
├── code
│   └── MultiModal-Deepfake (this github repo)
│       ├── configs
│       │   └── ...
│       ├── dataset
│       │   └── ...
│       ├── models
│       │   └── ...
│       ├── ...
│       ├── roberta-base
│       ├── ViT-B-16.pt
│       └── meter_clip16_224_roberta_pretrain.ckpt
└── datasets
    └── DGM4
        ├── manipulation
        ├── origin
        └── metadata
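Before training, you may want to confirm the downloads ended up in the right places. A small sketch (the relative paths simply mirror the tree above; adjust them if you run the script from a different directory):

```python
# check_layout.py -- convenience sketch to verify the expected folder layout
from pathlib import Path

ROOT = Path(".")  # the top-level folder shown in the tree above
expected = [
    "code/MultiModal-Deepfake/roberta-base",
    "code/MultiModal-Deepfake/ViT-B-16.pt",
    "code/MultiModal-Deepfake/meter_clip16_224_roberta_pretrain.ckpt",
    "datasets/DGM4/manipulation",
    "datasets/DGM4/origin",
    "datasets/DGM4/metadata",
]
for rel in expected:
    status = "ok" if (ROOT / rel).exists() else "MISSING"
    print(f"{status:7s} {rel}")
```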
Our pre-trained CSCL model: link (96.34 AUC, 92.48 mAP, 84.07 IoUm, 76.62 F1). We use the train and val sets for training and the test set for evaluation.
Make a folder ./results/CSCL/ and put the pre-trained model in it.
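For example, assuming the downloaded checkpoint is named CSCL.pth (a hypothetical filename; use whatever the actual file is called), you can create the folder and sanity-load the weights like this:

```python
# prepare_ckpt.py -- sketch: create ./results/CSCL/ and sanity-load the checkpoint
import os
import torch

os.makedirs("./results/CSCL/", exist_ok=True)

# "CSCL.pth" is a hypothetical filename; replace it with the actual name of the
# downloaded pre-trained CSCL checkpoint.
ckpt_path = "./results/CSCL/CSCL.pth"
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    keys = list(state.keys()) if isinstance(state, dict) else []
    print("checkpoint loaded; top-level keys:", keys[:5])
else:
    print("checkpoint not found at", ckpt_path)
```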
Training
sh train.sh
Evaluation
sh test.sh
Visualization
Use the visualize_res function in utils.py (refer to test.py for details).
Evaluation on text or image subset
Refer to line 136 in test.py.
We thank these great works and open-source codebases: DGM4 and METER.
If you find our work useful, please give this repo a star and cite it as:
@inproceedings{li2025unleashing,
  title={Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation},
  author={Li, Yiheng and Yang, Yang and Tan, Zichang and Liu, Huan and Chen, Weihua and Zhou, Xu and Lei, Zhen},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={9242--9252},
  year={2025}
}