Caibo Feng, Mar 2025
caiboVon@gmail.com
Maybe I won, maybe I failed. Whatever, I can only keep going.㊗️ I am currently a master's student whose research focuses on Deepfake Detection. I am seeking job opportunities. If you are interested in my work, please feel free to contact me.🤝
v0.1 - 15 Mar 2025
v0.5 - 30 Mar 2025
Deepfake techniques pose a significant risk to face recognition systems, but existing Deepfake Detection (DfD) methods often overfit to the specific forgery patterns in their training samples, and thus perform poorly on unseen samples. On paper, DfD is a simple binary classification task, but it is not as simple as it looks. Below is a pair of real/fake images from the popular FF++ training set. Existing works usually adopt a pre-trained face detection model to crop the face region for the deepfake detector, since this preprocessing step makes the detector focus on learning the forgery clues around the face region (a minimal sketch of this step follows the figure).
| Real Face | Fake Face |
| --- | --- |
| ![]() | ![]() |
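For concreteness, here is a minimal sketch of this cropping step, assuming the facenet-pytorch MTCNN detector; `crop_face` and the `enlarge` ratio are hypothetical names standing in for the per-paper choices discussed later.

```python
# Minimal sketch of the common face-cropping preprocessing, assuming
# the facenet-pytorch MTCNN detector. The `enlarge` ratio is an
# illustrative knob: a bigger ratio keeps more background in the crop.
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(select_largest=True)  # keep the largest detected face

def crop_face(path: str, enlarge: float = 1.3) -> Image.Image:
    img = Image.open(path).convert("RGB")
    boxes, _ = detector.detect(img)     # boxes: [[x1, y1, x2, y2], ...]
    if boxes is None:
        return img                      # fall back to the full frame
    x1, y1, x2, y2 = boxes[0]
    # Enlarge the box around its center before cropping.
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * enlarge, (y2 - y1) * enlarge
    box = (max(cx - w / 2, 0), max(cy - h / 2, 0),
           min(cx + w / 2, img.width), min(cy + h / 2, img.height))
    return img.crop(box).resize((224, 224))
```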
Almost all existing works train their detectors on the popular FF++ dataset for fair comparison. However, this dataset was released back in 2019! The fake images there (see the figure above) are so easy to detect, even by human eyes.
On the other hand, the FF++ dataset provides fake images forged by only 4️⃣ forgery methods (DeepFakes, Face2Face, FaceSwap, and NeuralTextures), which means it covers only four forgery patterns. That is insufficient to train a practically useful detector today.
In order to publish papers, researchers still conduct their experiments almost exclusively on FF++. Personally speaking, I find this boring. This practice is harmful to developing a practical deepfake detector and holds the whole field back.
The authors of different papers crop the faces in the training and test datasets with different bounding boxes: some crop the face region with more background, while others crop tightly with little background. This choice (the `enlarge` ratio in the sketch above) affects the final performance of detectors with different backbones (CNN or ViT), and it also makes comparisons unfair.
AUC, AP, and EER are the most commonly used metrics in DfD papers, since they demonstrate the comprehensive performance of detectors. Nevertheless, the accuracy (ACC) of almost every method is low, which means these detectors still cannot handle samples from different domains and cannot be deployed in real-world scenarios.
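As a minimal sketch, these metrics can be computed with scikit-learn as follows; `labels` and `scores` are toy placeholders:

```python
# Toy sketch of the standard DfD metrics with scikit-learn.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

labels = np.array([0, 0, 1, 1])           # 0 = real, 1 = fake (toy example)
scores = np.array([0.1, 0.4, 0.35, 0.8])  # detector's fake probabilities

auc = roc_auc_score(labels, scores)
ap = average_precision_score(labels, scores)

# EER: the ROC operating point where FPR equals 1 - TPR (i.e., FNR).
fpr, tpr, _ = roc_curve(labels, scores)
eer = fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))]

# ACC needs a hard threshold (commonly 0.5), unlike the ranking metrics above.
acc = ((scores >= 0.5).astype(int) == labels).mean()
print(f"AUC={auc:.3f} AP={ap:.3f} EER={eer:.3f} ACC={acc:.3f}")
```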
When training detectors on the latest DF40 dataset, you will find that they perform worse on the relatively old test sets (e.g., DFDCP, DFDC, DFD, WDF). This means there is a large gap between the forgery patterns of old and new forgery methods. Then a question arises: is a single detector sufficient to handle all forgery samples?
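A minimal sketch of such a cross-dataset evaluation might look like this; `detector` and the `test_sets` dict are hypothetical placeholders, not an existing benchmark API:

```python
# Sketch of a cross-dataset evaluation loop: train on one dataset,
# then report AUC on each unseen test set to expose the domain gap.
from sklearn.metrics import roc_auc_score

def evaluate(detector, test_sets):
    """Report AUC per test set; detector(img) returns a fake probability."""
    results = {}
    for name, (images, labels) in test_sets.items():
        scores = [detector(img) for img in images]
        results[name] = roc_auc_score(labels, scores)
    return results

# After training on DF40, probe the older sets:
# evaluate(detector, {"DFDCP": ..., "DFDC": ..., "DFD": ..., "WDF": ...})
```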
Currently, CLIP-ViT has been found to be the best backbone for detectors. However, some works show that ViT recognizes real/fake images via high-level semantic information instead of low-level forgery clues. This high-level semantic information is difficult to explain, and human eyes cannot distinguish the real/fake images. Therefore, it is hard for us to annotate a text description for a fake face image, which means we cannot re-train a backbone like CLIP or LLaVA with image-text pairs.
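For reference, the common recipe is a frozen CLIP vision encoder with a lightweight binary head (a linear probe). Here is a minimal sketch using the HuggingFace transformers CLIP vision model; the head and the checkpoint choice are illustrative, not any specific paper's method:

```python
# Sketch of a CLIP-ViT linear-probe detector: frozen semantic features,
# trainable binary head. Checkpoint and head design are illustrative.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class CLIPDetector(nn.Module):
    def __init__(self, name: str = "openai/clip-vit-large-patch14"):
        super().__init__()
        self.backbone = CLIPVisionModel.from_pretrained(name)
        for p in self.backbone.parameters():  # freeze the CLIP features
            p.requires_grad = False
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(feats).squeeze(-1)  # real/fake logit per image

# detector = CLIPDetector()
# logits = detector(torch.randn(2, 3, 224, 224))  # preprocessed face crops
```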
iv. Model updating method to handle constantly incremental forgery patterns (like a planning agent)
TBD