Kun-Hsiang Lin¹, Yu-Wen Tseng¹, Kang-Yang Huang¹, Jhih-Ciang Wu², Wen-Huang Cheng¹
¹ National Taiwan University, ² National Taiwan Normal University

Overview: InstructFLIP is a unified instruction-tuned framework that leverages vision-language models (VLMs) and a meta-domain strategy to generalize face anti-spoofing (FAS) efficiently, without redundant cross-domain training.
- We propose InstructFLIP, a novel instruction-tuned VLM framework for FAS that integrates textual supervision to enhance semantic understanding of spoofing cues.
- We design a content-style decoupling mechanism that explicitly separates spoof-related (content) and spoof-irrelevant (style) information, improving generalization to unseen domains.
- We introduce a meta-domain learning strategy to eliminate training redundancy in cross-domain settings by utilizing diverse image-instruction pairs sampled from a structured meta-domain.
- Experimental results demonstrate that InstructFLIP surpasses state-of-the-art methods across multiple FAS benchmarks. Language-guided supervision lets it capture spoof-related patterns effectively while substantially reducing training overhead, which makes it more applicable to real-world scenarios.
We recommend running the code in Docker to ensure a consistent environment across machines.
Install git-lfs first, then clone the repository:
git clone https://github.com/kunkunlin1221/InstructFLIP.git
cd InstructFLIP/docker
bash build.sh
cd ..
bash docker/run.sh scripts/train_instruct_flip.sh
bash docker/run.sh ablations/loss/$Ablation_Settings.sh # Ablation_Settings: The script name in ablations/loss
bash docker/run.sh ablations/data/$Ablation_Settings.sh # Ablation_Settings: The script name in ablations/data
bash docker/run.sh ablations/branch/train_intruct_flip_$Ablation_Settings.sh # Ablation_Settings: The script name in ablations/branch
bash docker/run.sh ablations/data/train_mc.sh
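The ablation commands above launch one script at a time. To preview or run every script in one ablation directory, a small loop may help. The following is a minimal sketch and not part of the repository: the function name `run_ablations` and the `DRY_RUN` variable are hypothetical, and it assumes you are in the repository root and that every `.sh` file in the given directory is meant to be launched through `docker/run.sh`.

```shell
# Hypothetical helper (not part of the repository): launch every ablation
# script found in a directory through docker/run.sh.
# DRY_RUN is an assumed variable: by default (DRY_RUN=1) the commands are
# only printed; set DRY_RUN=0 to actually execute them.
run_ablations() {
  for script in "$1"/*.sh; do
    # If the directory has no .sh files the pattern stays unexpanded; skip it.
    [ -e "$script" ] || continue
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "bash docker/run.sh $script"
    else
      bash docker/run.sh "$script"
    fi
  done
}
```

For example, `run_ablations ablations/loss` prints the command for each loss ablation, and `DRY_RUN=0 run_ablations ablations/loss` runs them one after another. Printing by default is a deliberate choice, since each script presumably starts a full training run.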
If you use this work in your research or applications, please cite it with the following BibTeX entry:
@InProceedings{lin2025instructflip,
author = {Kun-Hsiang Lin and Yu-Wen Tseng and Kang-Yang Huang and Jhih-Ciang Wu and Wen-Huang Cheng},
title = {InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
year = {2025},
organization = {ACM},
}