Prof. Srinibas Rana* Debargha Mitra Roy Bikash Shaw Suprio Kundu
Jalpaiguri Government Engineering College
Recently there has been great interest in Transformers, not only in NLP but also in Computer Vision (CV). We ask whether Transformers can be used for face recognition by incorporating EfficientNet into ViT, and whether they outperform CNNs. We therefore investigate the performance of Transformer models on face recognition. The models are trained on a large-scale face recognition database, CASIA-WebFace, and evaluated on several mainstream benchmarks, including the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, and AgeDB databases. We demonstrate that Transformer models achieve performance comparable to CNNs with a similar number of parameters and MACs. The Face Transformer is mainly based on the ViT (Vision Transformer) architecture. Here we investigate whether transfer learning and fine-tuning with EfficientNet, merged into ViT, yields better results.
Face recognition has achieved remarkable progress in recent years, but challenges remain in robustness, efficiency, and scalability. Transformers have emerged as powerful models for many vision tasks, yet applying them directly to face recognition is hindered by computational cost and potential overfitting. EfficientNets, on the other hand, offer a balance of accuracy and efficiency in convolutional neural networks. In this work, we rethink face Transformers by integrating EfficientNet with ViT: a hybrid architecture that combines the strengths of both, employing EfficientNet as the feature-extraction backbone to produce informative, compact features while maintaining computational efficiency. Our findings demonstrate that the proposed hybrid architecture surpasses existing methods in face recognition performance while remaining computationally efficient. It paves the way for robust, efficient, and scalable face recognition systems with diverse applications, ranging from security and access control to personalized user experiences and social media.
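As an illustration of the idea (not the exact implementation in this repository), a convolutional backbone can feed its feature map to a Transformer encoder as tokens. In the sketch below the small conv stem is a stand-in for EfficientNet's feature extractor, and positional embeddings are omitted for brevity; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class HybridEfficientViT(nn.Module):
    """Sketch of a CNN-backbone + ViT hybrid: spatial positions of the
    backbone's feature map become the tokens of a Transformer encoder."""
    def __init__(self, embed_dim=128, depth=2, heads=4, num_features=512):
        super().__init__()
        # Stand-in backbone; real code would use EfficientNet's feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_features)  # face embedding

    def forward(self, x):
        f = self.backbone(x)                    # (B, C, H, W)
        tokens = f.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])             # embedding from the [CLS] token

model = HybridEfficientViT()
emb = model(torch.randn(2, 3, 112, 112))
print(emb.shape)  # torch.Size([2, 512])
```

The key design choice is that the Transformer never sees raw pixels: the backbone reduces a 112x112 image to a much shorter token sequence, which keeps self-attention affordable.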
- To learn a representation of face images that is invariant to variations in lighting, pose, and expression.
- To achieve state-of-the-art results on face recognition benchmarks by fine-tuning with EfficientNet and introducing the model into ViT.
- To be robust to variations in input image quality, evaluated on the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, and AgeDB databases.
- To be efficient in terms of computational cost and memory.
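The verification benchmarks above score face pairs by comparing embeddings. A minimal sketch of threshold-based cosine verification (the 0.35 threshold is illustrative, not a value from this repository; in practice it is tuned per benchmark):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(emb1, emb2, threshold=0.35):
    """Declare 'same identity' when similarity exceeds the threshold."""
    return cosine_similarity(emb1, emb2) > threshold

rng = np.random.default_rng(0)
e1 = rng.normal(size=512)
print(verify(e1, e1))  # True: identical embeddings have cosine similarity 1.0
```

Because embeddings are compared by angle rather than magnitude, the model only needs to map same-identity faces to nearby directions in embedding space, which is exactly what margin-based training heads such as CosFace encourage.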
This code is mainly adapted from Vision Transformer, DeiT, and face.evoLVe. In addition to PyTorch and torchvision, install vit_pytorch by Phil Wang, efficientnet_pytorch by Luke Melas-Kyriazi, and the timm package by Ross Wightman. We sincerely appreciate their contributions.
All needed packages are listed in requirements.txt. Simply install them all with:

pip install -r requirements.txt

Files of the vit_pytorch folder:
.
├── __init__.py
├── vit.py
├── vit_face.py
└── vits_face.py
Files of the util folder:
.
├── __init__.py
├── test.py
├── utils.py
└── verification.py
- You can download the training database, CASIA-WebFace (version: casia-webface), and put it in folder `Data`.

| Dataset | Baidu Netdisk | Password | Google Drive / Other |
|---|---|---|---|
| ms1m-retinaface | LINK | 4ouw | LINK |
| CASIA-WebFace | LINK | | LINK, LINK |
| UMDFace | LINK | | LINK |
| VGG2 | LINK | | LINK |
| MS1M-IBUG | LINK | | |
| MS1M-ArcFace | LINK | | LINK |
| MS1M-RetinaFace | LINK | 8eb3 | LINK |
| Asian-Celeb | LINK | | |
| Glint-Mini | LINK | 10m5 | |
| Glint360K | LINK | o3az | |
| DeepGlint | LINK | | |
| WebFace260M | LINK | | |
| IMDB-Face | | | |
| Celeb500k | | | |
| MegaFace | LINK | 5f8m | LINK |
| DigiFace-1M | LINK | | LINK |

- You can download the testing databases as follows and put them in folder `eval`.

| Dataset | Baidu Netdisk | Password | Google Drive |
|---|---|---|---|
| LFW | LINK | dfj0 | LINK |
| SLLFW | LINK | l1z6 | LINK |
| CALFW | LINK | vvqe | LINK |
| CPLFW | LINK | jyp9 | LINK |
| TALFW | LINK | izrg | LINK |
| CFP_FP | LINK | 4fem | LINK |
| AGEDB | LINK | rlqf | LINK |

(The download links refer to Insightface.)
- EfficientNet + ViT

CUDA_VISIBLE_DEVICES='0' python3 -u train.py -b <batch_size> -w 0 -d casia -n <network_name> -head CosFace --outdir <path_to_model> --warmup-epochs 0 --lr 3e-5 -r <path_to_model>
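The `-head CosFace` option selects the large-margin cosine loss. As a hedged sketch of that head (the `s` and `m` defaults below are common choices in the literature, not necessarily this repository's values), CosFace subtracts a margin `m` from the target-class cosine before scaling by `s`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceHead(nn.Module):
    """Minimal CosFace (large-margin cosine loss) sketch:
    target-class logit   = s * (cos(theta_y) - m)
    other-class logits   = s *  cos(theta_j)"""
    def __init__(self, embed_dim, num_classes, s=64.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosines between L2-normalized embeddings and class weight vectors
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin only at the ground-truth class position
        margin = F.one_hot(labels, cosine.size(1)).float() * self.m
        return self.s * (cosine - margin)

head = CosFaceHead(embed_dim=512, num_classes=10)
labels = torch.tensor([0, 1, 2, 3])
logits = head(torch.randn(4, 512), labels)
loss = F.cross_entropy(logits, labels)
print(logits.shape)  # torch.Size([4, 10])
```

The margin forces the true class to win by a fixed cosine gap, which tightens intra-class clusters in the embedding space used later for verification.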
| Model | Google Drive |
|---|---|
| ViT-P8S8 | LINK |
| EfficientNet + ViT | LINK |
The content of the property file for the casia-webface dataset is as follows:
python3 test.py --model <path_to_model> --network <network_name> --batch_size <batch_size> --target <eval_data>

This is the research repository of Face Transformer for Recognition, forked from zhongyy/Face-Transformer.
If you have any questions, please create an issue on this repository or contact us at debarghamitraroy@gmail.com.

