
A project using Python, PyTorch, ViT, and EfficientNet to rethink the traditional Vision Transformer architecture for improved face recognition performance and computational efficiency, incorporating EfficientNet into ViT to overcome the drawbacks of the traditional architecture.


Face Transformer - Rethinking model incorporating EfficientNet into ViT

Prof. Srinibas Rana*, Debargha Mitra Roy, Bikash Shaw, Suprio Kundu

Jalpaiguri Government Engineering College

Publication Implementation


Recently there has been great interest in Transformers, not only in NLP but also in Computer Vision (CV). We ask whether Transformers can be used for face recognition by incorporating EfficientNet into ViT, and whether they outperform CNNs. We therefore investigate the performance of Transformer models on face recognition. The models are trained on a large-scale face recognition database, CASIA-WebFace, and evaluated on several mainstream benchmarks, including the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP & AGEDB databases. We demonstrate that Transformer models achieve performance comparable to CNNs with a similar number of parameters and MACs. The Face Transformer mainly uses the ViT (Vision Transformer) architecture. We then examine whether transfer learning and fine-tuning with EfficientNet, merged into ViT, yields better results.


Abstract:

Face recognition has achieved remarkable progress in recent years, but challenges remain in robustness, efficiency, and scalability. Transformers have emerged as powerful models for various vision tasks, but their direct application to face recognition is hampered by computational cost and potential overfitting. EfficientNets, on the other hand, offer a balance of accuracy and efficiency in convolutional neural networks. In this work, we propose a novel approach that rethinks face transformers by integrating EfficientNets with ViT: a hybrid architecture that leverages the strengths of both, aiming for robust and efficient face recognition. We employ an EfficientNet as the backbone for feature extraction, producing informative and compact features while maintaining computational efficiency. Our findings demonstrate that the proposed hybrid architecture significantly surpasses existing methods in face recognition performance while maintaining excellent computational efficiency, paving the way for robust, efficient, and scalable face recognition systems with diverse applications, from security and access control to personalized user experiences and social media.

Objectives

  • To learn a representation of face images that is invariant to variations in lighting, pose, and expression.

  • To achieve state-of-the-art results on face recognition benchmarks by fine-tuning with EfficientNet and integrating it into ViT.

  • To be robust to variations in input image quality, evaluated on the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP & AGEDB databases.

  • To be efficient in terms of computational cost and memory.

Model Architecture

(Model architecture diagram)

Usage Instructions

1. Preparation

This code is mainly adapted from Vision Transformer, DeiT & Face Evolve. In addition to PyTorch and torchvision, install vit_pytorch by Phil Wang, efficientnet_pytorch by Luke Melas-Kyriazi & the timm package by Ross Wightman. We sincerely appreciate their contributions.

All required packages are listed in requirements.txt. Install them with:

pip install -r requirements.txt

Files of the vit_pytorch folder:

.
├── __init__.py
├── vit.py
├── vit_face.py
└── vits_face.py

Files of the util folder:

.
├── __init__.py
├── test.py
├── utils.py
└── verification.py

2. Databases

3. Train Models

  • EfficientNet + ViT

    CUDA_VISIBLE_DEVICES='0' python3 -u train.py -b <batch_size> -w 0 -d casia -n <network_name> -head CosFace --outdir <path_to_model> --warmup-epochs 0 --lr 3e-5 -r <path_to_model>
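The `-head CosFace` flag selects a large-margin cosine loss head. As a hedged illustration of what such a head computes — the scale `s` and margin `m` below are typical defaults from the CosFace paper, not necessarily this repository's values, and the class name is hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceHead(nn.Module):
    """Sketch of a CosFace (large-margin cosine) classification head."""
    def __init__(self, embed_dim, num_classes, s=64.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalised embeddings and class weights
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin m from the target-class cosine only,
        # forcing the model to separate classes by at least that margin
        margin = F.one_hot(labels, cosine.size(1)).float() * self.m
        logits = self.s * (cosine - margin)
        return F.cross_entropy(logits, labels)
```

At test time the head is discarded; only the normalised embeddings are compared.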

4. Pretrained Models and Test Models (on LFW, SLLFW, CALFW, CPLFW, TALFW, CFP_FP, AGEDB)

You can download the following models:

| Model | Google Drive |
| --- | --- |
| ViT-P8S8 | LINK |
| EfficientNet + ViT | LINK |

You can test the models as follows.

The content of the property file for the CASIA-WebFace dataset is: `10572,112,112` (number of classes, image height, image width).

python3 test.py --model <path_to_model> --network <network_name> --batch_size <batch_size> --target <eval_data>
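Evaluation on the listed benchmarks follows the standard pair-verification protocol: cosine similarities between embeddings of face pairs are compared against a threshold chosen to maximise accuracy. A simplified sketch of that thresholding step — the function name is hypothetical, and util/verification.py presumably implements the full k-fold version:

```python
import numpy as np

def best_threshold_accuracy(scores, labels, num_thresholds=400):
    """Pick the similarity threshold that maximises pair accuracy.

    scores: cosine similarities for each face pair (numpy array)
    labels: True if the pair is the same identity, else False
    Simplified illustration of the LFW-style protocol (no k-fold splits).
    """
    thresholds = np.linspace(-1.0, 1.0, num_thresholds)
    # Accuracy at each candidate threshold: predict "same" when score > t
    accs = [float(np.mean((scores > t) == labels)) for t in thresholds]
    i = int(np.argmax(accs))
    return thresholds[i], accs[i]
```

In the full protocol the threshold is selected on nine folds and accuracy is reported on the held-out tenth, averaged across folds.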

References

This implementation builds on the research paper Face Transformer for Recognition; the repository is forked from zhongyy/Face-Transformer.

Contact

If you have any questions, please create an issue on this repository or contact debarghamitraroy@gmail.com.
