This project focuses on developing a unified multimodal framework for Action Unit (AU) detection and expression recognition using a Transformer-based approach. The goal is to effectively fuse static (image, text) and dynamic (audio) features to improve the accuracy and robustness of facial expression analysis systems.
- Propose a unified multimodal framework for AU detection and expression recognition.
- Incorporate basic expressions, Action Units (AUs), and Valence-Arousal (VA) features into the model (a sketch of the corresponding task heads follows this list).
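Below is a minimal sketch (not the project's actual code) of how a shared fused representation could feed the three prediction targets named above: expression classification, multi-label AU detection, and VA regression. The feature dimension, class counts, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Hypothetical task heads over a shared fused feature vector."""
    def __init__(self, feat_dim=512, num_expressions=8, num_aus=12):
        super().__init__()
        self.expr_head = nn.Linear(feat_dim, num_expressions)  # basic expression logits
        self.au_head = nn.Linear(feat_dim, num_aus)             # per-AU presence logits (multi-label)
        self.va_head = nn.Linear(feat_dim, 2)                   # valence and arousal regression

    def forward(self, fused):  # fused: (batch, feat_dim)
        return {
            "expression": self.expr_head(fused),                 # train with cross-entropy
            "action_units": self.au_head(fused),                 # train with binary cross-entropy with logits
            "valence_arousal": torch.tanh(self.va_head(fused)),  # constrain VA to [-1, 1]
        }
```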
Facial Expression Analysis is a critical area in computer vision and human-computer interaction, with applications ranging from emotion recognition systems to virtual agents and affective computing.
The project will utilize the following modalities for feature extraction and fusion (a minimal fusion sketch follows the list):
- Images (static features)
- Text (such as transcripts or textual descriptions associated with the facial expressions)
- Audio (dynamic features)
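The following is a minimal sketch of Transformer-based fusion, assuming each modality has already been encoded into a fixed-size feature vector (e.g., by an image backbone, a text encoder, and an audio encoder). It treats each modality as one token and runs a standard Transformer encoder over the token sequence; all names and sizes are illustrative assumptions, not the project's final design.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Hypothetical cross-modal fusion over image, text, and audio feature tokens."""
    def __init__(self, dim=512, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # learned embeddings marking which token belongs to which modality
        self.modality_embed = nn.Parameter(torch.zeros(3, dim))

    def forward(self, image_feat, text_feat, audio_feat):
        # each input: (batch, dim); stack into a 3-token sequence
        tokens = torch.stack([image_feat, text_feat, audio_feat], dim=1)
        tokens = tokens + self.modality_embed   # add modality-type embeddings
        fused = self.encoder(tokens)            # cross-modal self-attention over the 3 tokens
        return fused.mean(dim=1)                # pooled representation for the task heads
```

The choice of per-modality encoders, the number of fusion tokens (e.g., one per frame for the dynamic audio stream), and the pooling strategy would be settled during experimentation.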
- Zhang, Wei, et al. "Transformer-based multimodal information fusion for facial expression analysis." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- Kim, Jun-Hwa, Namho Kim, and Chee Sun Won. "Multi-modal facial expression recognition with transformer-based fusion networks and dynamic sampling." arXiv preprint arXiv:2303.08419, 2023.