This project assesses the validity of using eye-motion features as indicators of cognitive load and emotional state. Ocular features such as saccades and fixations, together with subjective questionnaires and task-performance measures, have been used to assess mental workload while the user performs the designed tasks. Physiological signals are acquired simultaneously, currently a high-speed image sequence of the user's face. Facial micro-expression intensity has been estimated using a ResNet-18 based model trained with knowledge distillation from a Masked Auto-Encoder, along with ocular features, while users were shown emotional stimuli.
The scheme has been validated with psychological tests such as the Visual Response Test (VRT), which induces mental fatigue, and the N-back test, which induces memory load. Correlation between the physiological signals (eye movement and blink) and the psychological test responses has been observed as mental workload changes. Moreover, the user's emotional state has been observed to correlate with eye-motion behaviour, and this has been validated by detecting activation of the corresponding facial Action Units based on the Facial Action Coding System (FACS).
Module | Name | Contents |
---|---|---|
1 | Eye Tracking | Eye detection, blink detection, eye motion features classification, screen gaze, emotion classification |
2 | Visual Response Test | Psychometric test game made using PyGame |
3 | Facial Expression Estimation | Deep learning model for facial Action Unit intensity estimation |
- Developed a face and facial-landmark detection pipeline for video, and performed pupil localization by radial inspection of image gradients. Developed CUDA kernels using Numba to accelerate execution by 300× (a CUDA kernel sketch follows this list).
- Alternatively, used the MediaPipe Face Landmark model to detect and track the iris, eye-corner, and eyelid coordinates.
- Performed blink detection, and corrected iris locations during blinks using cubic-spline interpolation (see the MediaPipe sketch after this list).
- Estimated screen gaze and gaze heatmaps using polynomial regression, and classified eye motion into saccades and fixations using iris-velocity and dispersion-based thresholds (see the gaze sketch after this list).
- Designed psychometric games, a Visual Response Test and an N-Back Test, in PyGame to induce different levels of mental workload (a minimal PyGame sketch follows below).
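A minimal sketch of the gradient-based pupil-localization step written as a Numba CUDA kernel. This is an illustrative reconstruction, not the project code: each candidate centre is scored by how well the unit displacement vectors from it align with the unit image gradients (a radial-gradient objective), and the function names, block/thread sizes, and brute-force search are assumptions.

```python
# Hypothetical sketch: gradient-based pupil localization accelerated with a Numba CUDA kernel.
# All names and launch parameters are illustrative; the real pipeline may differ.
import numpy as np
from numba import cuda


@cuda.jit
def pupil_objective_kernel(gx, gy, score):
    """Score each candidate centre by the mean squared dot product between the
    normalised displacement vectors (centre -> pixel) and the unit image gradients."""
    cy, cx = cuda.grid(2)
    h, w = gx.shape
    if cy >= h or cx >= w:
        return
    acc = 0.0
    for y in range(h):
        for x in range(w):
            dy = y - cy
            dx = x - cx
            norm = (dx * dx + dy * dy) ** 0.5
            if norm == 0.0:
                continue
            dot = (dx / norm) * gx[y, x] + (dy / norm) * gy[y, x]
            if dot > 0.0:              # keep only gradients pointing away from the centre
                acc += dot * dot
    score[cy, cx] = acc / (h * w)


def locate_pupil(eye_gray):
    """Return the (row, col) candidate centre that maximises the gradient objective."""
    gy, gx = np.gradient(eye_gray.astype(np.float64))
    mag = np.hypot(gx, gy) + 1e-9
    gx, gy = gx / mag, gy / mag                      # unit gradient field
    score = cuda.device_array(eye_gray.shape, dtype=np.float64)
    threads = (16, 16)
    blocks = ((eye_gray.shape[0] + 15) // 16, (eye_gray.shape[1] + 15) // 16)
    pupil_objective_kernel[blocks, threads](cuda.to_device(gx), cuda.to_device(gy), score)
    return np.unravel_index(np.argmax(score.copy_to_host()), eye_gray.shape)
```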
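A minimal sketch of the MediaPipe-based eye-feature extraction, blink detection, and cubic-spline correction described above, assuming the MediaPipe Face Mesh solution with iris refinement and SciPy. The landmark indices, the openness threshold, and the helper names are assumptions, not project constants.

```python
# Minimal sketch, assuming MediaPipe Face Mesh with iris refinement and SciPy.
# Landmark indices below (eye corners, eyelids, iris centre) follow common Face Mesh
# conventions but should be treated as assumptions.
import numpy as np
import cv2
import mediapipe as mp
from scipy.interpolate import CubicSpline

RIGHT_EYE = {"outer": 33, "inner": 133, "upper": 159, "lower": 145, "iris": 468}
EYE_OPENNESS_THRESHOLD = 0.20          # assumed blink threshold; tune per user / camera

face_mesh = mp.solutions.face_mesh.FaceMesh(max_num_faces=1, refine_landmarks=True)


def eye_features(frame_bgr):
    """Return (iris_xy, openness_ratio) in pixel coordinates, or None if no face."""
    h, w = frame_bgr.shape[:2]
    res = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not res.multi_face_landmarks:
        return None
    lm = res.multi_face_landmarks[0].landmark
    pt = lambda i: np.array([lm[i].x * w, lm[i].y * h])
    openness = np.linalg.norm(pt(RIGHT_EYE["upper"]) - pt(RIGHT_EYE["lower"])) / (
        np.linalg.norm(pt(RIGHT_EYE["outer"]) - pt(RIGHT_EYE["inner"])) + 1e-9)
    return pt(RIGHT_EYE["iris"]), openness


def repair_iris_track(times, iris_xy, openness):
    """Drop samples where the eye is closed (blink) and re-fill them with
    cubic-spline interpolation over the open-eye samples."""
    iris_xy = np.asarray(iris_xy, dtype=float)
    open_mask = np.asarray(openness) > EYE_OPENNESS_THRESHOLD
    t = np.asarray(times, dtype=float)
    for axis in range(2):
        spline = CubicSpline(t[open_mask], iris_xy[open_mask, axis])
        iris_xy[~open_mask, axis] = spline(t[~open_mask])
    return iris_xy, ~open_mask          # corrected track, blink mask
```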
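A short sketch of the gaze mapping and saccade/fixation classification, assuming scikit-learn for the polynomial regression and a simple velocity-threshold (I-VT style) rule. The polynomial degree and the velocity threshold are illustrative assumptions.

```python
# Illustrative sketch of the gaze-mapping and eye-motion classification steps.
# Polynomial degree and velocity threshold are assumptions, not project values.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

SACCADE_VELOCITY_THRESHOLD = 30.0      # px/s (or deg/s), assumed I-VT style cut-off


def fit_gaze_mapper(iris_xy, screen_xy, degree=2):
    """Fit a polynomial regression from calibration iris coordinates to screen coordinates."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(np.asarray(iris_xy), np.asarray(screen_xy))
    return model                        # model.predict(iris_xy) -> screen gaze points


def classify_fixations_saccades(times, iris_xy):
    """Velocity-threshold labelling of samples as 'fixation' or 'saccade'."""
    iris_xy = np.asarray(iris_xy, dtype=float)
    t = np.asarray(times, dtype=float)
    velocity = np.linalg.norm(np.diff(iris_xy, axis=0), axis=1) / np.diff(t)
    labels = np.where(velocity > SACCADE_VELOCITY_THRESHOLD, "saccade", "fixation")
    return np.append(labels, labels[-1])   # pad so labels align with samples
```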
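A minimal PyGame sketch of a Visual Response Test trial loop: a target appears after a random foreperiod and the reaction time to a key press is recorded. Window size, delays, and trial count are illustrative assumptions, not the actual test parameters.

```python
# Minimal sketch of a Visual Response Test loop in PyGame; all parameters are illustrative.
import random
import pygame

pygame.init()
screen = pygame.display.set_mode((800, 600))
clock = pygame.time.Clock()
reaction_times = []

for trial in range(10):                            # assumed number of trials
    screen.fill((0, 0, 0))
    pygame.display.flip()
    pygame.time.delay(random.randint(1000, 3000))  # random foreperiod in ms
    pygame.draw.circle(screen, (255, 0, 0), (400, 300), 40)
    pygame.display.flip()
    shown_at = pygame.time.get_ticks()

    waiting = True
    while waiting:
        for event in pygame.event.get():
            if event.type == pygame.KEYDOWN:       # any key counts as a response
                reaction_times.append(pygame.time.get_ticks() - shown_at)
                waiting = False
            elif event.type == pygame.QUIT:
                pygame.quit()
                raise SystemExit
        clock.tick(200)

print("mean reaction time (ms):", sum(reaction_times) / len(reaction_times))
pygame.quit()
```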
- Built a deep-learning model based on ResNet-18 by pre-training on large facial-expression datasets (AffectNet and EmotioNet) and fine-tuning with Action Unit intensity labels (DISFA dataset), with simultaneous knowledge distillation from a larger ViT (Vision Transformer) based Masked Auto-Encoder model, to estimate facial micro-expressions.
- The objective is to estimate facial emotion from expression in real time, using facial Action Unit (AU) intensities and the Facial Action Coding System (FACS).
- To accomplish this, a large-scale pre-trained network (Masked Auto-Encoder) is used as the teacher, and feature-wise knowledge distillation with task-specific fine-tuning is performed on a lightweight model (ResNet-18) to obtain facial Action Unit intensities in real time.
- Designed visual emotion stimuli to induce different emotions while simultaneously acquiring eye coordinates and face video, so that the eye-motion features and facial micro-expressions corresponding to the shown stimulus can be estimated.
- The training method has been adapted from Chang et al. (LibreFace).
- A ViT (Vision Transformer) based Masked Auto-Encoder (MAE), pre-trained in a self-supervised manner (masked-image reconstruction) on the EmotioNet dataset, is used to overcome the shortage of labelled training data. The encoder is then extracted, a linear classification layer is attached, and the model is further pre-trained on the large-scale face datasets AffectNet and FFHQ before finally being fine-tuned on the DISFA dataset for facial Action Unit intensity estimation.
- Since the MAE is a large model, feature-wise knowledge distillation is employed to transfer the teacher model's (MAE) knowledge to a lightweight student model (ResNet-18) for faster, real-time estimation.
- The ResNet-18 student, with a linear classification layer attached, is first pre-trained on the same AffectNet and FFHQ datasets and then fine-tuned on DISFA with simultaneous knowledge distillation from the teacher model for facial Action Unit intensity estimation.
- Using the facial Action Unit intensity values, AU activation is assessed, and the overall facial emotion is estimated based on FACS (Facial Action Coding System), which relates Action Units to emotions (see the mapping sketch below).
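A small sketch of this final mapping step, assuming commonly cited FACS/EMFACS prototypical AU combinations and a simple activation threshold. The threshold, the scoring rule, and the prototype table are illustrative assumptions; DISFA predicts only 12 AUs, so some prototype AUs may be unobserved in practice.

```python
# Hedged sketch: map estimated AU intensities to an emotion label via prototypical
# FACS/EMFACS AU combinations. Threshold and scoring rule are assumptions.
AU_ACTIVATION_THRESHOLD = 1.0          # DISFA intensities range 0-5; assumed cut-off

# Commonly cited prototypical AU sets for the basic emotions (EMFACS-style).
# Note: DISFA labels 12 AUs, so some prototype AUs (e.g. AU7, AU16, AU23) may be missing.
EMOTION_PROTOTYPES = {
    "happiness": {6, 12},
    "sadness":   {1, 4, 15},
    "surprise":  {1, 2, 5, 26},
    "fear":      {1, 2, 4, 5, 7, 20, 26},
    "anger":     {4, 5, 7, 23},
    "disgust":   {9, 15, 16},
}


def estimate_emotion(au_intensities):
    """au_intensities: dict {AU number: predicted intensity}. Returns the emotion
    whose prototype AUs have the largest fraction currently active."""
    active = {au for au, v in au_intensities.items() if v >= AU_ACTIVATION_THRESHOLD}
    scores = {emo: len(active & aus) / len(aus) for emo, aus in EMOTION_PROTOTYPES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"
```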
- Feature Matching Loss: an MSE loss between the hidden feature layers of the teacher and student models.
  $\mathcal{L}_{FM} = \left\|f_{T} - \mathbf{I}(f_{S})\right\|^{2}$
- KL Divergence Loss: between the teacher model's outputs for (i) the input face image and (ii) the student model's hidden features passed through the teacher's linear classification layer.
  $\mathcal{L}_{KLD} = \widehat{y}_{T} \log\left(\frac{\widehat{y}_{T}}{\widehat{y}_{S}}\right)$
- Task Loss: the training MSE loss for the student network.
  $\mathcal{L}_{Task} = \left\|\widehat{y}_{S} - y\right\|^{2}$
- Overall Loss:
  $\mathcal{L} = \mathcal{L}_{FM} + \alpha\mathcal{L}_{Task} + \beta\mathcal{L}_{KLD}$
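A PyTorch-style sketch of how the three loss terms above can be combined during student fine-tuning. The projection layer I(·), the softmax used to turn the AU outputs into distributions for the KL term, and the default weights are assumptions made for illustration; they may differ from the LibreFace implementation.

```python
# Sketch of the combined distillation objective, following the loss terms above.
# The projection I(.), the softmax before the KL term, and alpha/beta are assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(f_t, f_s, y_true, student_head, teacher_head, proj,
                      alpha=1.0, beta=1.0):
    """
    f_t: teacher (MAE encoder) features for the face image, shape (B, D_t)
    f_s: student (ResNet-18) features for the same image, shape (B, D_s)
    proj: linear layer I(.) mapping student features into the teacher feature space
    """
    # Feature matching: MSE between teacher features and projected student features.
    l_fm = F.mse_loss(proj(f_s), f_t)

    # Task loss: MSE between the student's predicted AU intensities and the labels.
    y_pred = student_head(f_s)
    l_task = F.mse_loss(y_pred, y_true)

    # KL term: teacher head applied to (i) teacher features and (ii) projected student
    # features; softmax turns both outputs into distributions for the KL divergence.
    with torch.no_grad():
        y_t = F.softmax(teacher_head(f_t), dim=-1)
    y_s = F.log_softmax(teacher_head(proj(f_s)), dim=-1)
    l_kld = F.kl_div(y_s, y_t, reduction="batchmean")

    return l_fm + alpha * l_task + beta * l_kld
```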
Performance on the DISFA dataset for the facial Action Unit intensity estimation task:
Method | PCC (↑) | MAE (↓) | MSE (↓) | Remarks |
---|---|---|---|---|
ResNet-18 | 0.518 | 0.278 | 0.352 | - |
ResNet-18 + Pre-Train | 0.614 | 0.236 | 0.260 | - |
ResNet-18 + FM Distill | 0.628 | 0.231 | 0.260 | Better performance and faster |
MAE + Pre-Train | 0.674 | 0.202 | 0.270 | Best performance, but heavy and slowest |
Dataset | Type | Size | Features |
---|---|---|---|
EmotioNet | Image | 975,000 | 8 Emotions |
AffectNet | Image | 450,000 | 16 Overall Emotions, 6 Basic Emotions |
DISFA | Video | 27 videos | 12 Action Units |
Saurabh Chatterjee
MTech, Signal Processing and Machine Learning
Indian Institute of Technology (IIT) Kharagpur
- D. Chang, Y. Yin, Z. Li, M. Tran, and M. Soleymani, "LibreFace: An Open-Source Toolkit for Deep Facial Expression Analysis," in 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2024, pp. 8190-8200, doi: 10.1109/WACV57701.2024.00802.