Author: NevroHelios
Last Updated: 11 June 2025
This project compares video (MELD) and image (IFEED) emotion recognition performance through:
- Modality-specific baseline models
- Cross-dataset transfer learning experiments
- Character-wise analysis (Friends TV cast)
```bash
# MELD: download from the official source
git clone https://github.com/declare-lab/MELD.git
mkdir -p data
mv MELD/data/MELD.Raw data/meld_raw
```
```bash
# IFEED: request access from the original paper authors, then fetch the archive
wget researchlab2.iiit.ac.in/ifeed/IFEED_170x140_v3.tar.gz
mkdir -p data/ifeed_raw
tar -xzf IFEED_170x140_v3.tar.gz -C data/ifeed_raw
```
```bash
conda create -n emotion python=3.12
conda activate emotion
pip install -r requirements.txt
```
| Model | Val Accuracy | F1-Score | Inference Speed |
|---|---|---|---|
| CNN-LSTM (Baseline) | 63.2% | 0.61 | 87 ms/video |
| Custom 3D-CNN | 67.8% | 0.65 | 104 ms/video |
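A minimal sketch of one way to measure these three metrics together; `model` and `loader` are placeholders, not names from this repo, and weighted F1 is an assumption about the averaging scheme:

```python
# Hypothetical benchmarking helper: reports accuracy, weighted F1, and ms/video.
import time
import torch
from sklearn.metrics import accuracy_score, f1_score

@torch.no_grad()
def benchmark(model, loader, device="cuda"):
    model.eval().to(device)
    preds, labels = [], []
    elapsed, n_clips = 0.0, 0
    for videos, targets in loader:
        videos = videos.to(device)
        start = time.perf_counter()
        logits = model(videos)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        elapsed += time.perf_counter() - start
        n_clips += videos.size(0)
        preds.extend(logits.argmax(dim=1).cpu().tolist())
        labels.extend(targets.tolist())
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average="weighted")
    return acc, f1, 1000 * elapsed / n_clips  # ms per video
```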
- MELD (Video, Text, Audio)
  - Dataloader for MELD (video & text)
  - Vision model class (pretrained `r3d_18`)
  - Text model class (pretrained `bert`)
  - Audio model class (Conformer, non-pretrained)
  - Multimodal dataloader (combine modalities)
  - FusionModel (text + audio + video); see the sketch after this list
    - Integrate vision encoder (`r3d_18`)
    - Integrate text encoder (`bert`)
    - Integrate audio encoder (`conformer`)
    - Attention-based fusion layer
  - Train & benchmark pretrained models
  - Evaluate need for custom models (if pretrained underperforms)
  - Benchmark on MELD (accuracy, F1, speed)
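A minimal sketch of how the FusionModel could wire together, assuming torchvision's `r3d_18`, Hugging Face `bert-base-uncased`, and torchaudio's `Conformer`; the dimensions, hyperparameters, and pooled-token fusion are illustrative assumptions, not the repo's final design:

```python
# Hypothetical FusionModel sketch: names, dims, and the attention-fusion
# design are assumptions, not this repo's final code.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights
from torchaudio.models import Conformer
from transformers import BertModel

class FusionModel(nn.Module):
    def __init__(self, num_classes=7, d_model=256):
        super().__init__()
        # Vision: pretrained r3d_18, classifier head replaced (512-d features)
        self.vision = r3d_18(weights=R3D_18_Weights.DEFAULT)
        self.vision.fc = nn.Identity()
        self.vision_proj = nn.Linear(512, d_model)
        # Text: pretrained BERT (768-d pooled output)
        self.text = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(768, d_model)
        # Audio: non-pretrained Conformer over 80-dim log-mel features
        self.audio = Conformer(input_dim=80, num_heads=4, ffn_dim=256,
                               num_layers=4, depthwise_conv_kernel_size=31)
        self.audio_proj = nn.Linear(80, d_model)
        # Attention-based fusion: self-attention over the 3 modality tokens
        self.fusion = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                                 batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, video, input_ids, attention_mask, mel, mel_lengths):
        # video: (B, 3, T, H, W); mel: (B, T_a, 80); mel_lengths: (B,)
        v = self.vision_proj(self.vision(video))                # (B, d_model)
        t = self.text_proj(self.text(input_ids=input_ids,
                                     attention_mask=attention_mask).pooler_output)
        a, _ = self.audio(mel, mel_lengths)                     # (B, T_a, 80)
        a = self.audio_proj(a.mean(dim=1))                      # (B, d_model)
        tokens = torch.stack([v, t, a], dim=1)                  # (B, 3, d_model)
        fused = self.fusion(tokens).mean(dim=1)                 # (B, d_model)
        return self.classifier(fused)
```

Treating the three modality embeddings as a short token sequence lets a standard self-attention layer learn cross-modal weighting without a bespoke fusion module.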
- IFEED (Images)
  - Dataloader for IFEED (170x140 px images)
  - Vision model class (ResNet-50 baseline, pretrained); see the sketch after this list
  - Training & inference pipeline
  - Benchmark on IFEED
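A short sketch of the planned ResNet-50 baseline, assuming torchvision pretrained weights and a 7-class label set (substitute the actual IFEED class count); resizing the 170x140 crops to 224x224 matches the pretrained input size:

```python
# Hypothetical IFEED baseline: pretrained ResNet-50 with a swapped head.
# NUM_CLASSES and the transform values are assumptions, not repo code.
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 7  # assumed; use the actual IFEED label set

# Upsample 170x140 crops to ResNet's expected 224x224 input
ifeed_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
```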
- Deployment
  - Web demo using Gradio; see the sketch after this list
  - SaaS API endpoint (FastAPI)
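A minimal Gradio demo sketch; `predict_emotion` is a placeholder for the trained model's inference function, and the dummy return value only illustrates the `{label: score}` format that `gr.Label` expects:

```python
# Hypothetical Gradio demo; replace predict_emotion with real model inference.
import gradio as gr

def predict_emotion(image):
    # Run the image through the trained model and return {label: probability}
    return {"neutral": 0.7, "joy": 0.3}  # dummy output for illustration

demo = gr.Interface(
    fn=predict_emotion,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=3),
    title="Emotion Recognition Demo",
)

if __name__ == "__main__":
    demo.launch()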
| Metric | MELD Target | IFEED Target |
|---|---|---|
| Accuracy | 72% | 85% |
| F1-Score | 0.71 | 0.87 |
| Inference Speed | TBD | TBD |

Note: Dataset licenses apply; obtain MELD and IFEED through their original sources.