This project implements a speech emotion recognition system using deep learning to classify emotions from audio recordings. The system leverages a Convolutional Neural Network (CNN) architecture to identify 8 distinct emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised.
The project compares several machine learning approaches, including Decision Trees, Random Forests, and Multi-Layer Perceptrons (MLP), against a CNN, which achieves the best validation accuracy of the four (~79.4%).
The project uses the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset, which contains audio recordings of actors expressing different emotions.
- Training data: Actors 1–19 (`Actor_01` to `Actor_19`)
- Test data: Actors 20–24 (`Actor_20` to `Actor_24`)
- Audio files are in .wav format
- File naming convention: `03-01-[emotion]-[intensity]-[statement]-[repetition]-[actor].wav`
- Emotion codes:
  - `01`: neutral
  - `02`: calm
  - `03`: happy
  - `04`: sad
  - `05`: angry
  - `06`: fearful
  - `07`: disgust
  - `08`: surprised
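The naming convention above can be decoded with a few lines of standard-library Python. This is an illustrative sketch rather than the project's own code: the helper names `parse_ravdess_name` and `split_for_actor` are hypothetical, and it assumes every file name has exactly seven hyphen-separated fields, as in the pattern shown.

```python
from pathlib import Path

# Emotion codes taken from the third field of the file name.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_name(path):
    """Return (emotion label, actor id) for a RAVDESS-style file name."""
    fields = Path(path).stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected file name: {path}")
    return EMOTIONS[fields[2]], int(fields[6])

def split_for_actor(actor_id):
    """Actors 1-19 are training data, actors 20-24 are test data."""
    return "train" if actor_id <= 19 else "test"

emotion, actor = parse_ravdess_name("Actor_21/03-01-05-01-02-01-21.wav")
print(emotion, actor, split_for_actor(actor))
```

Splitting by actor rather than by clip keeps every test speaker unseen during training, which is what the per-actor analysis relies on.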
The final model is a Convolutional Neural Network (CNN) with depthwise separable convolution blocks.
- Input: 3-channel mel-spectrogram representation of audio
- Initial Convolution: 7×7 kernels with BatchNorm and ReLU activation
- Block 1: Depthwise + Pointwise convolutions (64→128 channels)
- Block 2: Depthwise + Pointwise convolutions (128→256 channels)
- Block 3: Depthwise + Pointwise convolutions (256→512 channels)
- Global Average Pooling: Reduces spatial dimensions to 1×1
- Fully Connected Layers: 512 → 256 → 8 units
- Dropout: 0.5 rate for regularization
- Output: 8 classes corresponding to the emotions
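The layer list above could be written in PyTorch roughly as follows. This is a sketch under stated assumptions, not the notebook's implementation: the class names `SeparableBlock` and `EmotionCNN`, the 3×3 depthwise kernel, the stride of 2 in the first convolution, the max pooling after each block, and the position of the dropout layer are all guesses, since the README does not specify them.

```python
import torch
from torch import nn

class SeparableBlock(nn.Module):
    """Depthwise 3x3 conv followed by a pointwise 1x1 conv."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            # groups=in_ch applies one spatial filter per input channel.
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # The 1x1 conv mixes channels and changes the channel count.
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # assumption: downsample after each block
        )

    def forward(self, x):
        return self.block(x)

class EmotionCNN(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        # Initial 7x7 convolution with BatchNorm and ReLU (stride is assumed).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(
            SeparableBlock(64, 128),
            SeparableBlock(128, 256),
            SeparableBlock(256, 512),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling to 1x1
        self.head = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),  # assumption: dropout sits between the FC layers
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.pool(self.blocks(self.stem(x)))
        return self.head(torch.flatten(x, 1))

model = EmotionCNN().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 3, 64, 128))  # batch of 2 dummy spectrograms
print(logits.shape)
```

A depthwise separable block replaces one dense k×k convolution with a per-channel k×k convolution plus a 1×1 convolution, which cuts the parameter count substantially at these channel widths.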
- Load audio using `torchaudio`
- Convert stereo to mono if necessary
- Extract mel-spectrogram features:
  - `n_fft=1024`
  - `hop_length=512`
  - `n_mels=64`
- Convert to decibel scale
- Pad or crop spectrograms to fixed dimensions
- Replicate to 3 channels for CNN input
- Data Split: 66:34 train:validation stratified split
- Batch Size: 32
- Loss Function: Cross-Entropy Loss
- Optimizer: Adam with initial learning rate 0.001
- Learning Rate Schedule: ReduceLROnPlateau (factor 0.5, patience 5)
- Training Duration: 70 epochs
- Model Selection: Based on best validation loss
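The loss, optimizer, scheduler, and model-selection settings above can be wired together as in the sketch below. It is not the notebook's training loop: a tiny linear model and random tensors stand in for the CNN and the spectrogram data, only 3 epochs are run instead of 70, and the stratified 66:34 split is not shown.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in model and random data so the sketch runs on its own.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 8))
x_train, y_train = torch.randn(64, 3, 8, 8), torch.randint(0, 8, (64,))
x_val, y_val = torch.randn(32, 3, 8, 8), torch.randint(0, 8, (32,))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)

best_val_loss, best_state = float("inf"), None
for epoch in range(3):  # the README reports 70 epochs
    model.train()
    for i in range(0, len(x_train), 32):  # batch size 32
        optimizer.zero_grad()
        loss = criterion(model(x_train[i:i + 32]), y_train[i:i + 32])
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()
    # Halves the LR once validation loss has stalled for more than `patience` epochs.
    scheduler.step(val_loss)

    if val_loss < best_val_loss:  # keep the weights with the best validation loss
        best_val_loss = val_loss
        best_state = {k: v.clone() for k, v in model.state_dict().items()}

model.load_state_dict(best_state)
print(f"best validation loss: {best_val_loss:.4f}")
```

Selecting on validation loss rather than on the final epoch guards against keeping weights from a later, overfitted epoch.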
- Accuracy: 62.33%
- Per-Emotion Performance: Detailed precision, recall, and F1 scores for each emotion
- Confusion Matrix: Visualization showing common misclassifications
- Per-Actor Analysis: Shows performance variation due to speaking styles
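The per-emotion and per-actor figures listed above can be computed from three parallel lists: actor ids, true labels, and predicted labels. The sketch below uses only the standard library; the function names are hypothetical and the data are toy values, not results from this project. scikit-learn, which is among the dependencies, offers `classification_report` and `confusion_matrix` for the same purpose.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} from two label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

def per_actor_accuracy(actors, y_true, y_pred):
    """Return {actor id: accuracy} over the clips of each actor."""
    correct, total = Counter(), Counter()
    for a, t, p in zip(actors, y_true, y_pred):
        total[a] += 1
        correct[a] += int(t == p)
    return {a: correct[a] / total[a] for a in total}

# Toy predictions, not results from the project.
actors = [20, 20, 21, 21]
y_true = ["angry", "calm", "angry", "neutral"]
y_pred = ["angry", "neutral", "angry", "neutral"]
print(per_class_metrics(y_true, y_pred))
print(per_actor_accuracy(actors, y_true, y_pred))
```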
- `Initial_Models.ipynb`: Decision Trees, Random Forests, and MLP models
- `cnn-final-ee708-project.ipynb`: Final CNN implementation and training
- `Evaluation_Test_Data.ipynb`: Test dataset evaluation and metrics
- `confusion_matrix.png`: Confusion matrix visualization
- `test_results.csv`: Classification results on test data
Model | Validation Accuracy |
---|---|
Decision Tree | ~42% |
Random Forest | ~64% |
MLP | ~50% |
CNN | ~79.4% ✅ |
- Python 3.8+
- PyTorch 1.8+
- torchaudio
- librosa
- scikit-learn
- pandas
- numpy
- matplotlib
- seaborn
- tqdm
Instructions to train the model can be found in `cnn-final-ee708-project.ipynb`.
Run `Evaluation_Test_Data.ipynb` to compute the final metrics and confusion matrix.
This project demonstrates the effectiveness of CNNs for speech emotion recognition. The final model's architecture with depthwise separable convolutions provides a good balance between model complexity and performance.
Some emotions (like "angry" and "surprised") are recognized more accurately than others (like "neutral" and "calm"), which aligns with human perception, where stronger emotions are often easier to identify.
- Data augmentation techniques
- Exploring attention mechanisms
- Incorporating transformer-based architectures
- Fine-tuning hyperparameters
- Using pretrained audio feature extractors
Name | Roll no. | Email Id |
---|---|---|
Aritra Ambudh Dutta | 230191 | aritraad23@iitk.ac.in |
Archita Goyal | 230187 | architag23@iitk.ac.in |
Harshpreet Kaur | 230464 | harshpreet23@iitk.ac.in |
Suyash Kapoor | 231066 | suyashk23@iitk.ac.in |
Saksham Verma | 230899 | sakshamv@iitk.ac.in |
This project was completed as the course project for EE708, offered in Semester 2024-25/II at the Indian Institute of Technology (IIT) Kanpur, under Prof. Rajesh M. Hegde.