Ever wondered how a machine can feel your vibes? Well, here's your answer! This project dives into the world of audio signals and extracts the hidden emotions behind them. This README provides an overview of the code, datasets used, and key functionalities.
- Installation
- Datasets
- Data Preparation
- Feature Extraction
- Spectrogram Visualization
- Model Training
- Results
- Contributing
- License
To run this project, you need to install the required libraries. Use the following commands to set up your environment:
!apt-get update
!apt-get install -y libsndfile1
!pip install librosa seaborn tensorflow keras
The project uses the following datasets:
- RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song): Contains 24 professional actors (12 male, 12 female) vocalizing two lexically matched statements in a neutral North American accent.
- CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset): Contains 7,442 original clips from 91 actors.
- SAVEE (Surrey Audio-Visual Expressed Emotion): Contains recordings from 4 male actors expressing 7 different emotions.
- TESS (Toronto Emotional Speech Set): Contains 200 target words spoken in the carrier phrase "Say the word _" by two actresses.
The data preparation involves reading the audio files from each dataset and extracting relevant information such as file paths and emotion labels. The code performs the following steps (a minimal loading sketch follows the list):
- Mount Google Drive: To access the datasets stored in Google Drive.
- Load RAVDESS Dataset: Extracts file paths and emotions, and stores them in a DataFrame.
- Load CREMA-D Dataset: Extracts file paths and emotions, and stores them in a DataFrame.
- Load SAVEE Dataset: Extracts file paths and emotions, and stores them in a DataFrame.
- Load TESS Dataset: Extracts file paths and emotions, and stores them in a DataFrame.
- Combine All Datasets: Combines the DataFrames from all datasets into a single DataFrame.
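As a rough illustration, the sketch below loads the RAVDESS files and maps the emotion code embedded in each filename to a label. The dataset path and column names are assumptions, not the project's actual configuration; the other three datasets are parsed the same way before everything is concatenated.

```python
import os
import pandas as pd

# Hypothetical dataset location; adjust to wherever RAVDESS lives in your Drive.
RAVDESS_PATH = "/content/drive/MyDrive/datasets/RAVDESS/"

# RAVDESS filenames encode metadata as dash-separated fields,
# e.g. "03-01-06-01-02-01-12.wav"; the third field is the emotion code.
ravdess_emotions = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fear", "07": "disgust", "08": "surprise",
}

paths, emotions = [], []
for root, _, files in os.walk(RAVDESS_PATH):
    for name in files:
        if name.endswith(".wav"):
            code = name.split("-")[2]  # emotion code field
            paths.append(os.path.join(root, name))
            emotions.append(ravdess_emotions.get(code, "unknown"))

ravdess_df = pd.DataFrame({"path": paths, "emotion": emotions})

# The other datasets encode the emotion in their file or folder names and are
# parsed similarly, then combined into a single DataFrame, e.g.:
# data_df = pd.concat([ravdess_df, crema_df, savee_df, tess_df], ignore_index=True)
```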
Various audio features are extracted to represent the audio signals (a librosa-based extraction sketch follows the list). The features include:
- RMS Energy: Root Mean Square energy of the audio signal.
- Zero Crossing Rate (ZCR): The rate at which the signal changes sign.
- Band Energy Ratio (BER): The ratio of energy in different frequency bands.
- Spectral Centroid: The center of mass of the spectrum.
- Spectral Bandwidth: The spread of the spectrum around its spectral centroid.
- Mel-Frequency Cepstral Coefficients (MFCCs): A representation of the short-term power spectrum of sound.
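The sketch below shows one way these features could be computed with librosa. The frame sizes, the 2 kHz split frequency for the band energy ratio, and the number of MFCCs are illustrative assumptions rather than the project's exact settings; librosa has no built-in band energy ratio, so it is derived from the STFT here.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_fft=2048, hop_length=512, split_freq=2000):
    """Return a dict of frame-wise features for one audio file (illustrative settings)."""
    y, sr = librosa.load(path, sr=sr)

    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop_length)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop_length)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)[0]
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)

    # Band energy ratio: per-frame energy below `split_freq` Hz divided by the
    # energy above it, computed from the power spectrogram.
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2
    split_bin = int(np.floor(split_freq * n_fft / sr))
    ber = power[:split_bin].sum(axis=0) / (power[split_bin:].sum(axis=0) + 1e-10)

    return {"rms": rms, "zcr": zcr, "ber": ber,
            "centroid": centroid, "bandwidth": bandwidth, "mfcc": mfcc}
```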
The project includes code for visualizing the spectrograms of audio signals (a plotting sketch follows the list). This includes:
- Magnitude Spectrum: The magnitude of the Fourier Transform of the signal.
- Spectrogram: A visual representation of the spectrum of frequencies of the signal as it varies with time.
- Log-Amplitude Spectrogram: The logarithm of the amplitude of the spectrogram.
- Mel Spectrogram: A spectrogram where the frequencies are converted to the Mel scale.
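A minimal plotting sketch, assuming librosa and matplotlib and a placeholder file name ("sample.wav"), showing how a log-amplitude spectrogram and a mel spectrogram could be displayed side by side:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("sample.wav")  # any audio file from the combined DataFrame

# Log-amplitude spectrogram from the short-time Fourier transform.
stft = librosa.stft(y)
log_spec = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Mel spectrogram, converted to decibels for plotting.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
librosa.display.specshow(log_spec, sr=sr, x_axis="time", y_axis="log", ax=axes[0])
axes[0].set_title("Log-amplitude spectrogram")
librosa.display.specshow(log_mel, sr=sr, x_axis="time", y_axis="mel", ax=axes[1])
axes[1].set_title("Mel spectrogram")
plt.tight_layout()
plt.show()
```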
The project utilizes several machine learning models for emotion recognition from audio signals (a minimal Keras sketch follows the list). The models include:
- Convolutional Neural Networks (CNNs): For extracting spatial features from spectrograms.
- Long Short-Term Memory (LSTM): For capturing temporal dependencies in the audio signals.
- GRU (Gated Recurrent Unit): An alternative to LSTM for capturing temporal dependencies.
The models are built using Keras and TensorFlow.
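The sketch below is one plausible Keras architecture combining the building blocks listed above (1D convolutions followed by an LSTM). The input shape, layer sizes, and number of classes are assumptions, not the project's exact model.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical input: 13 MFCC coefficients per frame over 228 frames, 8 emotion classes.
N_FRAMES, N_MFCC, N_CLASSES = 228, 13, 8

model = models.Sequential([
    layers.Input(shape=(N_FRAMES, N_MFCC)),
    # 1D convolutions pick up local spectral patterns across frames.
    layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
    layers.MaxPooling1D(pool_size=2),
    # A recurrent layer models temporal dependencies across the sequence.
    layers.LSTM(128),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Swapping `layers.LSTM(128)` for `layers.GRU(128)` gives the GRU variant with no other changes.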
The project evaluates model performance with a confusion matrix and a classification report, which summarize accuracy, precision, recall, and F1-score for each emotion class.
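A brief evaluation sketch, assuming a trained model plus hypothetical X_test, y_test (one-hot encoded), and class_names variables, using scikit-learn for the metrics and seaborn for the confusion-matrix heatmap:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix

# y_test is one-hot encoded; take the argmax to recover class indices.
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

print(classification_report(y_true, y_pred, target_names=class_names))

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
```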
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License. See the LICENSE file for more details.