Detect human emotions from raw .wav files using deep learning and pretrained speech models.
Built using Hugging Face Transformers, PyTorch, and Torchaudio.
This project classifies human emotions such as Happy, Sad, Angry, and Neutral directly from raw audio (`.wav`) files. It uses a Hugging Face pretrained model fine-tuned on the RAVDESS emotional speech dataset. The entire process, from preprocessing to training and inference, can be run with simple Python scripts.
- **Raw Audio Input**: You provide a mono `.wav` file recorded at any sample rate.
- **Preprocessing**: The audio is normalized and resampled to 16 kHz using `torchaudio`.
- **Feature Extraction**: The pretrained HuBERT model (from Hugging Face) extracts deep audio embeddings.
- **Classifier Head**: A dense neural network is trained on top of these embeddings using labeled emotion data (RAVDESS).
- **Prediction**: The model outputs the most probable emotion class for the given voice input (see the sketch below).
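A minimal end-to-end sketch of this pipeline using the pretrained checkpoint listed below (`superb/hubert-large-superb-er`); the project's actual scripts may differ in the details:

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

MODEL_ID = "superb/hubert-large-superb-er"  # pretrained HuBERT emotion-recognition checkpoint

# Load the feature extractor and the classification model
extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = AutoModelForAudioClassification.from_pretrained(MODEL_ID)

# 1. Raw audio input: a .wav file recorded at any sample rate
waveform, sample_rate = torchaudio.load("sample_output/sample.wav")

# 2. Preprocessing: downmix to mono and resample to 16 kHz
waveform = waveform.mean(dim=0)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# 3-5. Feature extraction, classifier head, and prediction
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted emotion:", model.config.id2label[int(logits.argmax(dim=-1))])
```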
```
Emotion_detection_by_wave_formate/
├── audio_waveform_viewer.py   # Displays waveform of audio files
├── preprocess_ravdess.py      # Prepares RAVDESS dataset
├── train_emotion_model.py     # Trains the classifier on HuBERT embeddings
├── predict_emotion.py         # Predicts emotion for new audio input
├── model/
│   └── final_emotion_model/   # Trained model is saved here
├── dataset/                   # Raw and processed audio files
├── sample_output/             # Stores screenshots and predictions
├── requirements.txt           # All required dependencies
└── README.md                  # You are here!
```
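For a rough idea of what `audio_waveform_viewer.py` (listed above) does, a waveform can be displayed with `torchaudio` and `matplotlib`; this is a sketch, not the script's exact code:

```python
import torch
import torchaudio
import matplotlib.pyplot as plt

# Load the audio and build a time axis in seconds
waveform, sample_rate = torchaudio.load("sample_output/sample.wav")
time = torch.arange(waveform.shape[1]) / sample_rate

# Plot the first channel's amplitude against time
plt.plot(time, waveform[0])
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Audio waveform")
plt.show()
```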
```bash
git clone https://github.com/dhanesh-j/Emotion_detection_by_wave_formate.git
cd Emotion_detection_by_wave_formate
pip install -r requirements.txt
```
Ensure the RAVDESS dataset is available under `dataset/RAVDESS/`, then preprocess it:
```bash
python preprocess_ravdess.py --input_dir dataset/RAVDESS/ --output_dir dataset/processed/ --sample_rate 16000
```
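RAVDESS encodes the emotion in the third dash-separated field of each filename (e.g. `03-01-05-01-02-01-12.wav` is an angry utterance). A sketch of how preprocessing might map files to the four labels used here; the actual `preprocess_ravdess.py` logic may differ:

```python
from pathlib import Path
from typing import Optional

# RAVDESS emotion codes (third dash-separated field of the filename)
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
TARGET_CLASSES = {"happy", "sad", "angry", "neutral"}  # classes this project keeps

def label_for(wav_path: Path) -> Optional[str]:
    """Return the emotion label for a RAVDESS file, or None if it is not a target class."""
    code = wav_path.stem.split("-")[2]
    label = RAVDESS_EMOTIONS.get(code)
    return label if label in TARGET_CLASSES else None

for wav in sorted(Path("dataset/RAVDESS").rglob("*.wav")):
    print(wav.name, label_for(wav))
```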
Train the classifier on the processed data:

```bash
python train_emotion_model.py --data_dir dataset/processed/ --pretrained_model superb/hubert-large-superb-er --output_dir model/final_emotion_model/
```
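Conceptually, training keeps the pretrained HuBERT backbone (frozen or lightly fine-tuned) and fits a dense classifier head on its pooled embeddings. A minimal sketch of such a head, assuming a frozen backbone; class and variable names are illustrative, not the project's actual code:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class EmotionHead(nn.Module):
    """Dense classifier on top of frozen HuBERT embeddings (illustrative)."""
    def __init__(self, backbone_id="superb/hubert-large-superb-er", num_classes=4):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_id)
        self.backbone.requires_grad_(False)  # freeze the pretrained weights
        hidden = self.backbone.config.hidden_size
        self.head = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, input_values):
        frames = self.backbone(input_values).last_hidden_state  # (batch, time, hidden)
        return self.head(frames.mean(dim=1))                    # pool over time, then classify

model = EmotionHead()
logits = model(torch.randn(2, 16000))                  # two dummy one-second clips at 16 kHz
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3]))
loss.backward()                                        # only the head receives gradients
```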
Predict the emotion of a new audio file:

```bash
python predict_emotion.py --model_dir model/final_emotion_model/ --input_audio sample_output/sample.wav
```
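For a quick check without the project's scripts, the same kind of prediction can be run through the Hugging Face `pipeline` API (shown with the base checkpoint; point `model=` at `model/final_emotion_model/` once training is done):

```python
from transformers import pipeline

# Audio-classification pipeline; swap the model path for model/final_emotion_model/ after training
classifier = pipeline("audio-classification", model="superb/hubert-large-superb-er")
print(classifier("sample_output/sample.wav"))  # e.g. [{'label': 'hap', 'score': 0.87}, ...]
```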
To record a clip and predict its emotion in one step, run `record_and_predict.py`.
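A sketch of what such a record-and-predict step could look like, assuming the `sounddevice` and `soundfile` packages for microphone capture (neither is listed in `requirements.txt`):

```python
import sounddevice as sd
import soundfile as sf
from transformers import pipeline

DURATION, SAMPLE_RATE = 4, 16000  # record four seconds of mono audio at 16 kHz

# Capture from the default microphone and save the clip
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()
sf.write("sample_output/live_recording.wav", audio, SAMPLE_RATE)

# Classify the recording (swap in model/final_emotion_model/ after training)
classifier = pipeline("audio-classification", model="superb/hubert-large-superb-er")
print(classifier("sample_output/live_recording.wav"))
```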
Here’s a visual summary of the model's performance across different emotions:
The graph shows high precision and recall for emotions such as *Sad* and *Neutral*, with consistent accuracy across all categories. The screenshots in `sample_output/` show the emotion detected from live-recorded audio, the model's confusion matrix, and the real-time waveform of the recorded audio plotted against time in seconds.

- Backbone: `superb/hubert-large-superb-er`
- Dataset: RAVDESS
- Input: Raw waveform
- Output: Emotion class (`happy`, `sad`, `angry`, `neutral`)
- ✅ No speech-to-text required
- ✅ Lightweight training using pretrained embeddings
- ✅ Easily extendable with other datasets
- ✅ Compatible with Gradio or Streamlit for UI (see the Gradio sketch below)
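For example, a minimal Gradio wrapper could look like this (Gradio is not in `requirements.txt` and must be installed separately; swap the model path for the trained model directory):

```python
import gradio as gr
from transformers import pipeline

# Load the classifier once at startup; use model/final_emotion_model/ after training
classifier = pipeline("audio-classification", model="superb/hubert-large-superb-er")

def predict(audio_path):
    # Return a {label: score} dict so Gradio renders it as a label widget
    return {r["label"]: r["score"] for r in classifier(audio_path)}

demo = gr.Interface(fn=predict, inputs=gr.Audio(type="filepath"), outputs=gr.Label())
demo.launch()
```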
- Python 3.8+
- torch
- torchaudio
- transformers
- librosa
- matplotlib
Install them using:
```bash
pip install -r requirements.txt
```
- 🎤 Real-time microphone support
- 🌐 Multilingual emotion detection
- 📊 Interactive dashboard with waveform + prediction
- 🧪 Confusion matrix and training metrics visualization
Dhanesh J
Third-year Computer Science student passionate about AI, voice recognition, and applied ML. Built as a third-year mini project.