In this project, we designed a system that takes a spoken sentence and segments it into words 🗣️✂️.
The system detects when each word starts and ends, without knowing in advance how many words there are — only assuming a small silence gap between words.
Additionally, we created a program to play each detected word separately.
Finally, the system estimates the average pitch (fundamental frequency) of the speaker.
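As an illustration of the pitch step, librosa (listed in the dependencies below) ships the pYIN tracker; whether the project uses pYIN or another method is not stated here, and the frequency bounds below are assumptions:

```python
import librosa
import numpy as np

def estimate_average_pitch(wav_path):
    """Estimate the speaker's average fundamental frequency (Hz) with pYIN."""
    y, sr = librosa.load(wav_path, sr=None)
    # pYIN returns a per-frame f0 track plus a voicing decision per frame.
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),  # ~65 Hz, below typical male speech
        fmax=librosa.note_to_hz("C6"),  # ~1047 Hz, above typical female speech
        sr=sr,
    )
    # Average only over voiced frames; unvoiced frames are NaN.
    return float(np.nanmean(f0[voiced_flag]))
```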
The following classifiers were trained and evaluated:
- Least Squares (LSQ)
- Support Vector Machine (SVM)
- Multilayer Perceptron (MLP), a three-layer neural network
- Recurrent Neural Network (RNN)
- Programming Language: Python 3.12.4 🐍
- ❗ No CNNs, no web services, no transfer learning allowed.
- Deliverables: PDF documentation, source code (source2023.zip), auxiliary files (auxiliary2023.zip).
The system does binary classification:
✅ Speech (foreground) vs ❌ Non-speech (background).
Main Steps:
- Extract Mel spectrograms 🎶 from sliding windows of the audio.
- Classify each window as speech or non-speech.
- Apply a median filter to smooth out small errors 🧹.
- Find the boundaries between words based on the cleaned-up predictions 🧩.
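A minimal sketch of the post-processing stage, assuming per-frame 0/1 predictions and SciPy's `medfilt` (SciPy is not in the dependency list below, so treat it as an assumption); the hop duration and kernel size are illustrative:

```python
import numpy as np
from scipy.signal import medfilt

def predictions_to_segments(frame_preds, hop_s=0.010, kernel=9):
    """Turn noisy per-frame speech/non-speech labels into (start, end) times."""
    # Median filter removes isolated misclassified frames.
    smooth = medfilt(frame_preds.astype(int), kernel_size=kernel)
    # A word boundary is wherever the smoothed label changes value;
    # padding with zeros guarantees the changes come in rise/fall pairs.
    changes = np.flatnonzero(np.diff(np.pad(smooth, 1)))
    starts, ends = changes[0::2], changes[1::2]
    # Convert frame indices to seconds; each pair delimits one word.
    return [(s * hop_s, e * hop_s) for s, e in zip(starts, ends)]
```

Each `(start, end)` pair can then be sliced out of the waveform (e.g. `audio[int(s * sr):int(e * sr)]`) to play the detected words one at a time.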
LSQ & SVM:
- Simple models that output a continuous value.
- We then threshold that value to get binary speech/non-speech predictions (sketched below).
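For the least-squares model this can be a closed-form fit followed by a 0.5 threshold. A minimal numpy sketch; the feature matrix `X` (flattened Mel-spectrogram windows) and 0/1 labels `y` are assumed inputs, and the bias column and threshold value are illustrative choices:

```python
import numpy as np

def train_lsq(X, y):
    """Fit linear weights by ordinary least squares (with a bias column)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias term
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_lsq(w, X, threshold=0.5):
    """Continuous score -> binary speech (1) / non-speech (0) decision."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > threshold).astype(int)
```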
MLP:
- Two hidden layers: 128 and 64 neurons with ReLU activation ⚡.
- Output layer: a single neuron with sigmoid activation 🧠.
- Trained with binary cross-entropy loss.
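Given the hyperparameters stated above (128- and 64-unit ReLU hidden layers, a sigmoid output, binary cross-entropy) plus the Dropout mentioned in the training notes further down, a Keras sketch might look like this; the input dimension and dropout rate are assumptions:

```python
from tensorflow.keras import layers, models

def build_mlp(input_dim):
    """Three-layer MLP: 128 -> 64 ReLU hidden units, one sigmoid output."""
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),               # regularization; rate assumed
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```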
RNN:
- Built using TensorFlow's SimpleRNN layers 🔁.
- Processes sequences of frames to capture temporal dynamics.
- Outputs one probability per time frame.
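A matching sketch with TensorFlow's SimpleRNN; outputting one probability per frame means `return_sequences=True` with a per-frame sigmoid head. The unit count and input shape are assumptions:

```python
from tensorflow.keras import layers, models

def build_rnn(n_frames, n_features):
    """SimpleRNN over a window of frames, one speech probability per frame."""
    model = models.Sequential([
        layers.Input(shape=(n_frames, n_features)),
        layers.SimpleRNN(64, return_sequences=True),  # keep per-frame outputs
        layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```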
Foreground (Speech) 🗣️:
- Common Voice Corpus Delta Segment 18.0 (as of 6/19/2024) 📚.
Background (Noise / Non-speech) 🔇:
- ESC-50 dataset (Harvard Dataverse) 🎧.
(Selected only ~150 folders to keep things manageable.)
- 🧑‍🏫 MLP: trained with Dropout layers to prevent overfitting.
- 🛡️ SVM: trained with LinearSVC from scikit-learn (see the sketch after this list).
- 🔢 LSQ: trained using simple matrix operations.
- 🔁 RNN: trained using SimpleRNN layers to model sequence data.
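For the SVM, `LinearSVC` (named above) exposes a signed decision score rather than a probability, so its `predict()` effectively thresholds at zero. A minimal sketch; the `StandardScaler` preprocessing step is an assumption:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def train_svm(X, y):
    """Linear SVM on scaled Mel features; predict() already thresholds at 0."""
    clf = make_pipeline(StandardScaler(), LinearSVC())
    clf.fit(X, y)
    return clf
```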
- 🎯 Tested on three WAV files: 5 seconds, 10 seconds, 20 seconds.
- 📜 Each test file has:
  - a `.txt` file with ground-truth words
  - a `.json` file with ground-truth timestamps
Testing Process:
- Load the test WAV file.
- Extract Mel spectrogram features.
- Predict using all models.
- Post-process with median filtering.
- Detect speech segments.
- Compare predictions to ground-truth annotations 📝.
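A sketch of the final comparison step. The `.json` schema is not specified here, so this assumes a list of `{"word", "start", "end"}` entries and counts a ground-truth word as matched when some predicted segment's midpoint falls inside its interval:

```python
import json

def score_segments(pred_segments, json_path):
    """Fraction of ground-truth words matched by a predicted segment."""
    with open(json_path) as f:
        truth = json.load(f)  # assumed: [{"word": ..., "start": ..., "end": ...}]
    hits = 0
    for gt in truth:
        for start, end in pred_segments:
            mid = (start + end) / 2
            if gt["start"] <= mid <= gt["end"]:
                hits += 1
                break
    return hits / len(truth)
```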
| 📄 File | 📜 Description |
| --- | --- |
| `train.py` | Training script for all models |
| `test.py` | Testing and evaluation script |
Dependencies:
- `os` 🗂️: File operations
- `numpy` ➗: Math operations
- `json` 📄: Handling annotation files
- `librosa` 🎶: Audio processing
- `joblib` 💾: Model saving/loading
- `scikit-learn` 📚: ML algorithms (SVM, preprocessing)
- `tensorflow.keras` 🤖: Neural networks (MLP, RNN)
`train.py` key functions (sketched below):
- `load_train_audio_clips(limit=None)`: Load training audio.
- `extract_features(audio_clip)`: Get Mel spectrograms.
- `pad_features(features, expected_frames)`: Pad/truncate features.
- Train and save models (MLP, SVM, LSQ, RNN).
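The function names and signatures above come from the project; the bodies below are a hedged reconstruction, with the sampling rate and Mel-band count as illustrative assumptions:

```python
import librosa
import numpy as np

def extract_features(audio_clip, sr=16000, n_mels=64):
    """Log-scaled Mel spectrogram of shape (n_mels, n_frames)."""
    mel = librosa.feature.melspectrogram(y=audio_clip, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

def pad_features(features, expected_frames):
    """Pad (with the minimum dB value) or truncate to a fixed frame count."""
    n_mels, n_frames = features.shape
    if n_frames >= expected_frames:
        return features[:, :expected_frames]
    pad = np.full((n_mels, expected_frames - n_frames), features.min())
    return np.concatenate([features, pad], axis=1)
```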
`test.py` flow:
- Load audio and ground-truth annotations.
- Predict frame-by-frame speech probability.
- Smooth with a median filter.
- Detect segments and compare results.
This project built a speech segmentation system that works without prior word count knowledge.
It uses traditional machine learning and simple RNNs, avoiding heavy neural-network models such as CNNs and any external APIs 🌐🚫.