A simple yet effective bidirectional LSTM model trained to recognize spoken digits using spectrograms. This project serves as an end-to-end learning exercise in audio preprocessing, feature extraction, and sequence modeling.
- 🔊 Audio Classification: Predicts digits (0–9) from short spoken audio clips.
- 📈 High Accuracy: Achieves strong validation performance (96% F1) with minimal preprocessing.
- 🧠 Deep Learning: Utilizes a Bidirectional LSTM model trained on spectrogram features.
- 🔁 Augmentation: Applies time-stretching and noise injection to improve generalization (see the sketch after this list).
- 📊 Visualization: Includes confusion matrix, spectrogram plot, and training curves.
- 🎓 Educational Purpose: Built as a foundational step into speech and audio modeling.
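A minimal augmentation sketch using librosa and NumPy; the function name `augment`, the stretch range, and the noise level are illustrative assumptions rather than the project's exact settings:

```python
import numpy as np
import librosa

def augment(y, stretch_range=(0.9, 1.1), noise_level=0.005):
    """Randomly time-stretch a waveform and inject Gaussian noise."""
    # Time-stretch by a random rate (rate is keyword-only in librosa >= 0.10)
    rate = np.random.uniform(*stretch_range)
    y = librosa.effects.time_stretch(y, rate=rate)
    # Add low-amplitude Gaussian noise
    y = y + noise_level * np.random.randn(len(y))
    return y.astype(np.float32)
```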
This project uses the Free Spoken Digit Dataset (FSDD), which contains:
- Recordings of digits (0–9)
- Multiple speakers
- Clean and well-labeled audio, ideal for quick experimentation
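As a quick illustration, here is one way to load an FSDD clip and compute log-mel spectrogram features with librosa; the file path, `n_mels=40`, `n_fft=512`, and `hop_length=128` are assumed values, not necessarily the project's configuration:

```python
import numpy as np
import librosa

# FSDD clips are recorded at 8 kHz; the file path below is illustrative.
y, sr = librosa.load("recordings/7_jackson_0.wav", sr=8000)

# Log-mel spectrogram: (n_mels, time), transposed to (time, n_mels)
# so each time step becomes one LSTM input frame.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                     n_fft=512, hop_length=128)
log_mel = librosa.power_to_db(mel, ref=np.max).T  # shape: (time_steps, 40)
```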
- The model performs well on validation data with minimal overfitting.
- The confusion matrix (plotted as sketched below) shows strong classification accuracy, especially for clearly spoken digits.
- Achieved a 96% F1 score.
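A sketch of how such a confusion matrix can be plotted using only the packages listed below; `y_true` and `y_pred` are random placeholders standing in for the real validation labels and model predictions:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

# In practice these come from the validation set; random placeholders here.
y_true = np.random.randint(0, 10, size=300)
y_pred = np.random.randint(0, 10, size=300)

cm = tf.math.confusion_matrix(y_true, y_pred, num_classes=10).numpy()
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=range(10), yticklabels=range(10))
plt.xlabel("Predicted digit")
plt.ylabel("True digit")
plt.title("Validation confusion matrix")
plt.show()
```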
This project shows that even simple models can be powerful when combined with clean datasets and good preprocessing. Bidirectional LSTMs capture temporal features well, and augmentation helps further boost performance. The approach provides a solid foundation for more complex speech-based applications.
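To make the architecture concrete, here is a minimal sketch of a Bidirectional LSTM classifier of the kind described above, assuming 40-dimensional log-mel input frames; the layer sizes and dropout rate are illustrative, not the project's exact configuration:

```python
import tensorflow as tf

# Input: variable-length sequences of 40-dim log-mel frames (assumed size).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 40)),
    tf.keras.layers.Masking(mask_value=0.0),          # ignore zero-padded frames
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation="softmax"),  # one class per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```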
- Key Packages:
  - Python 3.10
  - tensorflow==2.19.0
  - tf-keras==2.19.0
  - keras==3.9.2
  - pandas==1.4.2
  - numpy==1.26.4
  - matplotlib==3.10.0
  - seaborn==0.13.2
  - librosa==0.11.0