This project implements a full pipeline for preprocessing, modeling, and metric visualization for urban sound classification using PyTorch and the UrbanSound8K dataset. The pipeline handles audio files, applies data augmentation techniques, and converts the data into Mel spectrograms ready to feed into a CNN.
```
.
├── checkpoint/                  # Checkpoints with models, metrics, scheduler, and optimizer
│   └── train_and_val_metrics.png   # Plot of accuracy and loss
├── data analysis/               # Notebooks and scripts with data preprocessing analysis
├── src/                         # Source code
│   ├── inference.py             # Inference routine for trained models
│   ├── model.py                 # CNN architecture
│   ├── training.py              # Training loop, validation, early stopping
│   └── utils.py                 # Dataset class, preprocessing functions
├── UrbanSound8K/                # Dataset folder
├── ForPrediction.py             # Script for inference on new audio files
├── UrbanSound_Training.py       # Main training routine
└── README.md
```
The goal of this project is to develop a classifier for urban sounds using convolutional neural networks (CNNs) with PyTorch. Sounds are taken from the UrbanSound8K dataset, and the system can identify sounds such as car horns, dog barks, sirens, and more, based on Mel spectrograms. The pipeline is complete: from raw audio loading to CNN training and results visualization.
- 🎵 Reading `.wav` files using `torchaudio`
- 🔁 Transformation into `MelSpectrogram` with configurable parameters
- 🧪 Application of SpecAugment (time and frequency masking)
- 🔢 Normalization of spectrogram data
- 🏷️ Conversion to tensors and label pairing
Each sample is standardized to a fixed input size, making CNN training consistent.
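A minimal sketch of this preprocessing chain using `torchaudio` transforms (the sample rate, Mel band count, and 400-frame width are illustrative assumptions, not the project's exact parameters):

```python
import torch
import torch.nn.functional as F
import torchaudio
import torchaudio.transforms as T

SAMPLE_RATE = 22050   # assumed target sample rate
N_MELS = 64           # assumed number of Mel bands
MAX_FRAMES = 400      # assumed fixed spectrogram width

mel = T.MelSpectrogram(sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=512, n_mels=N_MELS)
to_db = T.AmplitudeToDB()
spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=8),   # SpecAugment: mask random Mel bands
    T.TimeMasking(time_mask_param=16),       # SpecAugment: mask random time steps
)

def preprocess(path: str, train: bool = True) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)                   # read the .wav file
    if sr != SAMPLE_RATE:
        waveform = T.Resample(sr, SAMPLE_RATE)(waveform)
    waveform = waveform.mean(dim=0, keepdim=True)          # mix down to mono
    spec = to_db(mel(waveform))                            # log-Mel spectrogram
    if train:
        spec = spec_augment(spec)                          # augmentation only in training
    spec = (spec - spec.mean()) / (spec.std() + 1e-6)      # per-sample normalization
    # pad or trim to a fixed number of frames so every sample has the same shape
    if spec.shape[-1] < MAX_FRAMES:
        spec = F.pad(spec, (0, MAX_FRAMES - spec.shape[-1]))
    else:
        spec = spec[..., :MAX_FRAMES]
    return spec  # shape: (1, N_MELS, MAX_FRAMES)
```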
Custom dataset based on `torch.utils.data.Dataset` and UrbanSound8K:
- Reads from `metadata/UrbanSound8K.csv`
- Uses the `fold` column to split training and validation sets
- Lazy loading of `.wav` files
- Spectrogram normalization and caching
- Conditional data augmentation only during training
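A sketch of what such a dataset class might look like, reusing the hypothetical `preprocess` and `spec_augment` helpers from the preprocessing sketch above (the `audio/fold{n}/` layout and metadata columns match UrbanSound8K; the rest is illustrative, not the project's exact code):

```python
import os

import pandas as pd
from torch.utils.data import Dataset

class UrbanSoundDataset(Dataset):
    """Lazily loads and caches UrbanSound8K clips listed in the metadata CSV."""

    def __init__(self, root: str, folds: list[int], train: bool = True):
        meta = pd.read_csv(os.path.join(root, "metadata", "UrbanSound8K.csv"))
        # the `fold` column drives the train/validation split
        self.meta = meta[meta["fold"].isin(folds)].reset_index(drop=True)
        self.root, self.train, self._cache = root, train, {}

    def __len__(self) -> int:
        return len(self.meta)

    def __getitem__(self, idx: int):
        row = self.meta.iloc[idx]
        if idx not in self._cache:
            path = os.path.join(self.root, "audio", f"fold{row['fold']}", row["slice_file_name"])
            # the .wav file is read only on first access (lazy loading), then cached
            self._cache[idx] = preprocess(path, train=False)
        spec = self._cache[idx]
        if self.train:
            spec = spec_augment(spec)  # augmentation applied only during training
        return spec, int(row["classID"])

# e.g., folds 1-9 for training, fold 10 for validation
train_set = UrbanSoundDataset("UrbanSound8K", folds=list(range(1, 10)), train=True)
val_set = UrbanSoundDataset("UrbanSound8K", folds=[10], train=False)
```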
Model training is handled by the `Trainer` class, which includes:
- ✅ Support for early stopping and automatic checkpoints
- 📉 Calculation of metrics such as loss, accuracy, recall, and F1
- 📝 Logs saved in `.json` format
- 📊 Automatic plotting of training curves (loss and accuracy)
- 🧪 Validation at the end of each epoch
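The `Trainer` internals are not reproduced here, but the early-stopping and checkpointing pattern it describes typically looks like this sketch (function and argument names are illustrative):

```python
import json

import torch
from sklearn.metrics import f1_score, recall_score

def train(model, train_loader, val_loader, optimizer, criterion,
          max_epochs: int = 50, patience: int = 5):
    """Training loop with per-epoch validation, early stopping, and checkpoints."""
    best_val_loss, stale_epochs, history = float("inf"), 0, []

    for epoch in range(max_epochs):
        model.train()
        for specs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(specs), labels)
            loss.backward()
            optimizer.step()

        # validation at the end of each epoch
        model.eval()
        val_loss, preds, targets = 0.0, [], []
        with torch.no_grad():
            for specs, labels in val_loader:
                logits = model(specs)
                val_loss += criterion(logits, labels).item()
                preds += logits.argmax(dim=1).tolist()
                targets += labels.tolist()
        val_loss /= len(val_loader)
        history.append({"epoch": epoch, "val_loss": val_loss,
                        "accuracy": sum(p == t for p, t in zip(preds, targets)) / len(targets),
                        "recall": recall_score(targets, preds, average="macro"),
                        "f1": f1_score(targets, preds, average="macro")})

        if val_loss < best_val_loss:            # improvement: save a checkpoint
            best_val_loss, stale_epochs = val_loss, 0
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch}, "checkpoint/best.pt")
        else:                                   # no improvement this epoch
            stale_epochs += 1
            if stale_epochs >= patience:
                break                           # early stopping

    with open("checkpoint/metrics.json", "w") as f:
        json.dump(history, f, indent=2)         # logs saved in .json format
```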
The CNN architecture includes:
- 🔹 4 convolutional blocks with `BatchNorm`, `ReLU`, and `Dropout`
- 🔹 `MaxPooling` between blocks
- 🔹 `Flatten` + fully connected layers
- 🔹 Final `Softmax` layer for 10-class classification
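A sketch of such an architecture (channel sizes, dropout rate, and the 1×64×400 input shape are illustrative assumptions; the real definition lives in `src/model.py`). Note that training typically uses `CrossEntropyLoss` on raw logits, with softmax applied afterwards for class probabilities:

```python
import torch
import torch.nn as nn

class UrbanSoundCNN(nn.Module):
    """Four conv blocks (Conv2d + BatchNorm + ReLU + Dropout) with MaxPooling,
    followed by Flatten + fully connected layers for 10 classes."""

    def __init__(self, n_classes: int = 10, dropout: float = 0.3):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (16, 32, 64, 128):           # assumed channel progression
            blocks += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(),
                       nn.Dropout(dropout),
                       nn.MaxPool2d(2)]            # MaxPooling between blocks
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256),                    # infers the flattened size on first call
            nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))   # raw logits

model = UrbanSoundCNN()
probs = torch.softmax(model(torch.randn(8, 1, 64, 400)), dim=1)  # softmax for probabilities
```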
Automatic visualizations after training:
- Metric logs saved per epoch as CSV files
- Visualization script in `utils/visualization.py`
- Charts for:
  - 🎯 Accuracy and loss per epoch
  - 🔄 Execution time per epoch
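A minimal plotting sketch, assuming a per-epoch CSV log with `epoch`, `train_loss`, `val_loss`, `train_acc`, and `val_acc` columns (hypothetical names; the project's own script is in `utils/visualization.py`):

```python
import pandas as pd
import matplotlib.pyplot as plt

log = pd.read_csv("checkpoint/metrics.csv")  # hypothetical per-epoch log file

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(log["epoch"], log["train_loss"], label="train")
ax_loss.plot(log["epoch"], log["val_loss"], label="val")
ax_loss.set(xlabel="epoch", ylabel="loss", title="Loss per epoch")
ax_loss.legend()

ax_acc.plot(log["epoch"], log["train_acc"], label="train")
ax_acc.plot(log["epoch"], log["val_acc"], label="val")
ax_acc.set(xlabel="epoch", ylabel="accuracy", title="Accuracy per epoch")
ax_acc.legend()

fig.tight_layout()
fig.savefig("checkpoint/train_and_val_metrics.png")
```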
Using the UrbanSound8K dataset:
- 🔊 8732 audio files (`.wav`)
- 🏷️ 10 classes of urban sounds (e.g., siren, bark, car horn)
- 📁 Split into 10 folders (`fold1` to `fold10`)
- 🗂️ Metadata file `metadata/UrbanSound8K.csv` contains: `slice_file_name`, `fold`, `classID`

🔗 Download link: UrbanSound8K
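For a quick look at those metadata columns, the CSV can be inspected directly with pandas (a sketch, assuming the dataset sits in `UrbanSound8K/`):

```python
import pandas as pd

meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")
print(meta[["slice_file_name", "fold", "classID"]].head())
print(meta["fold"].value_counts().sort_index())   # clips per fold
print(meta["classID"].nunique(), "classes")       # should print 10
```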
Install the requirements with:

```bash
pip install -r requirements.txt
```
Key libraries:
- torch
- torchaudio
- scikit-learn
- matplotlib
- tqdm
- pandas
- numpy
- UrbanSound8K Dataset
- SpecAugment: Data Augmentation for ASR
- PyTorch
- Torchaudio Docs
- Scikit-learn Metrics
- Audio Deep Learning Made Simple - Ketan Doshi
Developed by Lucas Alves
📧 Email: alves_lucasoliveira@usp.br
🐙 GitHub: cyblx
💼 LinkedIn: cyblx