
ML2 | Urban Sound Classification


Project Overview

Sound classification is considered one of the most important tasks in the field of deep learning. It has a great impact on voice recognition in virtual assistants (such as Siri or Alexa), on customer service, and on music and media content recommendation systems. It also matters in the medical field, where it can detect abnormalities in heartbeats or respiratory sounds, and in security and surveillance systems, where it helps detect and assess a possible security breach inside a home, whether signalled by distress calls, gunshots, or breaking glass. Therefore, we aim to develop deep learning algorithms that can properly classify the environmental sounds provided by the UrbanSound8K dataset.

Project Development

Dependencies & Execution

This project was developed as a Notebook. Therefore, if you are looking forward to testing it yourself, keep in mind that you will need either an Anaconda distribution or third-party software that lets you inspect and execute notebooks.

For more information regarding the virtual environment used in Anaconda, consider checking the DEPENDENCIES.md file.

Planned Work

The project is organised into several key phases:

  1. Exploratory Data Analysis: We begin by examining the UrbanSound8K dataset to gain deeper insights into its structure and content, which helps us understand the distribution of sound classes.
  2. Data Pre-processing: Cleaning and preparing the audio samples to ensure their consistency and quality.
  3. Feature Engineering: Using the Librosa library, we extract meaningful features from the audio data, such as Mel-frequency cepstral coefficients (MFCCs); see the extraction sketch after this list.
  4. Model Architecture Definition: We design artificial neural network architectures tailored for sound classification, which involves experimenting with different deep learning models.
  5. Training and Performance Evaluation: Using the pre-partitioned dataset, we perform 10-fold cross-validation on each developed network and assess the models' performance with key metrics such as accuracy and confusion matrices; a sketch of this evaluation loop also follows the list.
  6. Statistical Inference: We perform a statistical evaluation of the performance differences between all the developed networks.
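
As a rough illustration of the Feature Engineering phase, the sketch below shows how MFCCs could be extracted with Librosa. It is not the project's actual code: the number of coefficients, the time-averaging step, and the example path are assumptions.

```python
import librosa
import numpy as np

def extract_mfccs(audio_path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load an audio clip and return its mean MFCC vector over time."""
    # librosa.load resamples to 22050 Hz by default and converts to mono
    signal, sample_rate = librosa.load(audio_path)
    # MFCC matrix of shape (n_mfcc, n_frames)
    mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    # Average over the time axis to obtain a fixed-length feature vector
    return np.mean(mfccs, axis=1)

# Example usage (the path and filename are placeholders)
# features = extract_mfccs("UrbanSound8K/audio/fold1/some_clip.wav")
# print(features.shape)  # (40,)
```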

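For the Training and Performance Evaluation phase, the loop below sketches 10-fold cross-validation driven by the dataset's pre-assigned folds. `build_model`, `features`, and `labels` are hypothetical placeholders; only the `fold` column name comes from the UrbanSound8K metadata.

```python
import numpy as np
import pandas as pd

def evaluate_with_predefined_folds(metadata: pd.DataFrame, features: np.ndarray,
                                   labels: np.ndarray, build_model) -> list:
    """Run 10-fold cross-validation using the dataset's pre-assigned folds.

    `build_model` is a hypothetical factory returning a fresh, untrained model
    exposing fit/predict; `features` and `labels` are aligned with `metadata`.
    """
    accuracies = []
    for fold in sorted(metadata["fold"].unique()):
        # Official protocol: train on the other nine folds, test on this one
        test_mask = metadata["fold"].values == fold
        model = build_model()
        model.fit(features[~test_mask], labels[~test_mask])
        predictions = model.predict(features[test_mask])
        accuracies.append(np.mean(predictions == labels[test_mask]))
    return accuracies
```
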
UrbanSound8K Dataset

The UrbanSound8K dataset contains 8732 labeled sound excerpts ($\le$ 4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy.

For a detailed description of the dataset please consider checking the dataset web page available here. In case you are interested in the compilation process, the dataset creators have published a paper outlining the Taxonomy for Urban Sound Research. You can access it here.
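
As a quick way to inspect the class distribution described above, the snippet below loads the metadata file that ships with the dataset; the path is an assumption and should be adjusted to wherever the archive is extracted.

```python
import pandas as pd

# Path is an assumption; adjust it to where the dataset was extracted
metadata = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")

# Number of clips per class and per pre-assigned fold
print(metadata["class"].value_counts())
print(metadata.groupby("fold")["class"].count())
```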

Additional Datasets

If you're interested in trying this project yourself, you'll need access to the complete datasets we've created. Since GitHub has file size limits, we've made them all available here.

Project Results

Model Performance

The final global confusion matrices (included as figures in the repository) are reported for each network architecture: MLP, CNN, CNN pre-trained with YAMNet, and ResNet.

  • MLP: Achieved 45% accuracy, struggling with the complexity of the sound data.
  • CNN: Performed better at 55%, benefiting from 2D time-frequency representations (MFCCs).
  • YAMNet: Leveraging transfer learning, YAMNet outperformed other models with 70% accuracy.
  • ResNet: Achieved 55%, similar to CNN, but not as effective as YAMNet.

Dimensionality Reduction Visualization

Scatter plots (included as figures in the repository) show the data distribution of the 1-dimensional processed MFCCs and the 2-dimensional raw MFCCs under both PCA and t-SNE projections.
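
The projections above could be reproduced with scikit-learn; the following is a minimal sketch, where the 2D output dimensionality and the perplexity value are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_features(features: np.ndarray, method: str = "pca") -> np.ndarray:
    """Project per-clip feature vectors to 2D for a scatter plot."""
    if method == "pca":
        return PCA(n_components=2).fit_transform(features)
    # t-SNE is non-linear and considerably slower; the perplexity is a guess here
    return TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

# Example usage with hypothetical MFCC vectors of shape (n_clips, 40)
# embedding = project_features(mfcc_vectors, method="tsne")
```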

Critical Differences Diagram


The critical difference diagram (included as a figure in the repository) shows the average rank of every model:

  1. YAMNet (1.0) is the best model, significantly outperforming the others.
  2. CNN (2.6) and ResNet (2.6) have similar performance, with no statistically significant difference between them.
  3. MLP (3.8) is the worst, significantly worse than YAMNet and likely worse than CNN / ResNet.
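
A ranking like the one above is typically backed by a Friedman test over the per-fold accuracies. The sketch below shows one way to compute the test statistic and the average ranks with SciPy; the scores array is a placeholder, not the project's results.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Placeholder per-fold accuracies, one column per model (10 folds x 4 models);
# replace with the real cross-validation results
scores = np.random.rand(10, 4)
model_names = ["MLP", "CNN", "YAMNet", "ResNet"]

# Friedman test: do the models' fold-wise accuracies come from the same distribution?
statistic, p_value = friedmanchisquare(*scores.T)

# Average rank of each model across folds (rank 1 = best accuracy on that fold)
ranks = rankdata(-scores, axis=1).mean(axis=0)
print(dict(zip(model_names, ranks)), p_value)
```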

Conclusion

Our experiments showed that YAMNet with transfer learning produced the best results. Regularization techniques such as Dropout and L2 regularization helped reduce overfitting. While the models performed well overall, difficulties in distinguishing similar sound classes suggest that further improvements in feature extraction and model design could enhance performance.
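
As a minimal illustration of the two regularization techniques mentioned above, the Keras snippet below applies Dropout and L2 weight decay to a small dense classifier. The layer sizes, dropout rate, and L2 factor are illustrative, and the TensorFlow/Keras stack is assumed from the use of YAMNet.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Small dense classifier showing Dropout and L2 regularization;
# hyperparameter values here are illustrative, not the project's settings
model = tf.keras.Sequential([
    layers.Input(shape=(40,)),                      # e.g. a 40-dimensional MFCC vector
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),                            # zero half the activations during training
    layers.Dense(10, activation="softmax"),         # 10 UrbanSound8K classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```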

Authorship

README.md by Gonçalo Esteves

