# MNIST audio classification using PyTorch

## Dataset

The original dataset is available at https://www.kaggle.com/datasets/sripaadsrinivasan/audio-mnist

## CNN model

### Model summary

The following table shows the model architecture.

| Layer (type) | Output Shape | Param # |
|---|---|---|
| Conv2d-1 | [-1, 10, 1023, 30] | 260 |
| ReLU-2 | [-1, 10, 1023, 30] | 0 |
| MaxPool2d-3 | [-1, 10, 511, 15] | 0 |
| Conv2d-4 | [-1, 20, 509, 13] | 5,020 |
| ReLU-5 | [-1, 20, 509, 13] | 0 |
| MaxPool2d-6 | [-1, 20, 254, 6] | 0 |
| Flatten-7 | [-1, 30480] | 0 |
| Linear-8 | [-1, 3800] | 115,827,800 |
| ReLU-9 | [-1, 3800] | 0 |
| Dropout-10 | [-1, 3800] | 0 |
| Linear-11 | [-1, 1024] | 3,892,224 |
| ReLU-12 | [-1, 1024] | 0 |
| Dropout-13 | [-1, 1024] | 0 |
| Linear-14 | [-1, 256] | 262,400 |
| ReLU-15 | [-1, 256] | 0 |
| Dropout-16 | [-1, 256] | 0 |
| Linear-17 | [-1, 10] | 2,570 |

- Total params: 119,990,274
- Trainable params: 119,990,274
- Non-trainable params: 0
- Input size (MB): 0.13
- Forward/backward pass size (MB): 7.87
- Params size (MB): 457.73
- Estimated total size (MB): 465.72
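The README does not include the model code itself; the following is a minimal PyTorch sketch that is consistent with the summary table. The input shape `(1, 1027, 34)` and the kernel sizes/padding are inferred from the output shapes and parameter counts, so details may differ from the repository's actual implementation.

```python
import torch
import torch.nn as nn


class AudioCNN(nn.Module):
    """CNN sketch reconstructed from the summary table (shapes/params inferred)."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),              # -> [10, 1023, 30], 260 params
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> [10, 511, 15]
            nn.Conv2d(10, 20, kernel_size=5, padding=1),  # -> [20, 509, 13], 5,020 params
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> [20, 254, 6]
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # -> 20 * 254 * 6 = 30,480
            nn.Linear(30480, 3800), nn.ReLU(), nn.Dropout(),
            nn.Linear(3800, 1024), nn.ReLU(), nn.Dropout(),
            nn.Linear(1024, 256), nn.ReLU(), nn.Dropout(),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        # x: (batch, 1, 1027, 34) spectrogram-like input
        return self.classifier(self.features(x))
```

With these assumptions the module reproduces the table's total of 119,990,274 parameters.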

## Results

### Training parameters

| Parameter | Value |
|---|---|
| Loss function | CrossEntropyLoss |
| Optimizer | Adam |
| Learning rate | 1e-3 |
| Audio sample rate | 16 kHz |
| Dataset size | 30,000 samples |
| Dataset split | 70% train / 20% validation / 10% test |
| Batch size | 32 |
| Epochs | 15 |
| Early stopping | Patience = 3 epochs, delta = 0.00, monitor = accuracy |
| Device | CUDA GPU (NVIDIA GeForce RTX 3050 Laptop GPU) |
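The early-stopping rule in the table (patience = 3, delta = 0.00, monitoring accuracy) can be sketched framework-free. The class name and interface below are illustrative, not taken from the repository:

```python
class EarlyStopping:
    """Stop when the monitored accuracy has not improved by more than
    `delta` for `patience` consecutive epochs."""

    def __init__(self, patience=3, delta=0.0):
        self.patience = patience
        self.delta = delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, accuracy):
        """Record one epoch's accuracy; return True if training should stop."""
        if accuracy > self.best + self.delta:
            self.best = accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Under this rule, a run whose best validation accuracy falls at epoch 4 stops three epochs later, at epoch 7, which matches the effective training duration reported below.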

### Evaluation Results

| Metric | Training | Validation | Test |
|---|---|---|---|
| Accuracy | 1.000 | 0.986 (best 0.991 at epoch 4) | 0.9896 |
| Loss | 0.000139 | 0.0613 | 0.0458 |

Note: training ran for 7 epochs in total. The best validation accuracy (0.991) was reached at epoch 4, after which the early-stopping patience of 3 epochs expired.

### Training Statistics

Plots of training loss, training accuracy, validation loss, and validation accuracy (figures not reproduced here).

## CNN + LSTM model

### Model summary

The following table shows the model architecture.

| Layer (type:depth-idx) | Output Shape | Param # |
|---|---|---|
| Conv2d: 2-1 | [1, 20, 1025, 32] | 200 |
| ReLU: 2-2 | [1, 20, 1025, 32] | -- |
| MaxPool2d: 2-3 | [1, 20, 512, 16] | -- |
| LSTM: 1-2 | [1, 4, 128] | 21,038,080 |
| Linear: 1-3 | [1, 10] | 1,290 |

- Total params: 21,039,570
- Trainable params: 21,039,570
- Non-trainable params: 0
- Total mult-adds (Units.MEGABYTES): 90.71
- Input size (MB): 0.13
- Forward/backward pass size (MB): 5.25
- Params size (MB): 84.16
- Estimated total size (MB): 89.54
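As with the CNN, the following is a hedged PyTorch sketch reconstructed from the summary table. The reshape of the convolutional features into a 4-step sequence of 40,960 features is inferred from the LSTM's output shape `[1, 4, 128]` and its parameter count; the repository's actual reshape may differ.

```python
import torch
import torch.nn as nn


class AudioCRNN(nn.Module):
    """CNN + LSTM sketch reconstructed from the summary table (details inferred)."""

    def __init__(self, n_classes=10, hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=3),  # -> [20, 1025, 32], 200 params
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> [20, 512, 16]
        )
        # 4 time steps of 20 * 512 * 16 / 4 = 40,960 features each
        self.lstm = nn.LSTM(input_size=40960, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)  # 1,290 params

    def forward(self, x):
        x = self.features(x)              # (batch, 20, 512, 16)
        x = x.reshape(x.size(0), 4, -1)   # (batch, 4, 40960), assumed split
        out, _ = self.lstm(x)             # (batch, 4, 128)
        return self.fc(out[:, -1])        # classify from the last time step
```

With these assumptions the module reproduces the table's total of 21,039,570 parameters.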

## Results

### Training parameters

| Parameter | Value |
|---|---|
| Loss function | CrossEntropyLoss |
| Optimizer | Adam |
| Learning rate | 1e-3 |
| Audio sample rate | 16 kHz |
| Dataset size | 30,000 samples |
| Dataset split | 70% train / 20% validation / 10% test |
| Batch size | 32 |
| Epochs | 15 |
| Early stopping | Patience = 3 epochs, delta = 0.00, monitor = accuracy |
| Device | CUDA GPU (NVIDIA GeForce RTX 3050 Laptop GPU) |

### Evaluation Results

| Metric | Training | Validation | Test |
|---|---|---|---|
| Accuracy | 1.000 | 0.985 (best 0.991 at epoch 5) | 0.9900 |
| Loss | 0.00266 | 0.0434 | 0.0347 |

Note: training ran for 8 epochs in total. The best validation accuracy (0.991) was reached at epoch 5, after which the early-stopping patience of 3 epochs expired.

### Training Statistics

Plots of training loss, training accuracy, validation loss, and validation accuracy (figures not reproduced here).
