The original dataset is available at https://www.kaggle.com/datasets/sripaadsrinivasan/audio-mnist
The following table shows the architecture of the first model, a convolutional network followed by fully connected layers.
Layer (type) | Output Shape | Param # |
---|---|---|
Conv2d-1 | [-1, 10, 1023, 30] | 260 |
ReLU-2 | [-1, 10, 1023, 30] | 0 |
MaxPool2d-3 | [-1, 10, 511, 15] | 0 |
Conv2d-4 | [-1, 20, 509, 13] | 5,020 |
ReLU-5 | [-1, 20, 509, 13] | 0 |
MaxPool2d-6 | [-1, 20, 254, 6] | 0 |
Flatten-7 | [-1, 30480] | 0 |
Linear-8 | [-1, 3800] | 115,827,800 |
ReLU-9 | [-1, 3800] | 0 |
Dropout-10 | [-1, 3800] | 0 |
Linear-11 | [-1, 1024] | 3,892,224 |
ReLU-12 | [-1, 1024] | 0 |
Dropout-13 | [-1, 1024] | 0 |
Linear-14 | [-1, 256] | 262,400 |
ReLU-15 | [-1, 256] | 0 |
Dropout-16 | [-1, 256] | 0 |
Linear-17 | [-1, 10] | 2,570 |

Summary | Value |
---|---|
Total params | 119,990,274 |
Trainable params | 119,990,274 |
Non-trainable params | 0 |
Input size (MB) | 0.13 |
Forward/backward pass size (MB) | 7.87 |
Params size (MB) | 457.73 |
Estimated total size (MB) | 465.72 |

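The shapes and parameter counts in the table above can be cross-checked with plain arithmetic. The sketch below is not the original model code: the 5x5 kernels, the padding of 1, the 2x2 pooling, and the single-channel 1025 x 32 input are assumptions inferred from the reported numbers, and the helper names are made up for illustration.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2):
    """Output length of non-overlapping max pooling (floor mode)."""
    return size // window

def conv_params(c_in, c_out, kernel):
    """Weights plus one bias per output channel."""
    return c_out * c_in * kernel * kernel + c_out

def linear_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

# Assumptions (not stated in the source): 5x5 kernels, padding 1,
# 2x2 pooling, and a one-channel 1025 x 32 input.
KERNEL, PADDING = 5, 1
h, w = 1025, 32

h, w = conv_out(h, KERNEL, PADDING), conv_out(w, KERNEL, PADDING)
conv1_shape = (10, h, w)
h, w = pool_out(h), pool_out(w)
pool1_shape = (10, h, w)
h, w = conv_out(h, KERNEL, PADDING), conv_out(w, KERNEL, PADDING)
conv2_shape = (20, h, w)
h, w = pool_out(h), pool_out(w)
pool2_shape = (20, h, w)
flat_features = 20 * h * w

layer_params = {
    "Conv2d-1": conv_params(1, 10, KERNEL),
    "Conv2d-4": conv_params(10, 20, KERNEL),
    "Linear-8": linear_params(flat_features, 3800),
    "Linear-11": linear_params(3800, 1024),
    "Linear-14": linear_params(1024, 256),
    "Linear-17": linear_params(256, 10),
}
total_params = sum(layer_params.values())

# 4 bytes per float32 parameter; dividing by 1024**2 matches the
# reported "Params size (MB)".
params_size_mb = total_params * 4 / 1024**2

print(conv1_shape, pool1_shape, conv2_shape, pool2_shape, flat_features)
print(layer_params)
print(total_params, round(params_size_mb, 2))
```

Under these assumptions every intermediate shape and per-layer count agrees with the table, and the first fully connected layer accounts for almost all of the parameters.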
Parameter | Value |
---|---|
Loss function | CrossEntropyLoss |
Optimizer | Adam |
Learning rate | 10^-3 |
Audio sample rate | 16 kHz |
Dataset size | 30000 samples |
Dataset split | 70% train / 20% validation / 10% test |
Batch size | 32 |
Epochs | 15 |
Early stopping | Patience = 3 epochs, Delta = 0.00, Monitor = Accuracy |
Device | CUDA GPU (NVIDIA GeForce RTX 3050 Laptop GPU) |
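As a quick illustration of what the split and batch size in the table imply, the sketch below computes the number of samples per subset and the number of batches per epoch. It assumes the last, partially filled batch is kept rather than dropped; the source does not say which is the case.

```python
import math

dataset_size = 30_000
split_percent = {"train": 70, "validation": 20, "test": 10}
batch_size = 32

# Integer arithmetic avoids floating-point rounding in the split sizes.
split_sizes = {
    name: dataset_size * pct // 100 for name, pct in split_percent.items()
}

# Assumption: the final partial batch is kept, hence the ceiling.
batches_per_epoch = {
    name: math.ceil(size / batch_size) for name, size in split_sizes.items()
}

print(split_sizes)
print(batches_per_epoch)
```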
Metric | Training | Validation | Test |
---|---|---|---|
Accuracy (fraction) | 1.000 | 0.986 (best 0.991 at epoch 4) | 0.9896 |
Loss | 0.000139 | 0.0613 | 0.0458 |
Figure: training loss, training accuracy, validation loss, and validation accuracy per epoch.

Note: Training effectively ran for 7 epochs. Validation accuracy peaked at 0.991 in epoch 4, and early stopping (patience = 3) halted training three epochs later.
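The stopping behaviour described in the note can be sketched with a small helper. This is a minimal illustration, not the training code used here: the `EarlyStopping` class is hypothetical, it assumes an epoch counts as an improvement only when accuracy exceeds the best value by more than `delta`, and the accuracy curve is made up except for the reported 0.991 at epoch 4 and 0.986 at epoch 7.

```python
class EarlyStopping:
    """Hypothetical helper: stop after `patience` epochs without improvement."""

    def __init__(self, patience=3, delta=0.0):
        self.patience = patience
        self.delta = delta
        self.best = float("-inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, accuracy):
        """Record one epoch and return True when training should stop."""
        if accuracy > self.best + self.delta:
            self.best = accuracy
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation accuracies. Only epoch 4 (0.991) and
# epoch 7 (0.986) come from the reported results; the rest are invented.
val_accuracy = [0.960, 0.978, 0.985, 0.991, 0.988, 0.989, 0.986]

stopper = EarlyStopping(patience=3, delta=0.0)
stopped_epoch = None
for epoch, acc in enumerate(val_accuracy, start=1):
    if stopper.step(epoch, acc):
        stopped_epoch = epoch
        break

print(stopper.best_epoch, stopper.best, stopped_epoch)
```

With a patience of 3, the run ends three epochs after the best one, which is why a peak at epoch 4 gives 7 epochs of training out of the 15 configured.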
The following table shows the architecture of the second model, which feeds a convolutional block into an LSTM.
Layer (type:depth-idx) | Output Shape | Param # |
---|---|---|
Conv2d: 2-1 | [1, 20, 1025, 32] | 200 |
ReLU: 2-2 | [1, 20, 1025, 32] | -- |
MaxPool2d: 2-3 | [1, 20, 512, 16] | -- |
LSTM: 1-2 | [1, 4, 128] | 21,038,080 |
Linear: 1-3 | [1, 10] | 1,290 |

Summary | Value |
---|---|
Total params | 21,039,570 |
Trainable params | 21,039,570 |
Non-trainable params | 0 |
Total mult-adds (M) | 90.71 |
Input size (MB) | 0.13 |
Forward/backward pass size (MB) | 5.25 |
Params size (MB) | 84.16 |
Estimated total size (MB) | 89.54 |

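The LSTM parameter count can be reproduced from the shapes in the table. The sketch below assumes a single-layer, unidirectional LSTM with hidden size 128 and both bias vectors, a 3x3 convolution on a single input channel, and a pooled feature map reshaped into 4 time steps; none of these details are stated in the source, but they are consistent with the reported numbers.

```python
def lstm_params(input_size, hidden_size):
    """Single-layer, unidirectional LSTM: four gates, each with
    input weights, recurrent weights, and two bias vectors."""
    return 4 * (
        input_size * hidden_size
        + hidden_size * hidden_size
        + 2 * hidden_size
    )

# Pooled feature map from the table: 20 channels x 512 x 16.
channels, height, width = 20, 512, 16

# Assumption: the map is reshaped into 4 time steps, hidden size 128.
seq_len, hidden_size = 4, 128
features_per_step = channels * height * width // seq_len

conv_count = 20 * (1 * 3 * 3) + 20          # assumed 3x3 kernel, 1 input channel
lstm_count = lstm_params(features_per_step, hidden_size)
linear_count = hidden_size * 10 + 10        # 10 output classes
total_params = conv_count + lstm_count + linear_count

# 4 bytes per float32 parameter; dividing by 1e6 matches the
# reported "Params size (MB)" for this model.
params_size_mb = total_params * 4 / 1e6

print(features_per_step, conv_count, lstm_count, linear_count)
print(total_params, round(params_size_mb, 2))
```

Almost all of the parameters sit in the LSTM input weights, because each time step carries 40,960 features.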
Parameter | Value |
---|---|
Loss function | CrossEntropyLoss |
Optimizer | Adam |
Learning rate | 10^-3 |
Audio sample rate | 16 kHz |
Dataset size | 30000 samples |
Dataset split | 70% train / 20% validation / 10% test |
Batch size | 32 |
Epochs | 15 |
Early stopping | Patience = 3 epochs, Delta = 0.00, Monitor = Accuracy |
Device | CUDA GPU (NVIDIA GeForce RTX 3050 Laptop GPU) |
Metric | Training | Validation | Test |
---|---|---|---|
Accuracy (fraction) | 1.000 | 0.985 (best 0.991 at epoch 5) | 0.9900 |
Loss | 0.00266 | 0.0434 | 0.0347 |
Figure: training loss, training accuracy, validation loss, and validation accuracy per epoch.

Note: Training effectively ran for 8 epochs. Validation accuracy peaked at 0.991 in epoch 5, and early stopping (patience = 3) halted training three epochs later.