This repository contains a notebook that fine-tunes a pretrained model for a multi-output image classification task. The data comes from the Fashion Product Images dataset on Kaggle; you can find the dataset at this link.
You can also access my code for this project here, and the trained model here.
After aggregation, the dataset contains 44,446 rows and 12 columns: id, image_path, gender, masterCategory, subCategory, articleType, baseColour, season, usage, brandName, variantName, and productDisplayName.
Among these, the last eight columns are considered the target features for our analysis.
The figure below illustrates a severe class imbalance across most target features:
We use Label Encoding for all target features due to its simplicity and memory efficiency.
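As a minimal sketch of this step (assuming the aggregated metadata sits in a pandas DataFrame `df`; the target column list here is illustrative and should be matched to the notebook's actual choice):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed set of target columns for illustration; adjust to the notebook.
TARGET_COLS = ["gender", "masterCategory", "subCategory", "articleType",
               "baseColour", "season", "usage", "brandName"]

def encode_targets(df: pd.DataFrame):
    """Label-encode each target column, keeping the fitted encoders
    so predictions can be mapped back to the original labels."""
    encoders = {}
    for col in TARGET_COLS:
        enc = LabelEncoder()
        df[col] = enc.fit_transform(df[col].astype(str))
        encoders[col] = enc
    return df, encoders

# Usage (df assumed loaded from the aggregated metadata):
# df, encoders = encode_targets(df)
# n_classes = {col: len(enc.classes_) for col, enc in encoders.items()}
```

The per-column class counts (`n_classes` above) are what size the softmax heads in the model below.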
We use EfficientNetB0 (excluding its top layer) as the backbone of our network, followed by a multi-head output in which each head is a dense layer with softmax activation.
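A minimal Keras sketch of this architecture (the `n_classes` mapping from the encoding step above is assumed):

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB0

def build_model(n_classes: dict) -> Model:
    """EfficientNetB0 (no top) as a shared backbone, with one
    softmax dense head per target feature."""
    inputs = layers.Input(shape=(224, 224, 3))
    backbone = EfficientNetB0(include_top=False, weights="imagenet")
    x = backbone(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = [
        layers.Dense(n, activation="softmax", name=col)(x)
        for col, n in n_classes.items()
    ]
    return Model(inputs=inputs, outputs=outputs)
```

Naming each head after its target column lets Keras report per-head losses and metrics under those names during training.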
We first train only the new top layers for 25 epochs, then fine-tune the entire network (including the backbone) for up to 50 additional epochs. The hyperparameters for both phases are listed below, followed by a sketch of the schedule:
| Image Size | Optimizer & Learning Rate | Loss | Training Epochs | Early-Stopping Patience |
|---|---|---|---|---|
| (224, 224, 3) | Adam, LR_phase_1 = 5e-3, LR_phase_2 = 5e-5 | Sparse Categorical Cross-Entropy (one per head) | epochs_phase_1 = 25, epochs_phase_2 = 50 | patience_phase_1 = 10, patience_phase_2 = 20 |
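A sketch of the two-phase schedule with these hyperparameters (`model`, `n_classes`, and the `train_ds`/`val_ds` data pipelines are assumed from the sketches above; one sparse categorical cross-entropy loss is applied per head):

```python
from tensorflow.keras import callbacks, optimizers

losses = {col: "sparse_categorical_crossentropy" for col in n_classes}

# Phase 1: freeze the backbone and train only the new heads.
model.get_layer("efficientnetb0").trainable = False
model.compile(optimizer=optimizers.Adam(5e-3), loss=losses, metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=25,
          callbacks=[callbacks.EarlyStopping(patience=10, restore_best_weights=True)])

# Phase 2: unfreeze everything and fine-tune at a much lower learning rate.
model.get_layer("efficientnetb0").trainable = True
model.compile(optimizer=optimizers.Adam(5e-5), loss=losses, metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=50,
          callbacks=[callbacks.EarlyStopping(patience=20, restore_best_weights=True)])
```

Note that recompiling after flipping `trainable` is required for Keras to pick up the change.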
The following plot shows the training and validation loss over epochs:
The plots below show accuracy and the remaining evaluation scores over the course of training. (All scores are computed on the validation set only.)