Replies: 1 comment
@fsolgui SGD will not work well with BCE loss; you can usually improve things a bit with SGD by enabling.
SGD is also a bad choice for most ViT and ViT-like architectures regardless of the loss used. In any case, it's better to use an adaptive optimizer like Lamb (as per ResNet Strikes Back), AdamW, NadamW, or something newer like Kron / KronW (the W means decoupled weight decay) when using BCE. Other considerations: the default label smoothing of 0.1 is far too strong for BCE loss, so I'd turn it off (set it to 0); ablations weren't conclusive that it was helpful. Sometimes using --bce-threshold can be helpful; 0.2 was used in the RSB runs. I have definitely used mixup and cutmix with BCE loss, but on ImageNet.
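For reference, a minimal sketch of what a BCE loss over mixup/cutmix soft targets with optional smoothing and a target threshold looks like. This is a simplified stand-in, not the exact loss implementation train.py uses; the function name and the toy values are purely illustrative.

```python
import torch
import torch.nn.functional as F

def bce_soft_target_loss(logits, soft_targets, smoothing=0.0, target_threshold=None):
    # Hypothetical helper for illustration only.
    num_classes = logits.shape[-1]
    targets = soft_targets.clone()
    if smoothing > 0.0:
        # same idea as CE label smoothing: pull targets toward uniform
        targets = targets * (1.0 - smoothing) + smoothing / num_classes
    if target_threshold is not None:
        # binarize the softened mixup/cutmix targets (e.g. threshold of 0.2)
        targets = (targets > target_threshold).to(logits.dtype)
    # independent sigmoid BCE per class, averaged over batch and classes
    return F.binary_cross_entropy_with_logits(logits, targets)

# toy usage: a mixup target between classes 3 and 7 with lam = 0.6
logits = torch.randn(4, 10)
soft = torch.zeros(4, 10)
soft[:, 3], soft[:, 7] = 0.6, 0.4
loss = bce_soft_target_loss(logits, soft, smoothing=0.0, target_threshold=0.2)
```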
Hi,
I am trying to fine-tune DeiT-tiny models on several small datasets (CIFAR10, CIFAR100, Flowers, and StanfordCars) using the main train.py script with mixup and cutmix. I get really poor validation accuracy: 30% on CIFAR10 and less than 5% on the other three datasets when using binary cross-entropy (BCE) loss via the --bce-loss flag. When using SoftTargetCrossEntropy loss I obtain results on par with published papers. I understand that BCE is not ideal with mixup/cutmix given that target labels are in the 0-1 range, but I didn't expect such a gap in performance. I am using the following script to execute train.py:
I am wondering if adding the --bce-loss flag requires changing anything else in the execution parameters, or if the observed low accuracies are expected when using BCE with cutmix/mixup. If that's the case, I would like to know why in depth.

Thanks in advance
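For what it's worth, a tiny self-contained snippet showing the two losses in question applied to the same mixup-style soft target (toy values only, not the actual training script):

```python
import torch
import torch.nn.functional as F

# toy mixup-style soft target between classes 2 and 5 (lam = 0.7)
logits = torch.randn(1, 10)
target = torch.zeros(1, 10)
target[0, 2], target[0, 5] = 0.7, 0.3

# soft-target cross entropy: -sum_c t_c * log_softmax(logits)_c
ce = -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# BCE: an independent sigmoid per class against the same 0-1 targets
bce = F.binary_cross_entropy_with_logits(logits, target)

print(f"soft-target CE: {ce.item():.4f}  BCE: {bce.item():.4f}")
```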