Replies: 1 comment
@fsolgui SGD will not work well with BCE loss; you can usually improve things a bit with SGD by enabling.
SGD is also a bad choice for most ViT and ViT-like architectures regardless of the loss used. In any case, it's better to use an adaptive optimizer like Lamb (as per ResNet Strikes Back), AdamW, NadamW, or something newer like Kron / KronW (the W means decoupled weight decay) when using BCE. Other considerations: the default label smoothing of 0.1 is far too strong for BCE loss, so I'd turn it off (set it to 0); ablations weren't conclusive that it was helpful. Sometimes using --bce-threshold can be helpful; 0.2 was used in the RSB runs. I have definitely used mixup and cutmix with BCE loss, but on ImageNet.
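For reference, a minimal sketch of what a BCE loss over mixup/cutmix soft targets with optional smoothing and a target threshold looks like. This is a simplified stand-in, not the exact loss implementation train.py uses; the function name and the toy values are purely illustrative.

```python
import torch
import torch.nn.functional as F

def bce_soft_target_loss(logits, soft_targets, smoothing=0.0, target_threshold=None):
    # Hypothetical helper for illustration only.
    num_classes = logits.shape[-1]
    targets = soft_targets.clone()
    if smoothing > 0.0:
        # same idea as CE label smoothing: pull targets toward uniform
        targets = targets * (1.0 - smoothing) + smoothing / num_classes
    if target_threshold is not None:
        # binarize the softened mixup/cutmix targets (e.g. threshold of 0.2)
        targets = (targets > target_threshold).to(logits.dtype)
    # independent sigmoid BCE per class, averaged over batch and classes
    return F.binary_cross_entropy_with_logits(logits, targets)

# toy usage: a mixup target between classes 3 and 7 with lam = 0.6
logits = torch.randn(4, 10)
soft = torch.zeros(4, 10)
soft[:, 3], soft[:, 7] = 0.6, 0.4
loss = bce_soft_target_loss(logits, soft, smoothing=0.0, target_threshold=0.2)
```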
Hi,
I am trying to fine-tune DeiT-tiny models on several small datasets (CIFAR10, CIFAR100, Flowers, and StanfordCars) using the main train.py script with mixup and cutmix. I get really poor validation accuracy: 30% on CIFAR10 and less than 5% on the other three datasets when using binary cross-entropy (BCE) loss via the --bce-loss flag. When using SoftTargetCrossEntropy loss I obtain results on par with published papers. I understand that BCE is not ideal with mixup/cutmix given that target labels are in the 0-1 range, but I didn't expect such a gap in performance. I am using the following script to execute train.py:
I am wondering if adding the --bce-loss flag requires changing anything else in the execution parameters, or if the observed low accuracies are expected when using BCE with cutmix/mixup. If that's the case, I would like to know why in depth.

Thanks in advance
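For what it's worth, a tiny self-contained snippet showing the two losses in question applied to the same mixup-style soft target (toy values only, not the actual training script):

```python
import torch
import torch.nn.functional as F

# toy mixup-style soft target between classes 2 and 5 (lam = 0.7)
logits = torch.randn(1, 10)
target = torch.zeros(1, 10)
target[0, 2], target[0, 5] = 0.7, 0.3

# soft-target cross entropy: -sum_c t_c * log_softmax(logits)_c
ce = -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# BCE: an independent sigmoid per class against the same 0-1 targets
bce = F.binary_cross_entropy_with_logits(logits, target)

print(f"soft-target CE: {ce.item():.4f}  BCE: {bce.item():.4f}")
```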