negative loss and mismatched dimension when load pretrained weights #2

@wg-li

Description

Hello,

I ran into two more problems while running experiments on the Avazu dataset:

  1. During the pretraining step, the loss is always negative, which seems somewhat strange, even though it still decreases.

08/11 02:07:09 PM client generated: 2
08/11 02:07:09 PM Cross-Party Train Epoch 0, training on aligned data, LR: 0.1, sample: 16384
08/11 02:07:10 PM Cross-Party SSL Train Epoch 0, client loss aligned: [-0.16511965772951953, -0.152420010213973]
08/11 02:07:10 PM Local SSL Train Epoch 0, training on local data, sample: 80384
08/11 02:07:22 PM Local SSL Train Epoch 0, client loss local: [-0.5874887084815307, -0.5748279593279881]
08/11 02:07:22 PM Local SSL Train Epoch 0, AGG MODE pma, client loss agg: []
08/11 02:07:24 PM ###### Valid Epoch 0 Start #####
08/11 02:07:24 PM Valid Epoch 0, valid client loss aligned: [-0.3176240861415863, -0.22815129309892654]
08/11 02:07:24 PM Valid Epoch 0, valid client loss local: [-0.22939987406134604, -0.22190943509340286]
08/11 02:07:24 PM Valid Epoch 0, valid client loss regularized: [0.0, 0.0]
08/11 02:07:24 PM Valid Epoch 0, Loss_aligned -0.273 Loss_local -0.226
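For reference, a negative loss would not necessarily indicate a bug if the SSL objective is a SimSiam-style negative cosine similarity (this is only my assumption, based on the loss values falling in [-1, 0]). A minimal sketch of such an objective:

```python
import math

def neg_cosine_loss(p, z):
    """Negative cosine similarity, as used in SimSiam-style SSL objectives.
    The value lies in [-1, 1]; perfectly aligned representations give -1,
    so negative, decreasing losses are the expected behavior."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

# identical representations reach the minimum of -1.0
print(neg_cosine_loss([3.0, 4.0], [3.0, 4.0]))  # -1.0
```

So if the pretraining loss here is of this form, values like -0.58 moving toward -1 would simply mean the representations are becoming better aligned.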

  2. During the finetune step, it shows the error below:

File "/data/nfs/user/liwg/vfl/fedhssl/FedHSSL/models/model_templates.py", line 206, in load_encoder_cross
self.encoder_cross.load_state_dict(torch.load(load_path, map_location=device))
File "/data/nfs/miniconda/envs/liwg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DNNFM:
size mismatch for embedding_dict.device_ip.weight: copying a param with shape torch.Size([70769, 32]) from checkpoint, the shape in current model is torch.Size([70768, 32]).
size mismatch for embedding_dict.device_model.weight: copying a param with shape torch.Size([3066, 32]) from checkpoint, the shape in current model is torch.Size([3065, 32]).
size mismatch for embedding_dict.C14.weight: copying a param with shape torch.Size([1699, 32]) from checkpoint, the shape in current model is torch.Size([1698, 32]).

Each pretrained encoder_cross embedding table is one row larger than the current model expects.
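My guess (purely an assumption on my part, not confirmed in the code) is that each stage derives the embedding vocabulary size from the data split it loads, so a category value seen only during pretraining shifts the size by one. A hypothetical sketch of how that off-by-one arises:

```python
# Hypothetical illustration: if each stage computes the embedding
# vocabulary size from the split it loads, a value present only in
# the pretraining split makes the checkpoint one row larger.
pretrain_ids = ["a", "b", "c", "d"]   # e.g. device_ip values seen at pretraining
finetune_ids = ["a", "b", "c"]        # one value never appears at finetune time

vocab_pretrain = len(set(pretrain_ids))  # 4 rows in the checkpoint embedding
vocab_finetune = len(set(finetune_ids))  # 3 rows expected by the current model
print(vocab_pretrain, vocab_finetune)    # sizes differ -> load_state_dict fails
```

If that is the cause, computing the vocabulary sizes once over the full dataset (rather than per split) should make the checkpoint and the finetune model agree.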
