negative loss and mismatched dimension when loading pretrained weights #2

Open
wg-li opened this issue Aug 11, 2023 · 2 comments

wg-li commented Aug 11, 2023

Hello,

I ran into two more problems while running experiments on the Avazu dataset:

  1. During the pretraining step, the loss is always negative, which seems strange, even though it still decreases:

08/11 02:07:09 PM client generated: 2
08/11 02:07:09 PM Cross-Party Train Epoch 0, training on aligned data, LR: 0.1, sample: 16384
08/11 02:07:10 PM Cross-Party SSL Train Epoch 0, client loss aligned: [-0.16511965772951953, -0.152420010213973]
08/11 02:07:10 PM Local SSL Train Epoch 0, training on local data, sample: 80384
08/11 02:07:22 PM Local SSL Train Epoch 0, client loss local: [-0.5874887084815307, -0.5748279593279881]
08/11 02:07:22 PM Local SSL Train Epoch 0, AGG MODE pma, client loss agg: []
08/11 02:07:24 PM ###### Valid Epoch 0 Start #####
08/11 02:07:24 PM Valid Epoch 0, valid client loss aligned: [-0.3176240861415863, -0.22815129309892654]
08/11 02:07:24 PM Valid Epoch 0, valid client loss local: [-0.22939987406134604, -0.22190943509340286]
08/11 02:07:24 PM Valid Epoch 0, valid client loss regularized: [0.0, 0.0]
08/11 02:07:24 PM Valid Epoch 0, Loss_aligned -0.273 Loss_local -0.226

  2. During the finetuning step, I get the following error:

File "/data/nfs/user/liwg/vfl/fedhssl/FedHSSL/models/model_templates.py", line 206, in load_encoder_cross
self.encoder_cross.load_state_dict(torch.load(load_path, map_location=device))
File "/data/nfs/miniconda/envs/liwg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DNNFM:
size mismatch for embedding_dict.device_ip.weight: copying a param with shape torch.Size([70769, 32]) from checkpoint, the shape in current model is torch.Size([70768, 32]).
size mismatch for embedding_dict.device_model.weight: copying a param with shape torch.Size([3066, 32]) from checkpoint, the shape in current model is torch.Size([3065, 32]).
size mismatch for embedding_dict.C14.weight: copying a param with shape torch.Size([1699, 32]) from checkpoint, the shape in current model is torch.Size([1698, 32]).

Each pretrained encoder_cross embedding table has one more row than the current model expects.
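
For anyone debugging the same error: before calling load_state_dict, the disagreeing tensors can be listed by comparing shapes directly. This is a minimal sketch; report_shape_mismatches is a hypothetical helper, not part of FedHSSL:

```python
import torch
from torch import nn

def report_shape_mismatches(model: nn.Module, ckpt_path: str) -> None:
    """Print every checkpoint tensor whose shape differs from the model's."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model_state = model.state_dict()
    for name, tensor in ckpt.items():
        if name in model_state and model_state[name].shape != tensor.shape:
            print(f"{name}: checkpoint {tuple(tensor.shape)} "
                  f"vs model {tuple(model_state[name].shape)}")
```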

jorghyq2016 (Owner) commented

Hi,

  1. The sign of the loss depends on the specific loss function used; some SSL objectives are negative by construction (see the sketch after this list).

  2. It seems there is a mismatch between the input data and the model parameters; you could check the format of the data used for pretraining and finetuning.
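
For example, objectives built on negative cosine similarity (SimSiam-style) are bounded in [-1, 1] and are expected to go negative as training improves. This is a generic sketch, not necessarily the exact loss FedHSSL uses:

```python
import torch
import torch.nn.functional as F

def neg_cosine_loss(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Negative cosine similarity: perfectly aligned vectors give -1,
    # so a correctly training model shows a negative, decreasing loss.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

p = torch.randn(16, 32)
z = p + 0.05 * torch.randn(16, 32)  # z close to p -> similarity near 1
print(neg_cosine_loss(p, z))        # prints a value near -1
```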

Best.


wg-li (Author) commented Aug 21, 2023

Hello,

I found two places that need to be checked carefully:

  1. At line 214 of prepare_experiments.py, "train_dataset_aug" is not defined when exp_type=='cls', which should be the case for vanilla classification or the finetuning step.
  2. At line 197 of ctr_dataset.py, why is "data[feat].nunique()+1" used for feature_columns in AvazuAug2party, compared to "data[feat].nunique()" in Avazu2party? It is this off-by-one, not the data format, that actually causes the dimension mismatch above (see the sketch below).
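
A minimal reproduction of the resulting shape mismatch, assuming the embedding table size comes straight from the feature's vocabulary size; the numbers and bare nn.Embedding usage are illustrative, not the actual FedHSSL code:

```python
import torch
from torch import nn

nunique = 1698                                # e.g. distinct values of feature C14
pretrain_emb = nn.Embedding(nunique + 1, 32)  # AvazuAug2party: weight [1699, 32]
finetune_emb = nn.Embedding(nunique, 32)      # Avazu2party:    weight [1698, 32]

try:
    # Mimics load_encoder_cross loading the pretrained weights at finetune time.
    finetune_emb.load_state_dict(pretrain_emb.state_dict())
except RuntimeError as err:
    print(err)  # size mismatch for weight: [1699, 32] vs [1698, 32]
```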

Best.
