negative loss and mismatched dimension when loading pretrained weights #2

Open
wg-li opened this issue Aug 11, 2023 · 2 comments

wg-li commented Aug 11, 2023

Hello,

I ran into two more problems while running experiments on the Avazu dataset:

  1. During the pretraining step, the loss is always negative, which seems strange, even though it still decreases:

08/11 02:07:09 PM client generated: 2
08/11 02:07:09 PM Cross-Party Train Epoch 0, training on aligned data, LR: 0.1, sample: 16384
08/11 02:07:10 PM Cross-Party SSL Train Epoch 0, client loss aligned: [-0.16511965772951953, -0.152420010213973]
08/11 02:07:10 PM Local SSL Train Epoch 0, training on local data, sample: 80384
08/11 02:07:22 PM Local SSL Train Epoch 0, client loss local: [-0.5874887084815307, -0.5748279593279881]
08/11 02:07:22 PM Local SSL Train Epoch 0, AGG MODE pma, client loss agg: []
08/11 02:07:24 PM ###### Valid Epoch 0 Start #####
08/11 02:07:24 PM Valid Epoch 0, valid client loss aligned: [-0.3176240861415863, -0.22815129309892654]
08/11 02:07:24 PM Valid Epoch 0, valid client loss local: [-0.22939987406134604, -0.22190943509340286]
08/11 02:07:24 PM Valid Epoch 0, valid client loss regularized: [0.0, 0.0]
08/11 02:07:24 PM Valid Epoch 0, Loss_aligned -0.273 Loss_local -0.226

  2. During the finetuning step, I get the following error:

File "/data/nfs/user/liwg/vfl/fedhssl/FedHSSL/models/model_templates.py", line 206, in load_encoder_cross
self.encoder_cross.load_state_dict(torch.load(load_path, map_location=device))
File "/data/nfs/miniconda/envs/liwg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DNNFM:
size mismatch for embedding_dict.device_ip.weight: copying a param with shape torch.Size([70769, 32]) from checkpoint, the shape in current model is torch.Size([70768, 32]).
size mismatch for embedding_dict.device_model.weight: copying a param with shape torch.Size([3066, 32]) from checkpoint, the shape in current model is torch.Size([3065, 32]).
size mismatch for embedding_dict.C14.weight: copying a param with shape torch.Size([1699, 32]) from checkpoint, the shape in current model is torch.Size([1698, 32]).

Each pretrained encoder_cross embedding table has one more row than the current model expects.
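
For anyone debugging the same error: before calling load_state_dict, the disagreeing tensors can be listed by comparing shapes directly. This is a minimal sketch; report_shape_mismatches is a hypothetical helper, not part of FedHSSL:

```python
import torch
from torch import nn

def report_shape_mismatches(model: nn.Module, ckpt_path: str) -> None:
    """Print every checkpoint tensor whose shape differs from the model's."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model_state = model.state_dict()
    for name, tensor in ckpt.items():
        if name in model_state and model_state[name].shape != tensor.shape:
            print(f"{name}: checkpoint {tuple(tensor.shape)} "
                  f"vs model {tuple(model_state[name].shape)}")
```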

jorghyq2016 (Owner) commented

Hi,

  1. The sign of the loss depends on the specific loss function used; some SSL objectives are negative by construction (see the sketch after this list).

  2. It seems there is a mismatch between the input data and the model parameters; you could check the format of the data used for pretraining and finetuning.
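
For example, objectives built on negative cosine similarity (SimSiam-style) are bounded in [-1, 1] and are expected to go negative as training improves. This is a generic sketch, not necessarily the exact loss FedHSSL uses:

```python
import torch
import torch.nn.functional as F

def neg_cosine_loss(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Negative cosine similarity: perfectly aligned vectors give -1,
    # so a correctly training model shows a negative, decreasing loss.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

p = torch.randn(16, 32)
z = p + 0.05 * torch.randn(16, 32)  # z close to p -> similarity near 1
print(neg_cosine_loss(p, z))        # prints a value near -1
```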

Best.


wg-li (Author) commented Aug 21, 2023

Hello,

I found two places that need to be checked carefully:

  1. At line 214 of prepare_experiments.py, "train_dataset_aug" is not defined when exp_type=='cls', which should be the case for vanilla classification or the finetuning step.
  2. At line 197 of ctr_dataset.py, why is "data[feat].nunique()+1" used for feature_columns in AvazuAug2party, compared to "data[feat].nunique()" in Avazu2party? It is this off-by-one, not the data format, that actually causes the dimension mismatch above (see the sketch below).
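
A minimal reproduction of the resulting shape mismatch, assuming the embedding table size comes straight from the feature's vocabulary size; the numbers and bare nn.Embedding usage are illustrative, not the actual FedHSSL code:

```python
import torch
from torch import nn

nunique = 1698                                # e.g. distinct values of feature C14
pretrain_emb = nn.Embedding(nunique + 1, 32)  # AvazuAug2party: weight [1699, 32]
finetune_emb = nn.Embedding(nunique, 32)      # Avazu2party:    weight [1698, 32]

try:
    # Mimics load_encoder_cross loading the pretrained weights at finetune time.
    finetune_emb.load_state_dict(pretrain_emb.state_dict())
except RuntimeError as err:
    print(err)  # size mismatch for weight: [1699, 32] vs [1698, 32]
```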

Best.
