I ran into two more problems when running experiments on the Avazu dataset:
When I run the pretraining step, the loss is always negative, which seems strange, even though it still decreases:
08/11 02:07:09 PM client generated: 2
08/11 02:07:09 PM Cross-Party Train Epoch 0, training on aligned data, LR: 0.1, sample: 16384
08/11 02:07:10 PM Cross-Party SSL Train Epoch 0, client loss aligned: [-0.16511965772951953, -0.152420010213973]
08/11 02:07:10 PM Local SSL Train Epoch 0, training on local data, sample: 80384
08/11 02:07:22 PM Local SSL Train Epoch 0, client loss local: [-0.5874887084815307, -0.5748279593279881]
08/11 02:07:22 PM Local SSL Train Epoch 0, AGG MODE pma, client loss agg: []
08/11 02:07:24 PM ###### Valid Epoch 0 Start #####
08/11 02:07:24 PM Valid Epoch 0, valid client loss aligned: [-0.3176240861415863, -0.22815129309892654]
08/11 02:07:24 PM Valid Epoch 0, valid client loss local: [-0.22939987406134604, -0.22190943509340286]
08/11 02:07:24 PM Valid Epoch 0, valid client loss regularized: [0.0, 0.0]
08/11 02:07:24 PM Valid Epoch 0, Loss_aligned -0.273 Loss_local -0.226
When I run the finetuning step, it raises the error below:
File "/data/nfs/user/liwg/vfl/fedhssl/FedHSSL/models/model_templates.py", line 206, in load_encoder_cross
self.encoder_cross.load_state_dict(torch.load(load_path, map_location=device))
File "/data/nfs/miniconda/envs/liwg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DNNFM:
size mismatch for embedding_dict.device_ip.weight: copying a param with shape torch.Size([70769, 32]) from checkpoint, the shape in current model is torch.Size([70768, 32]).
size mismatch for embedding_dict.device_model.weight: copying a param with shape torch.Size([3066, 32]) from checkpoint, the shape in current model is torch.Size([3065, 32]).
size mismatch for embedding_dict.C14.weight: copying a param with shape torch.Size([1699, 32]) from checkpoint, the shape in current model is torch.Size([1698, 32]).
The pretrained encoder_cross embedding weights are one row larger than the current model expects.
The sign of the loss depends on the specific loss function you used.
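For example, if the pretraining objective is a SimSiam/BYOL-style negative cosine similarity (an assumption here; check which SSL loss your configuration actually selects), a negative value is expected and the loss decreases toward -1 as representations align. A minimal sketch:

```python
import numpy as np

def neg_cosine_loss(p, z):
    """SimSiam-style loss: mean negative cosine similarity between
    predictor outputs p and (stop-gradient) targets z.
    Range is [-1, 1]; lower is better, optimum is -1."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z, axis=1))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# Perfectly aligned representations give the minimum, -1:
print(neg_cosine_loss(x, x))  # -> -1.0 (up to floating-point error)
```

So a negative, decreasing pretraining loss is consistent with this family of objectives rather than a bug.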
It seems there is a mismatch between the input data and the model parameters; you could check the format of the data you used for pretraining and finetuning.
I found two places that need to be checked carefully:
At line 214 of prepare_experiments.py, "train_dataset_aug" is not defined when exp_type == 'cls', which should be the case for vanilla classification or the finetuning step.
At line 197 of ctr_dataset.py, why is it "data[feat].nunique() + 1" for feature_columns in AvazuAug2party, compared to "data[feat].nunique()" in Avazu2party? This off-by-one actually causes the aforementioned dimension mismatch, rather than the data format.
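The off-by-one can be reproduced in isolation: an embedding table built with nunique() + 1 rows at pretraining time cannot be loaded into a model built with nunique() rows, because load_state_dict requires identical shapes. A sketch (the cardinality below is taken from the device_model mismatch in the logs):

```python
import torch.nn as nn

n_unique, emb_dim = 3065, 32  # device_model cardinality, per the error above

# Pretraining (AvazuAug2party) sizes the table with nunique() + 1 rows ...
pretrain_emb = nn.Embedding(n_unique + 1, emb_dim)
# ... while finetuning (Avazu2party) sizes it with nunique() rows.
finetune_emb = nn.Embedding(n_unique, emb_dim)

try:
    finetune_emb.load_state_dict(pretrain_emb.state_dict())
except RuntimeError as e:
    # "size mismatch for weight: ... [3066, 32] ... [3065, 32]"
    print("load failed:", e)
```

Making the two dataset classes use the same vocabulary size (or padding both with + 1) would remove the mismatch.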