This repository has been archived by the owner on Feb 16, 2022. It is now read-only.

bug for ForwardModelsVal #69

Open
zongshenmu opened this issue Sep 25, 2020 · 7 comments

Comments

@zongshenmu

Why do the input size and target size of the cross-entropy loss not match?

Traceback (most recent call last):
  File "train_tasks.py", line 679, in <module>
    main()
  File "train_tasks.py", line 604, in main
    tbLogger,
  File "train_tasks.py", line 662, in evaluate
    args, task_cfg, device, task_id, batch, model, task_losses
  File "/data1/mzs/Code/vilbert-multi-task/vilbert/task_utils.py", line 155, in ForwardModelsVal
    loss = task_losses[task_id](vil_binary_prediction, target)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 601, in forward
    reduction=self.reduction)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/functional.py", line 2124, in binary_cross_entropy_with_logits
    raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([6, 2])) must be the same as input size (torch.Size([12, 2]))
@vedanuj
Contributor

vedanuj commented Sep 25, 2020

Which task are you running on? Also please share your training command.

@zongshenmu
Author

zongshenmu commented Sep 26, 2020

I pulled your code from the repository and only changed the train batch size in vilbert_tasks.yml. I ran the multi-task training as described in the README.md, but it fails on the NLVR dataset.
I debugged the dataset and found an abnormal condition during evaluation: it always fails on the last iteration of the evaluation loop. When that batch is fed into the model, the batch sizes of the target and of vil_binary_prediction do not match, so the task loss, binary_cross_entropy_with_logits, cannot be computed correctly.

vil_binary_prediction torch.Size([12, 2]) target torch.Size([6, 2])
Traceback (most recent call last):
  File "train_tasks.py", line 682, in <module>
    main()
  File "train_tasks.py", line 605, in main
    tbLogger,
  File "train_tasks.py", line 665, in evaluate
    args, task_cfg, device, task_id, batch, model, task_losses
  File "/data1/mzs/Code/vilbert-multi-task/vilbert/task_utils.py", line 157, in ForwardModelsVal
    loss = task_losses[task_id](vil_binary_prediction, target)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 601, in forward
    reduction=self.reduction)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/functional.py", line 2124, in binary_cross_entropy_with_logits
    raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([6, 2])) must be the same as input size (torch.Size([12, 2]))

I also suspect lines 1688-1691 of vilbert.py: when the batch size is odd, this branch is not executed.

if pooled_output.size(0) % 2 == 0:
    vil_binary_prediction = self.vil_binary_prediction(
        pooled_output.view(-1, pooled_output.size(1) * 2)
    )
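For what it's worth, the pairing step can be sketched in plain Python (toy data, not the repo's actual tensors): an odd number of pooled rows simply cannot be grouped into pairs, which is why the guarded branch above is skipped and the prediction keeps the per-image batch size instead of the per-example one.

```python
# Minimal sketch of the NLVR2 pairing step, assuming each example
# contributes two pooled vectors that get concatenated into one pair.
def pair_rows(rows):
    """Concatenate adjacent rows, mirroring pooled_output.view(-1, 2 * H)."""
    if len(rows) % 2 != 0:
        # mirrors the `% 2 == 0` guard: odd batches cannot be paired
        raise ValueError("cannot pair an odd number of rows")
    return [rows[i] + rows[i + 1] for i in range(0, len(rows), 2)]

pooled = [[float(i)] for i in range(12)]   # 12 per-image rows (toy H = 1)
pairs = pair_rows(pooled)                  # 6 per-example rows
assert len(pairs) == 6 and len(pairs[0]) == 2
```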

@zongshenmu
Author

Your code in eval_tasks.py for retrieval_datasets.py is also wrong: the dataloader cannot infer the right batch size and number of options:

    elif task_cfg[task_id]["process"] in ["retrieval"]:
        max_num_bbox = features.size(1)
        num_options = question.size(1)
       
        features = features.view(-1, features.size(2), features.size(3))
        spatials = spatials.view(-1, spatials.size(2), spatials.size(3))
        image_mask = image_mask.view(-1, image_mask.size(2))
        question = question.view(-1, question.size(2))
        input_mask = input_mask.view(-1, input_mask.size(2))
        segment_ids = segment_ids.view(-1, segment_ids.size(2))
        co_attention_mask = co_attention_mask.view(
            -1, co_attention_mask.size(2), co_attention_mask.size(3)
        )
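The flattening those views perform can be illustrated with a toy example (plain Python, hypothetical sizes): a [batch, num_options, ...] tensor is collapsed into [batch * num_options, ...] so every candidate option becomes its own row for scoring.

```python
# Toy illustration of collapsing the option dimension, assuming a batch
# of 2 examples with 3 candidate options each (hypothetical sizes).
def flatten_options(batch):
    """[batch][num_options] -> [batch * num_options], like view(-1, ...)."""
    return [option for example in batch for option in example]

questions = [["q0a", "q0b", "q0c"], ["q1a", "q1b", "q1c"]]
flat = flatten_options(questions)
assert len(flat) == 6              # 2 examples * 3 options
assert flat[:3] == questions[0]    # options stay grouped per example
```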

Can you provide your retrieval evaluation metric?
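For context, image-text retrieval is usually scored with Recall@K; this is my assumption about the intended metric, not something taken from this repo's evaluation code. A minimal version:

```python
# Minimal Recall@K sketch (an assumed metric, not this repo's code):
# 1 if the gold item appears among the top-k ranked candidates, else 0.
def recall_at_k(ranked_ids, gold_id, k):
    return int(gold_id in ranked_ids[:k])

ranked = ["img3", "img1", "img7"]          # hypothetical ranking
assert recall_at_k(ranked, "img1", 1) == 0  # gold not in top-1
assert recall_at_k(ranked, "img1", 2) == 1  # gold in top-2
```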

@ZhiyuanChen

The code quality of this repo is surprisingly low; I can't believe this is Facebook engineering. It would take less effort to rewrite it than to debug it.

@chen398936790

Dear all,

I ran into the same problem when I tried to run multi-task learning with the command in the README.md.

Training had already been running for hours, and I could even see a validation pass on refcocog complete at iteration 513.
But the code then returned the same ValueError at iteration 661.

Have you found a solution, or do you have any ideas for fixing this?

Thank you!

@enaserianhanzaei

I've been trying to run this code for three weeks; it's unbelievable how full of bugs it is. I'm starting to wonder whether the reported results are actually true.

@enaserianhanzaei

@zongshenmu @vedanuj @chen398936790 @ZhiyuanChen

I wrote a step-by-step tutorial on how to set up the environment, train and test this model. I also added a section on extracting the visiolinguistic embeddings from the image-text data.
https://naserian-elahe.medium.com/vilbert-a-model-for-learning-joint-representations-of-image-content-and-natural-language-47f56a313a79
I very much appreciate any comments or suggestions.

5 participants