
Invalid gradient error when training on GQA with mcan_small #74

Closed
jeasinema opened this issue Aug 6, 2021 · 3 comments

@jeasinema

Hi,

Thanks for this great project and the detailed doc. I really appreciate it.

I was trying to run some experiments on GQA with the default mcan_small model. I followed the instructions to prepare the GQA data and everything seemed to be working. However, when I launched training with the following command:

python3 run.py --RUN='train' --MODEL='mcan_small' --DATASET='gqa'

I got the following error:

[early log is omitted]
Loading validation set for per-epoch evaluation........
 ========== Dataset size: 12578
 ========== Question token vocab size: 2933
Max token length: 29 Trimmed to: 29
 ========== Answer token vocab size: 1843
Finished!

Initializing log file........
Finished!

[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Traceback (most recent call last):
  File "run.py", line 160, in <module>
    execution.run(__C.RUN_MODE)
  File "/local/xiaojianm/workspace/openvqa/utils/exec.py", line 33, in run
    train_engine(self.__C, self.dataset, self.dataset_eval)
  File "/local/xiaojianm/workspace/openvqa/utils/train_engine.py", line 192, in train_engine
    loss.backward()
  File "/local/xiaojianm/anaconda3/envs/default/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/local/xiaojianm/anaconda3/envs/default/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function MmBackward returned an invalid gradient at index 1 - got [6, 2048] but expected shape compatible with [5, 2048]

I'm wondering how this issue can happen. Do you have any suggestions for debugging it?

I'm using torch==1.9.0 with spacy==2.3.7. Please feel free to ping me directly in this thread if more information is needed.
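
For anyone trying to localize a similar failure, PyTorch's anomaly-detection mode will also print the forward-pass trace of the op whose backward produced the bad gradient. A minimal sketch of how to turn it on (the linear layer and shapes below are stand-ins, not the actual openvqa call path):

import torch

torch.autograd.set_detect_anomaly(True)  # record forward traces so backward errors point at the offending op

# stand-in for one training step; the real call sites are in utils/train_engine.py
model = torch.nn.Linear(2048, 1843)
feat = torch.randn(5, 2048)
loss = model(feat).sum()
loss.backward()  # if this raised, the error would now include where the failing op was created in the forward pass

Anomaly mode slows training down noticeably, so it is best enabled only while hunting the bug.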

@jeasinema
Author

[Update]

I got it to run by going back to commit 1180a59 with spacy==2.1.0.
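
For reference, the workaround boils down to something like the following (assuming a plain pip environment; depending on your setup you may also need to re-download the spacy vector model listed in the openvqa setup docs against the older spacy version):

git checkout 1180a59        # roll the repo back to the commit that still works for me
pip install spacy==2.1.0    # pin spacy to the version used around that commit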

@jeasinema
Author

jeasinema commented Aug 7, 2021

[Update 2]

I found that the issue seems to originate from here. The GQA experiment runs fine after commenting that line out.

@MIL-VLG
Collaborator

MIL-VLG commented Nov 16, 2021

Thanks! We have fixed this error.

@MIL-VLG MIL-VLG closed this as completed Nov 16, 2021