Runtime Error: Loss is nan #4

Open
MichailChatzianastasis opened this issue Feb 3, 2022 · 1 comment

@MichailChatzianastasis

Hey,
While I was training the GHN and MLP models, at around 220 epochs I got the following error: error <class 'RuntimeError'> the loss is nan, unable to proceed.
Do you have a solution for this?

Error Message:
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
Out of patience (after 15 attempts to continue), please restart the job with another seed !!!
Traceback (most recent call last):
  File "/ppuda/experiments/train_ghn.py", line 168, in <module>
    main()
  File "/ppuda/experiments/train_ghn.py", line 105, in main
    loss = trainer.update(nets_torch, images, targets, ghn=ghn, graphs=graphs)
  File "/ppuda/../ppuda/ppuda/utils/trainer.py", line 101, in update
    raise RuntimeError('the loss is {}, unable to proceed'.format(loss))
RuntimeError: the loss is nan, unable to proceed
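
For readers unfamiliar with this kind of failure, here is a rough sketch of the pattern the log suggests: a check that raises when the loss is not finite, wrapped in a loop that tolerates a limited number of failures before giving up. This is not the ppuda code. `check_loss` and `run_with_patience` are hypothetical names, and counting failures over the whole run (rather than consecutively) and treating infinity like NaN are both assumptions.

```python
import math


def check_loss(loss):
    """Raise if the loss is NaN or infinite, otherwise return it unchanged."""
    if not math.isfinite(loss):
        raise RuntimeError('the loss is {}, unable to proceed'.format(loss))
    return loss


def run_with_patience(losses, patience=15):
    """Skip steps with a bad loss; give up after `patience` failures.

    Hypothetical helper: failures are counted over the whole run (an assumption).
    """
    failures = 0
    good = []
    for loss in losses:
        try:
            good.append(check_loss(loss))
        except RuntimeError as e:
            failures += 1
            print('error', type(e), e)
            if failures >= patience:
                print('Out of patience (after {} attempts to continue)'.format(patience))
                raise  # re-raise the last error so the job stops
    return good, failures


# One bad step out of three is tolerated and skipped.
good, failures = run_with_patience([0.9, float('nan'), 0.7])
```

With a real training loop, the list of losses would be replaced by calls to the update step, and a bad step would typically also skip the optimizer update.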

@bknyaz
Contributor

bknyaz commented Feb 3, 2022

A simple solution is to restart the job and load from the saved GHN checkpoint. I created a pull request #5, where I added the code to load the existing GHN checkpoint and resume training.
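
As a generic illustration of what saving and resuming from a checkpoint can look like in plain PyTorch, here is a minimal sketch. It is not the code from the pull request: the helper names, the dictionary keys (`epoch`, `state_dict`, `optimizer`), the file name, and the small linear layer standing in for the GHN are all assumptions made for the example.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch):
    """Save everything needed to resume training after `epoch`."""
    torch.save({
        'epoch': epoch,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }, path)


def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer in place; return the epoch to start from."""
    if not os.path.exists(path):
        return 0  # nothing saved yet, so start from scratch
    state = torch.load(path, map_location='cpu')
    model.load_state_dict(state['state_dict'])
    optimizer.load_state_dict(state['optimizer'])
    return state['epoch'] + 1


# Stand-in for the GHN (assumption: any nn.Module works the same way).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
ckpt_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pt')

# During training: save at the end of an epoch (219 here is illustrative).
save_checkpoint(ckpt_path, model, optimizer, epoch=219)

# After a restart: build fresh objects, then restore them from disk.
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.1, momentum=0.9)
start_epoch = load_checkpoint(ckpt_path, resumed_model, resumed_optimizer)
```

Restoring the optimizer state along with the weights matters when momentum or adaptive statistics are in use, since otherwise the first steps after resuming behave differently from an uninterrupted run.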

Let me know if this does not help. Otherwise, feel free to close the issue.
