Hey,
While I was training GHN and MLP models, at around epoch 220, I got the following error: error <class 'RuntimeError'> the loss is nan, unable to proceed.
Do you have any solution for this?
Error Message:
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
Out of patience (after 15 attempts to continue), please restart the job with another seed !!!
Traceback (most recent call last):
  File "/ppuda/experiments/train_ghn.py", line 168, in <module>
    main()
  File "/ppuda/experiments/train_ghn.py", line 105, in main
    loss = trainer.update(nets_torch, images, targets, ghn=ghn, graphs=graphs)
  File "/ppuda/../ppuda/ppuda/utils/trainer.py", line 101, in update
    raise RuntimeError('the loss is {}, unable to proceed'.format(loss))
RuntimeError: the loss is nan, unable to proceed
A simple solution is to restart the job and load from the saved GHN checkpoint. I opened pull request #5, which adds code to load an existing GHN checkpoint and resume training.
Let me know if this does not help. Otherwise, feel free to close the issue.
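For anyone who wants to see the restart-and-resume idea in isolation, here is a minimal, framework-agnostic sketch. It is not the code from the pull request: `save_checkpoint`, `load_checkpoint`, `train` and the `step_fn` callback are hypothetical names made up for illustration, and the "model" is a single float. In a real PyTorch job you would presumably store the model and optimizer `state_dict()`s with `torch.save` and restore them with `torch.load` and `load_state_dict` instead of pickling a dict.

```python
import math
import os
import pickle


def save_checkpoint(path, state):
    """Write the state via a temp file so a crash mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, epochs, step_fn):
    """Run step_fn once per epoch, resuming from `path` if it exists.

    step_fn(epoch, params) -> (loss, new_params) is a hypothetical stand-in
    for a real update step.
    """
    # Start from the last good checkpoint, or from scratch if there is none.
    state = load_checkpoint(path) or {"epoch": 0, "params": 0.0}
    while state["epoch"] < epochs:
        loss, params = step_fn(state["epoch"], state["params"])
        if math.isnan(loss):
            # Stop before saving, so the checkpoint keeps the last good state.
            raise RuntimeError("the loss is {}, unable to proceed".format(loss))
        state = {"epoch": state["epoch"] + 1, "params": params}
        save_checkpoint(path, state)
    return state
```

Because the checkpoint is only written after a finite loss, rerunning `train` with the same `path` after the crash continues from the last good epoch rather than from epoch 0.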