Loss NaN, incorrect value Nan or Inf in input Tensor #27
Comments
Try a smaller learning rate. The new model has not been tested yet.
@huaifeng1993 Hi, I use the default XceptionA as the backbone network without pre-trained weights, since pre-training costs too much time. I tried decreasing the learning rate to 1e-2, 1e-3, and 1e-4, but nothing changed. I also printed out the statistics of the input images and labels, which did not contain any Inf or NaN values, but the output of DFANet did.
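For reference, a minimal sketch of the kind of check described above, for finding where non-finite values first appear; `check_finite` is a hypothetical helper, and `images`, `labels`, and `model` are placeholders, not names from this repo:

```python
import torch

# Hypothetical helper: report whether a tensor contains NaN or Inf values.
def check_finite(name, tensor):
    n_nan = torch.isnan(tensor).sum().item()
    n_inf = torch.isinf(tensor).sum().item()
    if n_nan or n_inf:
        print(f"{name}: {n_nan} NaN, {n_inf} Inf values "
              f"(min={tensor.min().item():.4g}, max={tensor.max().item():.4g})")
    return n_nan == 0 and n_inf == 0

# Usage inside the training loop, before the backward pass:
# check_finite("images", images)
# check_finite("labels", labels.float())
# outputs = model(images)
# check_finite("outputs", outputs)
```

If the inputs pass this check but the outputs do not, the NaN is being produced inside the forward pass rather than by the data pipeline.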
When did you download dfanet.py? There is a version that has the problem you are hitting. Check dfanet.py to see whether your copy matches the current code.
@huaifeng1993 Yep, the loss now decreases as expected. But I find that GPU utilization is too low, and inference speed is only 21 FPS at the default resolution on a Tesla V100. What do you think the key limitation is: CPU I/O or the number of workers?
I think I/O takes a lot of time. But you can try loading the data onto the GPU beforehand when you test the model's inference speed.
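A minimal sketch of what "loading data onto the GPU beforehand" looks like when timing inference, so host-to-device copies and dataloader overhead are excluded from the measurement; the stand-in model and the 1024×1024 input size are assumptions, not taken from the repo:

```python
import time
import torch

device = torch.device("cuda")

# Stand-in for the real network; replace with the trained DFANet model.
model = torch.nn.Conv2d(3, 19, kernel_size=3, padding=1).to(device).eval()

# Pre-load a dummy batch onto the GPU so host-to-device copies and
# CPU-side data loading are not counted in the timing.
x = torch.randn(1, 3, 1024, 1024, device=device)

with torch.no_grad():
    # Warm-up iterations to exclude CUDA initialization overhead.
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

    n_iters = 100
    start = time.time()
    for _ in range(n_iters):
        model(x)
    # Wait for all queued kernels to finish before stopping the clock.
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"inference speed: {n_iters / elapsed:.1f} FPS")
```

The `torch.cuda.synchronize()` calls matter: CUDA kernels launch asynchronously, so without them the timer measures kernel launch time rather than actual execution time.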
Hello, can you share your XceptionA backbone?
Hi, @huaifeng1993, I tried to train DFANet on the Cityscapes dataset, but the loss directly became NaN, like this:
Epoch 0/1499
step: 0/298 | loss: 910.1328 | IoU of batch: 0.0249
step: 1/298 | loss: 2907000799232.0000 | IoU of batch: 0.0004
step: 2/298 | loss: nan | IoU of batch: 0.0303
step: 3/298 | loss: nan | IoU of batch: 0.0233
I wonder whether the training data needs any special preprocessing, or whether there are other solutions to this problem.
Thanks a lot.
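One generic way to locate where a NaN like this originates is PyTorch's built-in anomaly detection, which raises a RuntimeError at the backward op that produced a non-finite gradient, with a traceback pointing at the responsible forward operation. This is a self-contained debugging sketch; the model, loss, optimizer, and data below are hypothetical stand-ins, not the repo's training code:

```python
import torch

# Slows training down noticeably, so enable only while debugging.
torch.autograd.set_detect_anomaly(True)

# Hypothetical stand-ins for the real training setup.
model = torch.nn.Conv2d(3, 19, kernel_size=3, padding=1)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
images = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 19, (2, 64, 64))

optimizer.zero_grad()
loss = criterion(model(images), labels)
# Stop early if the loss itself is already non-finite.
if not torch.isfinite(loss):
    raise RuntimeError(f"non-finite loss: {loss.item()}")
loss.backward()
optimizer.step()
```

Given the log above, where the loss explodes from ~910 to ~2.9e12 before going NaN, the pattern is consistent with diverging gradients, which is why a smaller learning rate was suggested as the first thing to try.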