Loss NaN, incorrect value Nan or Inf in input Tensor #27

Open
wangq95 opened this issue Mar 18, 2020 · 6 comments

wangq95 commented Mar 18, 2020

Hi @huaifeng1993, I tried to train DFANet on the Cityscapes dataset, but the loss quickly became NaN, like this:
Epoch 0/1499
step: 0/298 | loss: 910.1328 | IoU of batch: 0.0249
step: 1/298 | loss: 2907000799232.0000 | IoU of batch: 0.0004
step: 2/298 | loss: nan | IoU of batch: 0.0303
step: 3/298 | loss: nan | IoU of batch: 0.0233

I wonder whether the training data needs special preprocessing, or whether there is another way to solve this problem.
Thanks a lot.
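
A minimal sketch of one way to catch this kind of blow-up early, assuming a standard PyTorch training loop; the model, batch, and loss below are placeholders rather than the repository's actual training code, and anomaly detection plus gradient clipping are generic debugging aids, not a confirmed fix:

```python
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)            # report the op that produced a NaN/Inf in backward

model = nn.Conv2d(3, 19, kernel_size=3, padding=1)  # placeholder standing in for DFANet
criterion = nn.CrossEntropyLoss(ignore_index=255)   # 255 is the usual Cityscapes ignore label
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

for step in range(4):
    images = torch.randn(2, 3, 128, 128)            # placeholder batch
    labels = torch.randint(0, 19, (2, 128, 128))
    loss = criterion(model(images), labels)
    if not torch.isfinite(loss):                    # stop at the first nan/inf loss
        print(f"non-finite loss at step {step}: {loss.item()}")
        break
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # keep gradients bounded
    optimizer.step()
```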

huaifeng1993 (Owner) commented

Try a smaller learning rate. The new model has not been tested yet.
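
A minimal sketch of that suggestion, with a placeholder module standing in for DFANet; the actual optimizer settings in the repository may differ:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, kernel_size=3, padding=1)   # placeholder for DFANet
# e.g. drop the base learning rate by an order of magnitude before retrying
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
```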


wangq95 commented Mar 19, 2020

@huaifeng1993 Hi, I use the default XceptionA backbone without pre-trained weights, since pre-training costs too much time. I tried decreasing the learning rate to 1e-2, 1e-3, and 1e-4, but nothing changed. I also printed out the input images and labels, which contained no Inf or NaN values, but the output of DFANet did.
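
A minimal sketch of how to narrow down which layer first produces non-finite values, assuming a standard nn.Module; the model below is a placeholder, not the actual DFANet class:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                                   # placeholder for DFANet
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

def make_nan_hook(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"non-finite output from layer {name} ({module.__class__.__name__})")
    return hook

for name, module in model.named_modules():
    if name:                                             # skip the top-level container
        module.register_forward_hook(make_nan_hook(name))

_ = model(torch.randn(1, 3, 64, 64))                     # run one batch and watch the prints
```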

huaifeng1993 (Owner) commented

When did you download dfanet.py? There is a version that has the problem you met. Check dfanet.py to see whether the code is the same as the one you downloaded.


wangq95 commented Mar 21, 2020

@huaifeng1993 Yes, the loss now decreases as expected. But I find that GPU utilization is very low, and inference speed is only 21 fps at the default resolution on a Tesla V100. What do you think the key limitation is: CPU I/O or the number of workers?
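
For reference, a minimal sketch of the loader-side settings usually involved in this trade-off (worker count, pinned memory, non-blocking copies), with a placeholder dataset instead of the Cityscapes pipeline; the repository's defaults may differ:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(16, 3, 512, 512))    # placeholder for the Cityscapes dataset
loader = DataLoader(dataset, batch_size=4, shuffle=True,
                    num_workers=4,                        # more CPU workers for decoding/augmentation
                    pin_memory=True)                      # enables fast, non-blocking host-to-device copies

device = "cuda" if torch.cuda.is_available() else "cpu"
for (batch,) in loader:
    batch = batch.to(device, non_blocking=True)           # overlap the copy with GPU compute
    break
```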

huaifeng1993 (Owner) commented

I think I/O takes a lot of the time. But you can try loading the data onto the GPU beforehand when you test the model's inference speed.
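
A minimal sketch of that kind of measurement, with a placeholder module standing in for DFANet and the input created directly on the GPU so data loading is excluded from the timing (assumes a CUDA device is available):

```python
import time
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, kernel_size=3, padding=1).cuda().eval()   # placeholder for DFANet
x = torch.randn(1, 3, 1024, 1024, device="cuda")                   # batch already resident on the GPU

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(x)
    torch.cuda.synchronize()                 # CUDA launches are asynchronous; sync before timing
    start = time.time()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()                 # and after, so the timer covers the real work
print(f"{100 / (time.time() - start):.1f} FPS (forward pass only)")
```

If the FPS measured this way is much higher than 21, the bottleneck is on the data-loading side rather than in the network itself.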

xhding1997 commented

> @huaifeng1993 Yes, the loss now decreases as expected. But I find that GPU utilization is very low, and inference speed is only 21 fps at the default resolution on a Tesla V100. What do you think the key limitation is: CPU I/O or the number of workers?

Hello, can you share your XceptionA backbone? I can't find it anywhere.
Thanks!
