Problems during training. #15
Hi @XuyangBai, The first error is quite strange and I never encountered such behavior on my datasets. It is very unlikely that the loss really became zero if you use correct augmentation strategies. It seems more like a bug, but it will be difficult to help you without reproducing your experiments with your dataset. The second one could be explained by your dataset, as the GPU memory that you see from nvidia-smi depends on the data you feed to the network. What does your data look like? Real or artificial point clouds? Indoor or outdoor scenes? Objects?
Hi @HuguesTHOMAS, I also think the first error may be caused by some bug in my implementation. I just wanted to check whether it could be due to the TensorFlow version. BTW, what is the situation when using TF 1.13 and CUDA 10? For the second one, I mean that for some experiments the GPU memory is always 4400 MB and for others it is always 7000+ MB. It is really strange, but I will keep checking.
Oh sorry, I think I found the reason for the second problem. I forgot about the dropout. It seems that when I use dropout = 0.5, the GPU memory is around 4400 MB, while using dropout = 1 the GPU memory is 7000 MB. Sorry for the bother.
Hi @XuyangBai, If you look at the code, the dropout variable is extremely important in the implementation, because the network uses it to know whether it is in training or test mode. If you use a dropout < 0.99, the network is in training configuration, and if you use dropout = 1, the network is in test configuration. This is a trick that I used to avoid creating a 'training/test' boolean placeholder, and that I never corrected. It will be corrected in the next month (I currently don't have any time to spend on the code). Until then, you should not use dropout = 1 when training, as the variables will not be updated by gradient backpropagation in that case. If you have dropout blocks and don't want to use them, just remove them or use dropout = 0.98 and they will be insignificant. Best,
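To illustrate the trick described above, here is a minimal TF 1.x sketch (not the actual KPConv code; names such as `dropout_prob` are illustrative) of how a dropout probability placeholder can double as a training/test switch:

```python
# Minimal sketch (assumed names, not the exact KPConv implementation):
# the dropout probability placeholder doubles as a training/test switch.
import tensorflow as tf

dropout_prob = tf.placeholder(tf.float32, shape=(), name='dropout_prob')

# The network is treated as "training" whenever dropout_prob < 0.99.
is_training = tf.less(dropout_prob, 0.99)

features = tf.placeholder(tf.float32, shape=[None, 64], name='features')
x = tf.layers.dense(features, 128, activation=tf.nn.relu)

# Layers that behave differently at train/test time read the switch, so
# feeding dropout_prob = 1.0 silently puts the whole network in test mode
# and its statistics/variables are no longer updated as intended.
x = tf.layers.batch_normalization(x, training=is_training)
x = tf.nn.dropout(x, keep_prob=dropout_prob)
```

With such a setup you would feed something like `dropout_prob: 0.5` (or 0.98 to make dropout negligible) during training, and `dropout_prob: 1.0` only at test time.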
Thanks a lot for your reply :)
Hi @XuyangBai, I have noticed similar behaviour to what you have described. I have not been able to debug it, as it occurs randomly.
Hi @nejcd, I didn't find the solution, so I just changed my environment.
Hi, I had some time to dig into this problem and it seems that CUDA 10 is not working correctly with RTX 2080Ti GPUs. Here is what I found:

Tested configurations

Origin of the bug: the matrix multiplication in KPConv/kernels/convolution_ops.py, line 240 (commit 5f9ceca).

Before the appearance of NaN, I noticed some weird values higher than 1e10. If you print the two matrices that are multiplied and the result matrix, you will see that the result is completely false. This seems to be caused by a CUDA internal bug. At some point, one of these mistakes leads to a value so high that it becomes NaN and the network crashes. For now I would just advise avoiding CUDA 10 with an RTX 2080Ti.
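As a debugging aid, a hypothetical helper (not code from the repository; `checked_matmul` is an illustrative name) can wrap the suspect multiplication so the run fails at the first NaN/Inf or abnormally large value instead of silently propagating it:

```python
# Hypothetical TF 1.x helper for localizing the bad values described above.
import tensorflow as tf

def checked_matmul(a, b, name='kp_conv_matmul'):
    out = tf.matmul(a, b, name=name)
    # Raises InvalidArgumentError as soon as the result contains NaN or Inf.
    out = tf.check_numerics(out, message='NaN/Inf in ' + name)
    # Also flag suspiciously large (but still finite) values, like the
    # > 1e10 entries observed before the NaNs appeared.
    max_abs = tf.reduce_max(tf.abs(out))
    assert_op = tf.Assert(max_abs < 1e10,
                          ['Abnormally large value in', name, max_abs])
    with tf.control_dependencies([assert_op]):
        return tf.identity(out)
```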
I ported the model to Keras layers and tried training it on a Tesla V100 GPU (CUDA 10.2, TF 2.0), and I also got NaN values after some epochs. After changing the KP influence from Gaussian to linear, everything worked fine, so I would assume the issue lies in the gradient computation for the Gaussian influence, although increasing the epsilon from 1e-9 to 1e-6 did not resolve the problem. The linear influence works just fine and in my case leads to good results with higher computational efficiency.
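For reference, here is a rough sketch of the two influence modes being compared (the names, shapes, and exact placement of the epsilon are assumptions for illustration and may differ from kernels/convolution_ops.py):

```python
# Sketch of linear vs. Gaussian kernel-point influence weights (assumed API).
import tensorflow as tf

def kp_influence(sq_distances, extent, mode='linear', eps=1e-6):
    """sq_distances: squared distances between neighbors and kernel points."""
    if mode == 'linear':
        # max(0, 1 - d / extent): bounded, zero outside the kernel extent,
        # with a simple, well-behaved gradient.
        return tf.maximum(1.0 - tf.sqrt(sq_distances + eps) / extent, 0.0)
    elif mode == 'gaussian':
        # exp(-d^2 / (2 * sigma^2 + eps)): never exactly zero; this is the
        # branch where the NaN behaviour reported above was observed.
        sigma = 0.3 * extent  # illustrative bandwidth choice
        return tf.exp(-sq_distances / (2.0 * tf.square(sigma) + eps))
    else:
        raise ValueError('Unknown influence mode: ' + mode)
```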
Thanks for your great work. I have been stuck on this problem for a long time. I want to know which version of Python you used.
If I remember correctly, the Python version was 3.5 or 3.6. If you are willing to switch libraries, a newer implementation has been released in PyTorch.
Just to mention my experience: I got NaN when I used TensorFlow 1.15, CUDA 10, and cuDNN 7.6.5.
I use TF 1.15 and also get NaN. You can work around this by reducing the batch size to 2.
Nice.
Hi, thanks for sharing your work. I have tried your code on my own dataset and found that initially everything goes well, but after several epochs the training suddenly breaks (the accuracy becomes 1 and the loss becomes 0). I use TF 1.12.0, CUDA 9.0, and cuDNN 7.1.4:

```
# conda list | grep tensorflow
tensorflow-estimator      1.13.0    py_0      anaconda
tensorflow-gpu            1.12.0    pypi_0    pypi
tensorflow-tensorboard    0.4.0     pypi_0    pypi
```

Have you met this kind of problem? Another potential issue is that sometimes the training takes 4400 MB of GPU memory (seen from `nvidia-smi`), but sometimes it takes more than 7000 MB, even though I do not change the batch size or network architecture. I am pretty confused about these problems. Could you give me some advice?