Memory Access Fault or loss diverges when running Tensorflow benchmarks #394
Comments
Hi @xianlopez, to help triage your issue, could you try to use the official TF-ROCm docker image and run the same commands there?
I am reproducing the issue inside the docker image. To download and start the container:
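The exact pull/run commands are not preserved in this extract; a minimal sketch, assuming the public `rocm/tensorflow` image and the device flags from the ROCm Docker instructions:

```bash
# Pull the TensorFlow-ROCm image (the tag is an assumption; use the one matching your ROCm release)
docker pull rocm/tensorflow:latest

# Start the container with access to the GPU device nodes
docker run -it --network=host \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --ipc=host --shm-size 16G \
    rocm/tensorflow:latest
```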
Inside the container:
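The original commands are likewise missing here; a representative run, assuming the stock `tf_cnn_benchmarks` script from the tensorflow/benchmarks repository:

```bash
# Fetch the benchmarks and run one of the affected models
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
# Use whichever interpreter has the ROCm TensorFlow build installed in the container
python3 tf_cnn_benchmarks.py --model=inception4 --batch_size=64 --num_gpus=1
```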
Again, I sometimes get a NaN loss or a Memory Access Fault (or both), although perhaps less often than before. The inception4 model seems to be the most problematic, but it also happens with the others.
Hi @xianlopez, all those tf_cnn_benchmarks models are under our CI and QA coverage, so they shouldn't fail. Can you provide your CPU/motherboard/system memory/PSU details, and send us the logs from the following commands? It would also be helpful if you could run the HIP unit tests and make sure they all pass; instructions:
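The command list itself was not captured in this extract; the usual ROCm triage commands look roughly like the following sketch (not necessarily sunway513's exact list):

```bash
# Platform and GPU information
uname -a
/opt/rocm/bin/rocminfo
/opt/rocm/bin/rocm-smi -a

# Kernel messages from the GPU driver
dmesg | grep -iE 'kfd|amdgpu'

# Installed ROCm packages
apt list --installed 2>/dev/null | grep -E 'rock-dkms|rocm'
```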
Another thing to monitor is your system's cooling. You could try lowering the GPU sclk and see whether the issue is still reproducible; here's the command:
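The specific command isn't preserved; the sclk can be lowered with `rocm-smi`, for example (the level index here is only an illustration, as the valid levels depend on the card):

```bash
# Check current clocks and temperature first
/opt/rocm/bin/rocm-smi

# Pin the GPU core clock to a lower DPM level (0 is the lowest)
sudo /opt/rocm/bin/rocm-smi --setsclk 3

# Restore the default clock behaviour afterwards
sudo /opt/rocm/bin/rocm-smi --resetclocks
```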
Thanks for your help, @sunway513.
System details
Logs right after starting the computer
I attach the output of uname -a and of:
apt list --installed | grep rock-dkms
This last log seems strange to me, since it looks like I have a version of rock-dkms for Ubuntu 16.04, whereas I am running Ubuntu 18.04.
Temperature monitoring
Later I'll try to run the HIP unit tests that you mentioned.
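For reference, a simple way to produce the temperature readings mentioned above, assuming `rocm-smi` is available on the host:

```bash
# Refresh GPU temperature, clock and fan readings every second while the benchmark runs
watch -n 1 /opt/rocm/bin/rocm-smi
```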
I've run the HIP unit tests. Two failed. Full output here. Summary:
Hi @xianlopez, thanks for all those experiments!
Hi again. I got a new GPU and the problem persisted, so I decided to reformat and install everything again. This time I used Ubuntu 16.04 instead of 18.04, and the latest ROCm version, 2.3. Now it works fine; the TensorFlow benchmarks run without issues. However, the same two HIP tests still fail, and I also get the message
System information
Describe the current behavior
Quite often, when running the TensorFlow benchmarks, I get memory errors or the loss diverges. The memory errors are Memory Access Faults (as described in issue 302). When the loss diverges, at some step it either goes to infinity or becomes NaN.
I don't get this behavior every time, but in around 50% of the runs. I tried the models vgg16, resnet50 and inception4, with FP32 and batch sizes of 32, 64 and 128.
I am mostly concerned about the loss divergence, since there is already an issue for the memory error (although with a different GPU), and I have also observed this divergence with a custom model of mine, which prevents me from training it.
Describe the expected behavior
When running these benchmarks with a Titan Xp or a GTX 1080 Ti (on different machines), there are no convergence issues or memory errors.
Code to reproduce the issue
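The original reproduction snippet is not preserved in this extract; given the models and batch sizes listed above, a representative sweep (assuming the stock tf_cnn_benchmarks script) would look like:

```bash
cd benchmarks/scripts/tf_cnn_benchmarks
# Any of these configurations triggers the fault or the divergence in roughly half of the runs
for model in vgg16 resnet50 inception4; do
  for bs in 32 64 128; do
    python3 tf_cnn_benchmarks.py --model="$model" --batch_size="$bs" --num_gpus=1
  done
done
```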