
Memory Access Fault or loss diverges when running Tensorflow benchmarks #394

Closed
xianlopez opened this issue Apr 4, 2019 · 8 comments
Labels: environment config issue (related to the local environment configurations)

@xianlopez

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 (kernel 4.15)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): pip install tensorflow-rocm
  • TensorFlow version (use command below): 1.13.1
  • Python version: 3.6.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • ROCm/MIOpen version: ROCm 2.2.31
  • GPU model and memory: Radeon VII 16GB

Describe the current behavior

Quite often, when running the TensorFlow benchmarks, I get memory errors or the loss diverges. The memory errors are Memory Access Faults (as described in issue #302). When the loss diverges, at some step it either goes to infinity or becomes NaN (Not a Number).

This doesn't happen every time, but in roughly 50% of runs. I tried the models vgg16, resnet50 and inception4, in FP32, with batch sizes of 32, 64 and 128.

I am mostly concerned about the loss divergence: there is already an issue open for the memory error (although with a different GPU), and I also see the divergence with a custom model of mine, which prevents me from training it.

Describe the expected behavior

When running these benchmarks with a Titan Xp or a GTX 1080 Ti (on different machines), there are no convergence issues or memory errors.

Code to reproduce the issue

git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
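For reference, a minimal sketch of a sweep over the models and batch sizes mentioned above, using the same tf_cnn_benchmarks flags:

# Sketch: repeat the run across the models and batch sizes listed above
for model in vgg16 resnet50 inception4; do
  for bs in 32 64 128; do
    python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=$bs --model=$model
  done
done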
@sunway513

Hi @xianlopez, to help triage your issue, could you try the official TF-ROCm docker image and run the same commands there?
You can follow the instructions provided here:
https://cloud.docker.com/u/rocm/repository/docker/rocm/tensorflow

@sunway513 sunway513 self-assigned this Apr 4, 2019
@sunway513 sunway513 added the environment config issue related to the local environment configurations label Apr 4, 2019
@xianlopez (Author)

I can reproduce the issue inside the docker image.

To download and start the container:

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx'

drun rocm/tensorflow:rocm2.2-tf1.13-python3

Inside the container:

cd benchmarks/scripts/tf_cnn_benchmarks
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception4

Again, I sometimes get NaN in the loss or a Memory Access Fault (or both), although perhaps less often than before. The inception4 model seems to be the most problematic, but it also happens with others.

@sunway513

Hi @xianlopez, all of those tf_cnn_benchmarks models are covered by our CI and QA; they shouldn't fail.
Since you can reproduce the VMEM faults within our official docker image, the issue is likely related to your kernel driver or system configuration.

Can you provide your CPU/motherboard/system memory/PSU details, along with the logs from the following commands?
dmesg
uname -a
apt --installed list | grep rock-dkms

In addition, it would be helpful if you could run the HIP unit tests and make sure they all pass; instructions:
https://github.com/ROCm-Developer-Tools/HIP/tree/master/tests
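For reference, a rough sketch of the build-and-test flow (assuming the standard CMake/CTest setup described in the linked instructions; exact targets and options may differ):

# Sketch only -- see the linked instructions for the authoritative steps
git clone https://github.com/ROCm-Developer-Tools/HIP.git
cd HIP && mkdir build && cd build
cmake ..             # configure against the installed ROCm/HIP
make -j$(nproc)      # build the directed tests
make test            # run the suite through CTest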

@sunway513 commented Apr 5, 2019

Another thing to monitor is your system cooling condition.
When running the benchmarks, could you check the GPU temperatures? Just open another terminal session and use the following command:
watch -n 0.1 /opt/rocm/bin/rocm-smi

You might try lowering the GPU sclk and see whether the issue is still reproducible; here's the command:
/opt/rocm/bin/rocm-smi --setsclk 3
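Putting the two together, a minimal sketch of the monitoring and down-clocking sequence (rocm-smi with no arguments reports the current temperature and clock state, so it can also be used to confirm the new level took effect):

# Terminal 1: watch temperature and clocks while the benchmark runs
watch -n 0.1 /opt/rocm/bin/rocm-smi

# Terminal 2: lower the sclk level, then re-check the reported state
/opt/rocm/bin/rocm-smi --setsclk 3
/opt/rocm/bin/rocm-smi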

@xianlopez (Author)

Thanks for your help, @sunway513

System details
CPU: Ryzen 5 2600 3.4GHz
Motherboard: Asus PRIME X370-PRO
RAM: 16GB DDR4 2400MHz
PSU: 850W

Logs right after starting the computer

I attach the output of dmesg in a file, since it is very long.
dmesg.txt
Something to note about it: the following line appears in red:
[ 5.054230] amdgpu 0000:0b:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

uname -a

xian@ermisenda:~$ uname -a
Linux ermisenda 4.15.18-041518-generic #201804190330 SMP Thu Apr 19 07:34:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

apt --installed list | grep rock-dkms

xian@ermisenda:~$ apt --installed list | grep rock-dkms

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

rock-dkms/Ubuntu 16.04,now 2.2-31 all [installed,automatic]

This last output seems strange to me: it looks like I have a version of rock-dkms packaged for Ubuntu 16.04, while I am running Ubuntu 18.04.

Temperature monitoring
When running the benchmark, the temperature is around 90°C, peaking at up to 100°C.
Interestingly enough, after lowering the GPU sclk as you suggested, I could run the benchmark without any issue. The temperature in this case was around 80°C or a bit more, with a maximum of 90°C.
I hope it is not a cooling problem. My computer has a big case with several fans, there is only one GPU, and there are no CD drives or mechanical disks...

Later I'll try to run the HIP unit tests that you mentioned.

@xianlopez (Author)

I've run the HIP unit tests. Two failed. Full output here.

Summary:

98% tests passed, 2 tests failed out of 120

Total Test time (real) = 119.42 sec

The following tests FAILED:
	 63 - directed_tests/runtimeApi/memory/hipMemcpy-size.tst (OTHER_FAULT)
	109 - directed_tests/deviceLib/hipTestNativeHalf.tst (Not Run)
Errors while running CTest
Makefile:127: recipe for target 'test' failed
make: *** [test] Error 8
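If it helps with triage, the two failing tests can be re-run in isolation with CTest (a sketch, assuming the suite was built with the standard CMake/CTest flow and the commands are run from the build directory):

# Re-run only the failing tests with verbose output
ctest -R hipMemcpy-size --output-on-failure
ctest -R hipTestNativeHalf --output-on-failure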

@sunway513

Hi @xianlopez, thanks for all those experiments!
The current data seems to point to a GPU hardware anomaly. Could you kindly issue an RMA and provide the details (seller name & RMA number)?

@xianlopez (Author)

Hi again. I got a new GPU and the problem persisted, so I decided to reformat and reinstall everything. This time I used Ubuntu 16.04 instead of 18.04, and the latest ROCm version, 2.3. Now it works fine: the TensorFlow benchmarks run without issues. However, the same two HIP tests still fail, and I still get the message amdgpu 0000:0b:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff in dmesg.
