
Memory Access Fault or loss diverges when running Tensorflow benchmarks #394

Closed
xianlopez opened this issue Apr 4, 2019 · 8 comments
Labels: environment config issue (related to the local environment configurations)

@xianlopez

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 (kernel 4.15)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): pip install tensorflow-rocm
  • TensorFlow version (use command below): 1.13.1
  • Python version: 3.6.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • ROCm/MIOpen version: ROCm 2.2.31
  • GPU model and memory: Radeon VII 16GB

Describe the current behavior

Quite often, when running the TensorFlow benchmarks, I get memory errors or the loss diverges. The memory errors are Memory Access Faults (as described in issue #302). When the loss diverges, at some step it either goes to infinity or becomes NaN (Not a Number).

This doesn't happen every time, but in roughly 50% of runs. I tried the models vgg16, resnet50 and inception4, in FP32, with batch sizes of 32, 64 and 128.

I am mostly concerned about the loss divergence: there is already an issue open for the memory error (although with a different GPU), and I also see the divergence with a custom model of mine, which prevents me from training it.

Describe the expected behavior

When running these benchmarks with a Titan Xp or a GTX 1080 Ti (on different machines), there are no convergence issues or memory errors.

Code to reproduce the issue

git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
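For reference, a minimal sketch of a sweep over the models and batch sizes mentioned above, using the same tf_cnn_benchmarks flags:

# Sketch: repeat the run across the models and batch sizes listed above
for model in vgg16 resnet50 inception4; do
  for bs in 32 64 128; do
    python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=$bs --model=$model
  done
done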
@sunway513

Hi @xianlopez, to help triage your issue, could you try the official TF-ROCm docker image and run the same commands there?
You can follow the instructions provided here:
https://cloud.docker.com/u/rocm/repository/docker/rocm/tensorflow

@sunway513 sunway513 self-assigned this Apr 4, 2019
@sunway513 sunway513 added the environment config issue related to the local environment configurations label Apr 4, 2019
@xianlopez (Author)

I can reproduce the issue inside the docker image.

To download and start the container:

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx'

drun rocm/tensorflow:rocm2.2-tf1.13-python3

Inside the container:

cd benchmarks/scripts/tf_cnn_benchmarks
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception4

Again, I sometimes get NaN in the loss or a Memory Access Fault (or both), although perhaps less often than before. The inception4 model seems to be the most problematic, but it also happens with others.

@sunway513

Hi @xianlopez, all of those tf_cnn_benchmarks models are covered by our CI and QA; they shouldn't fail.
Since you can reproduce the VMEM faults within our official docker image, the issue is likely related to your kernel driver or system configuration.

Can you provide your CPU/motherboard/system memory/PSU details, along with the logs from the following commands?
dmesg
uname -a
apt --installed list | grep rock-dkms

In addition, it would be helpful if you could run the HIP unit tests and make sure they all pass; instructions:
https://github.com/ROCm-Developer-Tools/HIP/tree/master/tests
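For reference, a rough sketch of the build-and-test flow (assuming the standard CMake/CTest setup described in the linked instructions; exact targets and options may differ):

# Sketch only -- see the linked instructions for the authoritative steps
git clone https://github.com/ROCm-Developer-Tools/HIP.git
cd HIP && mkdir build && cd build
cmake ..             # configure against the installed ROCm/HIP
make -j$(nproc)      # build the directed tests
make test            # run the suite through CTest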

@sunway513 commented Apr 5, 2019

Another thing to monitor is your system cooling condition.
When running the benchmarks, could you check the GPU temperatures? Just open another terminal session and use the following command:
watch -n 0.1 /opt/rocm/bin/rocm-smi

You might try lowering the GPU sclk and see whether the issue is still reproducible; here's the command:
/opt/rocm/bin/rocm-smi --setsclk 3
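Putting the two together, a minimal sketch of the monitoring and down-clocking sequence (rocm-smi with no arguments reports the current temperature and clock state, so it can also be used to confirm the new level took effect):

# Terminal 1: watch temperature and clocks while the benchmark runs
watch -n 0.1 /opt/rocm/bin/rocm-smi

# Terminal 2: lower the sclk level, then re-check the reported state
/opt/rocm/bin/rocm-smi --setsclk 3
/opt/rocm/bin/rocm-smi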

@xianlopez (Author)

Thanks for your help, @sunway513

System details
CPU: Ryzen 5 2600 3.4GHz
Motherboard: Asus PRIME X370-PRO
RAM: 16GB DDR4 2400MHz
PSU: 850W

Logs right after starting the computer

I attach the output of dmesg in a file, since it is very long.
dmesg.txt
Something to note about it: the following line appears in red:
[ 5.054230] amdgpu 0000:0b:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

uname -a

xian@ermisenda:~$ uname -a
Linux ermisenda 4.15.18-041518-generic #201804190330 SMP Thu Apr 19 07:34:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

apt --installed list | grep rock-dkms

xian@ermisenda:~$ apt --installed list | grep rock-dkms

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

rock-dkms/Ubuntu 16.04,now 2.2-31 all [installed,automatic]

This last output seems strange to me: it looks like I have a version of rock-dkms packaged for Ubuntu 16.04, while I am running Ubuntu 18.04.

Temperature monitoring
When running the benchmark, the temperature is around 90°C, peaking at up to 100°C.
Interestingly enough, after lowering the GPU sclk as you suggested, I could run the benchmark without any issue. The temperature in this case was around 80°C or a bit more, with a maximum of 90°C.
I hope it is not a cooling problem. My computer has a big case with several fans, there is only one GPU, and there are no CD drives or mechanical disks...

Later I'll try to run the HIP unit tests that you mentioned.

@xianlopez (Author)

I've run the HIP unit tests. Two failed. Full output here.

Summary:

98% tests passed, 2 tests failed out of 120

Total Test time (real) = 119.42 sec

The following tests FAILED:
	 63 - directed_tests/runtimeApi/memory/hipMemcpy-size.tst (OTHER_FAULT)
	109 - directed_tests/deviceLib/hipTestNativeHalf.tst (Not Run)
Errors while running CTest
Makefile:127: recipe for target 'test' failed
make: *** [test] Error 8
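If it helps with triage, the two failing tests can be re-run in isolation with CTest (a sketch, assuming the suite was built with the standard CMake/CTest flow and the commands are run from the build directory):

# Re-run only the failing tests with verbose output
ctest -R hipMemcpy-size --output-on-failure
ctest -R hipTestNativeHalf --output-on-failure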

@sunway513

Hi @xianlopez, thanks for all those experiments!
The current data seems to point to a GPU hardware anomaly. Could you kindly issue an RMA and provide the details (seller name & RMA number)?

@xianlopez (Author)

Hi again. I got a new GPU and the problem persisted, so I decided to reformat and reinstall everything. This time I used Ubuntu 16.04 instead of 18.04, and the latest ROCm version, 2.3. Now it works fine: the TensorFlow benchmarks run without issues. However, the same two HIP tests still fail, and I still get the message amdgpu 0000:0b:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff in dmesg.
