
cifar10 - Memory access fault during training #300

Closed
seesturm opened this issue Jan 26, 2019 · 19 comments

@seesturm

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 on Linux 4.20.3
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.12.0-871-gf480b4a 1.12.0
  • Python version: 3.6.7
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: ROCm 2.0
  • GPU model and memory: Radeon RX 460 2GB

Describe the current behavior
Training terminates early with message:
Memory access fault by GPU node-2 (Agent handle: 0x2a6c3e0) on address 0x120606e000. Reason: Page not present or supervisor privilege.

Describe the expected behavior
Training finishes without "Memory access fault".

Code to reproduce the issue

git clone https://github.com/tensorflow/models.git
cd models/tutorials/image/cifar10
python3 cifar10_train.py

Other info / logs
dmesg.log
tf.log

@sunway513

@seesturm I cannot reproduce the issue on my RX480 8GB Polaris GPU using TF1.12 and ROCm2.0.
However, the failure signature does look similar to the issues reported in #282 and #301.

@seesturm
Author

Just to repeat the obvious information: I'm currently using the low-end Baffin chip with only 2 GiB of memory. It turns out that many TensorFlow examples do not work due to the limited memory, but they mostly fail in a "good" way by reporting early that memory allocation failed.

Aside from the access fault, another observation is that the cifar10 training sometimes freezes. Not the whole system freezes, only the training script. When "frozen", the script still consumes CPU cycles, albeit less intensely. I can terminate the script and restart it; eventually a run does not freeze but instead causes the "Memory access fault". When I limit the number of training steps it often does finish, but I never managed to run training for more than 10 minutes.
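
For reference, a minimal sketch (untested, just an idea) of how a TF 1.12 session config could cap GPU memory on the 2 GiB card so that allocation failures are reported early by the allocator; it assumes cifar10_train.py is modified locally to pass the config to its MonitoredTrainingSession, and the 0.7 fraction is an arbitrary choice:

# Sketch only, assuming TF 1.12 and a locally modified cifar10_train.py.
import tensorflow as tf

gpu_options = tf.GPUOptions(
    per_process_gpu_memory_fraction=0.7,  # arbitrary cap; leaves headroom on the 2 GiB RX 460
    allow_growth=True,                    # allocate GPU memory incrementally instead of up front
)
config = tf.ConfigProto(gpu_options=gpu_options)

# Assumed integration point: the tutorial builds its session with
# tf.train.MonitoredTrainingSession, which accepts a session config.
with tf.train.MonitoredTrainingSession(config=config) as sess:
    pass  # the tutorial's training loop would run here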

@seesturm
Author

Out of curiosity I've run the HIP tests mentioned in #282. Observations:

  • Test 74 (hipEventRecord) takes 118 seconds, which is longer than any of the other tests. During that test the system becomes very laggy (e.g. mouse movement is no longer visible).
  • Test 108 (hipStreamCreateWithPriority) failed. The message during the test is: "***Failed Required regular expression not found.Regex=[PASSED]"
  • Aside from Test 108, all other tests passed.

@sunway513

Thanks @seesturm, the information helps. I agree that the application shouldn't exit with a VMEM access fault.
One question: does your system use the same RX 460 dGPU for graphics rendering, i.e. is it directly connected to the monitor?
We recommend using a dedicated GPU device for compute; can you try that?
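
If the box has a second GPU, one way to pin the compute work to it would be to expose only that device to the ROCm runtime via HIP_VISIBLE_DEVICES before TensorFlow is imported. A sketch, where the device index 1 is only an assumption about which card is the dedicated compute GPU:

# Sketch only: restrict HIP/ROCm to a single device before importing TensorFlow.
import os

os.environ["HIP_VISIBLE_DEVICES"] = "1"  # assumed index of the dedicated compute GPU

import tensorflow as tf  # must be imported after the environment variable is set

print(tf.test.is_gpu_available())  # True if TensorFlow can still see a GPU

Exporting HIP_VISIBLE_DEVICES=1 in the shell before running cifar10_train.py should have the same effect.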

@seesturm
Author

I've now set up my workstation to boot into "multi-user.target" and logged into the machine remotely via SSH. This way the GPU should be essentially unused for graphics.

The problem is still present and training never succeeds: either the memory access fault occurs, or the script freezes.

@sunway513

@seesturm, thanks for the effort!
Can you provide the dmesg output while you see the freeze?
You can do so by opening an additional terminal and running the following command:
dmesg -wH
The dmesg output from after the system freezes would be very helpful.

@seesturm
Author

Sorry, there is basically no special output. The only thing visible is the usual Evicting/Restoring PASID messages. Here is the "final" output before the freeze:

[Jan28 20:39] Evicting PASID 32768 queues
[  +0,005707] Restoring PASID 32768 queues
[ +23,778411] Evicting PASID 32768 queues
[  +0,009995] Restoring PASID 32768 queues
[Jan28 20:40] Evicting PASID 32768 queues
[  +0,005739] Restoring PASID 32768 queues

The only other information I can provide: according to "sensors", amdgpu-pci-4100 consumes about 35 W during training and 20 W when "frozen". After terminating training, GPU power consumption drops to the normal idle of 10 W. While in the frozen state, the training process consumes about 10% of a single CPU core.

@seesturm
Author

Usually I run the training while prime95 is running in the background (on the CPU). Now I tested without prime95, and training seems to run stably.

Really strange, since from the CPU's (1950X) point of view the system is 100% stable. Power consumption should not be an issue either, since the supply is rated for 850 W.

@sunway513

The dmesg looks normal, and it's great to know the training is stable now.
I'm not sure why the prime95 process can make the GPU application unstable; I would guess it's related to your system memory utilization.

@seesturm
Author

I've now run the training with an active Wayland graphics session. Training also completes successfully within the graphics session.

Although I now have a workaround, I'm still asking myself whether my hardware is faulty, or whether there is some problem I should report against the ROCm software.

@sunway513

Hi @seesturm, can you try the following steps and see if they fix your issue?

cd ~ && mkdir rocm1.9.2-opencl && cd rocm1.9.2-opencl &&
wget https://www.dropbox.com/s/rtwe1zrpuphbyqm/rocm-opencl-1.2.0-2018111340_amd64.deb && 
wget https://www.dropbox.com/s/6gp2g5zju66i4e9/rocm-opencl-dev-1.2.0-2018111340_amd64.deb && 
sudo dpkg -i rocm-opencl*.deb && rm -rf ~/.cache

@seesturm
Author

Tried it now with the downgraded packages and prime95 in the background. Still seeing the access fault.

@seesturm
Author

Ran it now with the following environment variables set:

export HCC_SERIALIZE_KERNEL=0x3     # serialize HCC kernel launches
export HCC_SERIALIZE_COPY=0x3       # serialize HCC memory copies
export HIP_TRACE_API=0x2            # enable HIP API tracing
export MIOPEN_ENABLE_LOGGING_CMD=1  # log the MIOpenDriver command for each MIOpen call

This time it did not cause an access fault but got stuck. The last 2700 debug output lines are attached here:
debug.log.gz

@seesturm
Author

Just now I got another run that ended with an access fault:
debug-fault.txt.gz

@sunway513

@seesturm, can you run only the cifar10 training, without prime95 in the background?

@seesturm
Author

It freezes even without prime95; it is just less likely.

@sunway513

@seesturm, I cannot reproduce the failure locally using the downgraded OpenCL packages on Polaris GPUs:

MIOpenDriver: conv -n 128 -c 64 -H 12 -W 12 -k 64 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -t 1
MIOpen Forward Conv. Algorithm: 3
GPU Kernel Time Forward Conv. Elapsed: 0.853023 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv5x5u1, 128, 64, 12, 12, 5, 5, 64,  3774873600, 5128192, 4718592, 4425, 12, 0.853023
Forward Convolution Verifies on CPU and GPU (7.73642e-08)
MIOpen Backward Data Conv. Algorithm: 3
GPU Kernel Time Backward Data Conv. Elapsed: 0.803156 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv5x5u1, 128, 64, 5, 5, 64, 12, 12,  3774873600, 2506752, 4718592, 4700, 9, 0.803156
MIOpen Backward Weights Conv. Algorithm: 1
GPU Kernel Time Backward Weights Conv. Elapsed: 3.843384 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv5x5u1, 128, 64, 12, 12, 5, 5, 64,  3774873600, 0, 0, 982, 0, 3.843384
Backward Convolution Data Verifies on CPU and GPU (6.79376e-08)
Backward Convolution Weights Verifies on CPU and GPU (1.34523e-07)

@seesturm
Author

@sunway513 I see. The best I can do is report the behavior I'm seeing and hope that it helps. It could well be that the cause of my observations is completely different from all the other reports.

@sunway513

Thanks @seesturm, appreciate your data points :-)
