
cifar10 - Memory access fault during training #300

Closed
seesturm opened this issue Jan 26, 2019 · 19 comments

@seesturm

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 on Linux 4.20.3
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.12.0-871-gf480b4a 1.12.0
  • Python version: 3.6.7
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: ROCm 2.0
  • GPU model and memory: Radeon RX 460 2GB

Describe the current behavior
Training terminates early with message:
Memory access fault by GPU node-2 (Agent handle: 0x2a6c3e0) on address 0x120606e000. Reason: Page not present or supervisor privilege.

Describe the expected behavior
Training finishes without "Memory access fault".

Code to reproduce the issue

git clone https://github.com/tensorflow/models.git
cd models/tutorials/image/cifar10
python3 cifar10_train.py

Other info / logs
dmesg.log
tf.log

@sunway513

@seesturm I cannot reproduce the issue on my RX480 8GB Polaris GPU using TF1.12 and ROCm2.0.
However, the failure signature does look similar to the issues reported in #282 and #301.

@seesturm
Author

Just to repeat the obvious information: I'm currently using the low-end Baffin chip with only 2 GiB of memory. It turns out that many TensorFlow examples do not work due to the limited memory, but they mostly fail in a "good" way by reporting early that memory allocation failed.

Aside from the access fault, another observation is that the cifar10 training sometimes freezes. Not the whole system freezes, only the training script. When "frozen", the script still consumes CPU cycles, albeit less intensely. I can terminate the script and restart it; eventually a run does not freeze but instead causes the "Memory access fault". When I limit the number of training steps it often does finish, but I never managed to run training for more than 10 minutes.
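
For reference, a minimal sketch (untested, just an idea) of how a TF 1.12 session config could cap GPU memory on the 2 GiB card so that allocation failures are reported early by the allocator; it assumes cifar10_train.py is modified locally to pass the config to its MonitoredTrainingSession, and the 0.7 fraction is an arbitrary choice:

# Sketch only, assuming TF 1.12 and a locally modified cifar10_train.py.
import tensorflow as tf

gpu_options = tf.GPUOptions(
    per_process_gpu_memory_fraction=0.7,  # arbitrary cap; leaves headroom on the 2 GiB RX 460
    allow_growth=True,                    # allocate GPU memory incrementally instead of up front
)
config = tf.ConfigProto(gpu_options=gpu_options)

# Assumed integration point: the tutorial builds its session with
# tf.train.MonitoredTrainingSession, which accepts a session config.
with tf.train.MonitoredTrainingSession(config=config) as sess:
    pass  # the tutorial's training loop would run here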

@seesturm
Author

Out of curiosity I've run the HIP tests mentioned in #282. Observations:

  • Test 74 (hipEventRecord) takes 118 seconds, which is longer than any of the other tests. During that test the system becomes very laggy (e.g. mouse movement is no longer visible).
  • Test 108 (hipStreamCreateWithPriority) failed. The message during the test is: "***Failed Required regular expression not found.Regex=[PASSED]"
  • Aside from Test 108, all other tests passed.

@sunway513

Thanks @seesturm, the information helps. I agree that the application shouldn't exit with a VMEM access fault.
One question: does your system use the same RX 460 dGPU for graphics rendering, i.e. is it directly connected to the monitor?
We recommend using a dedicated GPU device for compute; can you try that?
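
If the box has a second GPU, one way to pin the compute work to it would be to expose only that device to the ROCm runtime via HIP_VISIBLE_DEVICES before TensorFlow is imported. A sketch, where the device index 1 is only an assumption about which card is the dedicated compute GPU:

# Sketch only: restrict HIP/ROCm to a single device before importing TensorFlow.
import os

os.environ["HIP_VISIBLE_DEVICES"] = "1"  # assumed index of the dedicated compute GPU

import tensorflow as tf  # must be imported after the environment variable is set

print(tf.test.is_gpu_available())  # True if TensorFlow can still see a GPU

Exporting HIP_VISIBLE_DEVICES=1 in the shell before running cifar10_train.py should have the same effect.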

@seesturm
Author

I've now set up my workstation to boot into "multi-user.target" and logged into the machine remotely via SSH. This way the GPU should be essentially unused for graphics.

The problem is still present and training never succeeds: either the memory access fault occurs, or the script freezes.

@sunway513

@seesturm, thanks for the effort!
Can you provide the dmesg output while you see the freeze?
You can do so by opening an additional terminal and running the following command:
dmesg -wH
The dmesg output from after the system freezes would be very helpful.

@seesturm
Author

Sorry, there is basically no special output. The only thing visible is the usual Evicting/Restoring PASID messages. Here is the "final" output before the freeze:

[Jan28 20:39] Evicting PASID 32768 queues
[  +0,005707] Restoring PASID 32768 queues
[ +23,778411] Evicting PASID 32768 queues
[  +0,009995] Restoring PASID 32768 queues
[Jan28 20:40] Evicting PASID 32768 queues
[  +0,005739] Restoring PASID 32768 queues

The only other information I can provide: according to "sensors", amdgpu-pci-4100 consumes about 35 W during training and 20 W when "frozen". After terminating training, GPU power consumption drops to the normal idle of 10 W. While in the frozen state, the training process consumes about 10% of a single CPU core.

@seesturm
Author

Usually I run the training while prime95 is running in the background (on the CPU). Now I tested without prime95, and training seems to run stably.

Really strange, since from the CPU's (1950X) point of view the system is 100% stable. Power consumption should not be an issue either, since the supply is rated for 850 W.

@sunway513

The dmesg looks normal, and it's great to know the training is stable now.
I'm not sure why the prime95 process can make the GPU application unstable; I would guess it's related to your system memory utilization.

@seesturm
Author

I've now run the training with an active Wayland graphics session. Training also completes successfully within the graphics session.

Although I now have a workaround, I'm still asking myself whether my hardware is faulty, or whether there is some problem I should report against the ROCm software.

@sunway513

Hi @seesturm, can you try the following steps and see if they fix your issue?

cd ~ && mkdir rocm1.9.2-opencl && cd rocm1.9.2-opencl &&
wget https://www.dropbox.com/s/rtwe1zrpuphbyqm/rocm-opencl-1.2.0-2018111340_amd64.deb && 
wget https://www.dropbox.com/s/6gp2g5zju66i4e9/rocm-opencl-dev-1.2.0-2018111340_amd64.deb && 
sudo dpkg -i rocm-opencl*.deb && rm -rf ~/.cache

@seesturm
Author

Tried it now with the downgraded packages and prime95 in the background. Still seeing the access fault.

@seesturm
Author

Ran it now with the following environment variables set:

export HCC_SERIALIZE_KERNEL=0x3     # serialize HCC kernel launches
export HCC_SERIALIZE_COPY=0x3       # serialize HCC memory copies
export HIP_TRACE_API=0x2            # enable HIP API tracing
export MIOPEN_ENABLE_LOGGING_CMD=1  # log the MIOpenDriver command for each MIOpen call

This time it did not cause an access fault but got stuck. The last 2700 debug output lines are attached here:
debug.log.gz

@seesturm
Author

Just now I got another run that ended with an access fault:
debug-fault.txt.gz

@sunway513

@seesturm, can you run only the cifar10 training, without prime95 in the background?

@seesturm
Author

It freezes even without prime95; it is just less likely.

@sunway513

@seesturm, I cannot reproduce the failure locally using the downgraded OpenCL packages on Polaris GPUs:

MIOpenDriver: conv -n 128 -c 64 -H 12 -W 12 -k 64 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -t 1
MIOpen Forward Conv. Algorithm: 3
GPU Kernel Time Forward Conv. Elapsed: 0.853023 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv5x5u1, 128, 64, 12, 12, 5, 5, 64,  3774873600, 5128192, 4718592, 4425, 12, 0.853023
Forward Convolution Verifies on CPU and GPU (7.73642e-08)
MIOpen Backward Data Conv. Algorithm: 3
GPU Kernel Time Backward Data Conv. Elapsed: 0.803156 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv5x5u1, 128, 64, 5, 5, 64, 12, 12,  3774873600, 2506752, 4718592, 4700, 9, 0.803156
MIOpen Backward Weights Conv. Algorithm: 1
GPU Kernel Time Backward Weights Conv. Elapsed: 3.843384 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv5x5u1, 128, 64, 12, 12, 5, 5, 64,  3774873600, 0, 0, 982, 0, 3.843384
Backward Convolution Data Verifies on CPU and GPU (6.79376e-08)
Backward Convolution Weights Verifies on CPU and GPU (1.34523e-07)

@seesturm
Author

@sunway513 I see. The best I can do is report the behavior I'm seeing and hope that it helps. It could well be that the cause of my observations is completely different from all the other reports.

@sunway513

Thanks @seesturm, appreciate your data points :-)
