cifar10 - Memory access fault during training #300
Just to repeat the obvious information: I'm currently using the low-end Baffin chip with only 2GiB of memory. It turns out that many TensorFlow examples do not work due to the limited memory, but they mostly fail in a "good" way by reporting early that memory allocation failed. Aside from the access fault, another observation is that the cifar10 training sometimes freezes. Not the whole system freezes, only the training script. When "frozen", the script still consumes CPU cycles, albeit less intensely. I can terminate the script and restart it. Eventually it does not freeze but causes the "Memory access fault". When I limit the number of training steps it often does finish, but I never managed to run training for more than 10 minutes.
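As a side note on the limited-memory failures mentioned above, here is a minimal sketch (not taken from this issue, assuming the TensorFlow 1.x API of that era) of constraining GPU memory allocation on a 2GiB card; it does not address the access fault itself, only the "good" out-of-memory failures:

```python
# Minimal sketch: limit TensorFlow 1.x GPU memory allocation on a small card.
# This only mitigates plain out-of-memory failures, not the access fault.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True                     # allocate VRAM on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.8   # cap at ~80% of the card

with tf.Session(config=config) as sess:
    pass  # build and run the training graph here
```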
Out of curiosity I've run the HIP test mentioned in #282. Observations are:
Thanks @seesturm, the information helps. Agree with you that the application shouldn't exit with a VMEM access fault.
I've now set up my workstation to boot into "multi-user.target". Then I logged into the machine remotely using secure shell, so the GPU should basically be unused for graphics. The problem is still present and training never succeeds: either the memory access fault occurs, or the script freezes.
@seesturm, thanks for the efforts!
Sorry, there is basically no (special) output. The only thing that can be seen is the usual Evicting/Restoring PASID messages. Here is the "final" output before freezing:
The only other information I can provide is that, according to "sensors", amdgpu-pci-4100 consumes about 35W during training and 20W when "frozen". When I terminate training, GPU consumption drops back to the normal idle of about 10W. While in the frozen state, the training process consumes about 10% of a single CPU core.
Usually I'm running the training while prime95 is running in the background (on the CPU). Now I tested without prime95 and training seems to run stably. Really strange, since from the CPU (1950X) point of view the system is 100% stable. Power consumption should also not be an issue, since the supply is rated for 850W.
The dmesg looks normal, and it's great to know the training is stable now.
Now I've run the training with an active Wayland graphics session. Training also completes successfully within the graphics session. Although I now have a workaround, I'm still asking myself whether my hardware is faulty, or whether there is a problem I should report against the ROCm software.
Hi @seesturm, can you try the following step and see if that fixes your issue?
Tried it now with the downgraded packages and P95 in the background. Still seeing the access fault.
Ran it now with the environment variables:

```
export HCC_SERIALIZE_KERNEL=0x3
export HCC_SERIALIZE_COPY=0x3
export HIP_TRACE_API=0x2
export MIOPEN_ENABLE_LOGGING_CMD=1
```

This time it did not cause an access fault but got stuck. The last 2700 debug output lines are attached here:
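For reference, a minimal sketch (not from the issue) of setting the same debug variables from inside the Python script rather than the shell; this assumes the HIP/HCC runtime reads them when it is initialized, which happens after TensorFlow is imported:

```python
# Equivalent to the exports above, set before TensorFlow (and the underlying
# HIP/HCC runtime) is loaded. Assumption: the runtime reads these variables
# at initialization time.
import os

os.environ["HCC_SERIALIZE_KERNEL"] = "0x3"
os.environ["HCC_SERIALIZE_COPY"] = "0x3"
os.environ["HIP_TRACE_API"] = "0x2"
os.environ["MIOPEN_ENABLE_LOGGING_CMD"] = "1"

import tensorflow as tf  # imported after the environment is prepared
```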
Just this moment I got another output with an access fault:
@seesturm, can you run only the cifar10 training, without P95 in the background?
It freezes even without P95; it is just less likely.
@seesturm, I cannot reproduce the failure locally using the downgraded OpenCL packages on Polaris GPUs:
@sunway513 I see. The best I can do is to report the behavior I'm getting and hope that it helps. It could well be that the cause of my observations is completely different from all the other reports.
Thanks @seesturm, appreciate your data points :-)
System information
Describe the current behavior
Training terminates early with message:
Memory access fault by GPU node-2 (Agent handle: 0x2a6c3e0) on address 0x120606e000. Reason: Page not present or supervisor privilege.
Describe the expected behavior
Training finishes without "Memory access fault".
Code to reproduce the issue
```
git clone https://github.com/tensorflow/models.git
cd models/tutorials/image/cifar10
python3 cifar10_train.py
```
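As a sketch of the shortened runs mentioned earlier in the thread, the training can also be launched from Python with a limited step count; this assumes cifar10_train.py exposes a max_steps flag (the stock tutorial script defines one, but treat the exact name as an assumption):

```python
# Hypothetical shortened run: limits the number of training steps so the run
# finishes before the point where the fault tends to appear.
import subprocess

subprocess.run(
    ["python3", "cifar10_train.py", "--max_steps=1000"],
    check=True,  # raise if the script exits with a non-zero status
)
```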
Other info / logs
dmesg.log
tf.log