Memory access fault by GPU node-1 (Agent handle: 0x2e0dbf0) on address 0x6dccc0000. Reason: Page not present or supervisor privilege. #302
Comments
Thanks for reporting the issue, @fendiwira. We'll take a look.
@parallelo / @sunway513 it seems quite a few recent issues raised are based on
@parallelo and mine, #297 (the neglected one) :)
The issue has been identified as a regression in the ROCm 2.0 user bits, affecting only Polaris; we will keep you posted here with further updates.
For future users who hit a similar fault: this error typically occurs with an out-of-bounds memory access on the GPU. The first step is to serialize all GPU kernels and copies, then dump out the names of the kernels being launched.
Often (but not always) the last printed kernel will be the one to investigate further; it might point to a numerical library or something else that can potentially be triaged with a smaller test case. More tips are listed here: https://rocm-documentation.readthedocs.io/en/latest/Other_Solutions/Other-Solutions.html
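For reference, one rough way to apply that advice is sketched below. This is only a sketch: the exact environment variables depend on the ROCm/HCC version, and the names used here are the ones associated with the HCC-based stack of this era, set before TensorFlow loads the runtime.

```python
# Sketch: serialize GPU kernels/copies and trace kernel launches before importing
# TensorFlow, so the ROCm runtime picks the settings up at library load time.
# Variable names assume the HCC-based stack of this era (newer runtimes differ).
import os

os.environ.setdefault("HCC_SERIALIZE_KERNEL", "0x3")  # wait before and after each kernel launch
os.environ.setdefault("HCC_SERIALIZE_COPY", "0x3")    # wait before and after each copy
os.environ.setdefault("HIP_TRACE_API", "0x2")         # print HIP API calls, including kernel launches

import tensorflow as tf  # import only after the variables are set

# ...build and run the failing model as usual; the last kernel printed before the
# "Memory access fault" line is usually the one to investigate.
```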
Thanks for the prompt response. Here I attach the last kernel printout:
Thanks, that's helpful. Next, would you be able to additionally run with the following:
Then, please send us the last section of the log.
OK, here is the result:
Hi @fendiwira, can you try the following step and see if it fixes your issue:
Hi @sunway513, thanks.
The most suspicious thing I've found is this:
The total allocated private size is 20 bytes, and this is accessing 20 bytes off the scratch wave offset. It's possible the base pointer here is negative, but as far as I can tell that isn't possible here.
Never mind; this only appears in my mangled version while trying to track down the fault point.
I am getting this error on my RX580 too. I have pared down my code to isolate the problem:
This fails with the memory access fault above. Interestingly, however, when
is replaced with
it succeeds without problems. EDIT: after downgrading to 1.2.0-2018111340 it works perfectly.
Hi @sunway513, it works, thank you!
@fendiwira thanks for the feedback! We'll update here when there's an official fix available.
The downgrade also works here. I had a similar problem while training an object detection model using Faster R-CNN Inception V2, but after the downgrade it worked again.
Same problem here on my RX480 when training a VGG16 network.
I put @eukaryote31's test on a gist for easier reproduction: https://gist.github.com/Bengt/2d4b8535c781ded2b9ce653cfe7b0eeb
I am reproducing using ROCm 2.1 and TensorFlow 1.12:
The test completes without error on CPU (Threadripper 1950X):
The test fails with the aforementioned memory access fault on the GPU:
The downgrade suggested by @sunway513 works for me too:
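For readers who cannot open the gist, a minimal test in the same spirit might look roughly like the snippet below. This is a sketch only, assuming a tiny Keras CNN trained on random data; it is not the exact code from @eukaryote31 or the gist.

```python
# Sketch of a minimal gfx803 reproduction (not the exact gist contents):
# a tiny Keras convolutional model fitted on random data with TensorFlow 1.12.
import numpy as np
import tensorflow as tf

x = np.random.rand(64, 32, 32, 3).astype(np.float32)
y = np.random.randint(0, 10, size=(64,))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# On affected ROCm 2.0/2.1 + gfx803 setups, a fit() call like this is where the
# "Memory access fault by GPU node-1" abort was reported; on CPU it completes.
model.fit(x, y, batch_size=32, epochs=1)
```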
The issue persists and the downgrade still fixes it with today's
This issue persists with
I have this same issue using an R9 Fury card, following the installation guide (https://rocm.github.io/tensorflow.html). The downgrade indeed fixed the issue, but a proper fix would be preferable. Let me know if you need anything (config details, tests, ...).
Hi all, we have included a set of OpenCL toolchain fixes for GFX803 targets in ROCm 2.5. In my local GFX803 setup with the ROCm 2.5 docker image, the VM fault is no longer reproducible using the reduced test from @Bengt.
Hello @sunway513, I tried the new image on an R9 Fury (non X) and am still getting this issue when running the following command:
BTW, I had to copy
Hi @gaetanbahl, VGG16 runs correctly on my local GFX803 setup using the ROCm 2.5 docker image. Regarding the concern about the gfx803 MIOpen perfDB, MIOpen by default provides the following performance databases:
I am using the docker image you mentioned.
Oh, I guess I should upgrade rock-dkms, sorry... I will upgrade and try again.
@sunway513 Indeed, I don't get the memory error anymore, only the .txt thing. Thanks for your help! Can you confirm that simply copying the
@gaetanbahl, thanks for the update :-)
Can confirm that the crash doesn't occur anymore on my RX 480. Thank you for your hard work!
Thank you @LithiumSR for confirming it!
I can confirm the test working under
I am not sure whether to open a new issue, because I am having the same issue but with gfx900 (Vega 64).
@urugn can you try the docker container:
Same problem with a miner on gfx900 (Vega FE).
Same problem on Vega M GH, setting
I ported the test to TensorFlow 2:
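A TensorFlow 2 port of the minimal test might look roughly like this sketch (same assumptions as the earlier snippet, not necessarily the commenter's actual code):

```python
# Sketch of a TensorFlow 2 port of the minimal test: the same tiny Keras CNN
# fitted on random data, using TF2-style losses.
import numpy as np
import tensorflow as tf

x = np.random.rand(64, 32, 32, 3).astype(np.float32)
y = np.random.randint(0, 10, size=(64,))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(x, y, batch_size=32, epochs=1)
```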
It still works with image
Same problem on a Radeon VII running custom HIP-ported code distributed via Ray. The code runs flawlessly without Ray. On NVIDIA there are no problems with the non-ported code and Ray.
This problem still exists when I use the latest rocm/tensorflow docker image. I have been trying since yesterday.
Another Radeon VII with the same issue (on AI Benchmark): MIOpen Error: /root/driver/MLOpen/src/gemm_v2.cpp:523: rocBlas error encountered. ROCm: 3.5.0
Can somebody rehost the Dropbox files from the fix that @sunway513 posted? They are no longer available and I cannot run the commands. Thanks!
I also tried to install AMDGPU-PRO, but OpenCL wasn't available. I was able to install ROCm and OpenCL is now detected, but I also get this error.
Yeah, I eventually figured that out. It turns out 3.8 is broken (at least for me), and after many hours trying to configure a docker container with the "apparently" working 2.5 downgrade, I ran into more compatibility issues with Python, since it uses Python 3.5. If apt-get hosted lower versions I could have just downgraded the version on my local machine. Anyway, I've decided to just use Colab now!
The OpenCL packages I posted last year can be found here:
@Extarys @spades1404, can you help create a new issue and provide the following information:
Hello guys,
I am having an issue running ROCm TensorFlow; details follow:
System information
I am trying to run these Keras/TensorFlow codebases:
Keras Mask RCNN : https://github.com/matterport/Mask_RCNN
Keras SSD : https://github.com/pierluigiferrari/ssd_keras
The GPU is recognized as:
name: Ellesmere [Radeon RX 470/480]
AMDGPU ISA: gfx803
memoryClockRate (GHz) 1.34
pciBusID 0000:01:00.0
Total memory: 8.00GiB
Free memory: 7.75GiB
Describe the current behavior
Epoch 1/30
2019-01-29 22:25:46.392668: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-01-29 22:25:46.446704: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
Memory access fault by GPU node-1 (Agent handle: 0x2e0dbf0) on address 0x6dccc0000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
Describe the expected behavior
Training should run normally through epoch 30/30.
Code to reproduce the issue
Keras Mask RCNN
python3 platno.py train --dataset=/home/path/to/dataset --weights=coco
I always get the error with a core dump, as in the message above.
Keras SSD
python3 ssd300_training.py
This can run normally when lowering the batch size from 32 to 8.
python3 ssd7_training.py
This gets a core dump even when lowering the batch size to 1.
Other info / logs
I have tried enabling some environment variables for debugging (see the sketch after this list), but I still get the error:
HSA_ENABLE_SDMA=0
HSA_ENABLE_INTERRUPT=0
HSA_SVM_GUARD_PAGES=0
HSA_DISABLE_CACHE=1
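One minimal way to apply these (a sketch only, equivalent to exporting them in the shell before launching the training script) is to set them before the ROCm runtime is loaded, i.e. before importing TensorFlow:

```python
# Sketch: apply the HSA debug variables from the list above before importing
# TensorFlow, so the runtime reads them at load time. Exporting them in the
# shell before running python3 works just as well.
import os

os.environ.update({
    "HSA_ENABLE_SDMA": "0",
    "HSA_ENABLE_INTERRUPT": "0",
    "HSA_SVM_GUARD_PAGES": "0",
    "HSA_DISABLE_CACHE": "1",
})

import tensorflow as tf  # imported only after the variables are set
```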
Please assist with resolving this problem.
Thanks and regards