ROCm-3.9, ROCm-3.10 crash with gfx803 #1269
Comments
I am getting this too ... OS: Ubuntu 18.04 LTS
Same here... (by the way, there is a typo in the word "Coudn't")
OS: Ubuntu 20.04.1 LTS. I followed the guide AMD provided, https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html, twice, both times on a fresh Ubuntu installation.
I noticed that each time you exit a Python interactive session in which TensorFlow was imported, it throws the exact same error. I also tried the guide https://www.videogames.ai/Install-ROCM-Machine-Learning-AMD-GPU along with the video (which is more complete) https://www.youtube.com/watch?v=fkSRkAoMS4g without any success. (It fails the same way when you try to run the
I am getting this too ... OS: Ubuntu 18.04 LTS
Same here:
Using the latest Docker image, I get the same error:
Hello there, good folks of GitHub! I have the exact same problem. OS: Ubuntu 20.04.1 LTS. I hate to sound negative, but things like these seriously make me want to give up techy things once and for all and just go become a professional shepherd...
Hi @xuhuisheng and others,
@rkothako is there any way we can help you further to solve these issues?
@rkothako any updates?
@AsimPoptani My advice is to downgrade to ROCm-3.5.1 with gfx803. There are other issues with ROCm-3.7 and ROCm-3.8 on gfx803. Please refer to #1265.
How does one downgrade? @xuhuisheng
Hi @AsimPoptani
This is what I did:
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.5.1/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
Selected version '3.5.1-34' (repo.radeon.com:3.5.1/Ubuntu 16.04 [amd64]) for 'rocm-dkms'
The following packages have unmet dependencies.
Hi @AsimPoptani
wget -q -O - http://repo.radeon.com/rocm/apt/3.5.1/rocm.gpg.key | sudo apt-key add -
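The two commands quoted above (adding the repository and adding its signing key) can be combined into one sketch of the downgrade. The helper names rocm_key_url and rocm_apt_line are hypothetical and exist only for illustration; the repository URL, the xenial suite, and the rocm-dkms package name are taken from this thread. The install steps are commented out because they change system state, and reports in this thread about whether they succeed are mixed.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ROCm) that build the strings used below.

# Print the URL of the repository signing key for a given ROCm release.
rocm_key_url() {
    printf 'http://repo.radeon.com/rocm/apt/%s/rocm.gpg.key\n' "$1"
}

# Print the apt source line for a given ROCm release.
rocm_apt_line() {
    printf 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/%s/ xenial main\n' "$1"
}

# Downgrade steps, left commented out because they modify the system:
#   wget -q -O - "$(rocm_key_url 3.5.1)" | sudo apt-key add -
#   rocm_apt_line 3.5.1 | sudo tee /etc/apt/sources.list.d/rocm.list
#   sudo apt update
#   sudo apt install rocm-dkms
```

Keeping the version as a parameter makes it easy to point the same steps at another release directory on repo.radeon.com, assuming that directory follows the same layout.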
@rkothako I will try this soon, hopefully, and will let you know if I succeed.
Hi @rkothako, I tried that. However, I get this:
@VegetaDTX success?
AMDGPU_TARGETS is marked as a cache string. After Dependencies.cmake is included, AMDGPU_TARGETS always takes the cached value gfx900;gfx906;gfx908, which means the AMDGPU_TARGETS value supplied by the user is never used. This caused ROCm 3.9 to crash on gfx803. ROCm/ROCm#1269
The rocSPARSE pull request has been merged. I checked it locally and it works.
Thank you @xuhuisheng
@angimenez
@xuhuisheng
@angimenez update: fixed ROCm/rocSPARSE@7de1594
I apologize for the delayed reply, but I was too busy with other stuff, and it's also quite inconvenient for me to try it on Ubuntu because I have a dual boot and most of my other ML work is not on Ubuntu. So far I haven't had any luck, but I haven't tried the latest advice by @xuhuisheng yet.
@VegetaDTX No problem, I have already tested it, and downgrading works as expected 😃
@Grench6 Thanks so much for the guide! I am so glad it works. I'll try it as soon as I get some time. I really need it for some of my projects!
@Grench6 I followed your guide; unfortunately, no success :( Here is what I got:
@AsimPoptani Were you able to run the benchmark? Or at least to perform the 5 + 2 operation with TF-rocm?
Hi @AsimPoptani, Looks like you are missing something.
I am getting this too ...
Same error with the latest 3.9.1 for me...
As this ticket is already closed, please open a new ticket with detailed steps to reproduce, so it can be discussed there.
I got a similar issue in one of the PyTorch Docker images provided at https://hub.docker.com/r/rocm/pytorch/tags
@mathmax12
Thanks a lot for that. I tried to change two CMake files according to the patch.
After running ./install.sh -di I got this:
Did I miss something?
@mathmax12 Or
Hey! Thanks for your guide. I got the following results after running:
rocm_clinfo.txt
I followed it step by step, but I couldn't get it working... When I run the following benchmark
Any idea why this might be? EDIT: I sorted the OOM problem out by following the solutions posted in this issue: tensorflow/tensorflow/issues/40751. However, the
Besides, the GPU bandwidth seems to be ridiculously small...
@borgarpa Syntax warnings of
sudo apt-get install rocm-bandwidth-test
rocm-bandwidth-test
@Grench6 Thanks for the tip. I ran the bandwidth test and this is the result:
Is it normal that the CPU cannot access the GPU in the Inter-Device Access test?
Hi All,
I know this is closed; I am only reporting the current status of the patch:
so the patch is still needed for this version ;)
@xuhuisheng your link in the description is broken; the correct one is https://github.com/xuhuisheng/rocm-build/blob/master/docs/gfx803.md
Updated links to info from xuhuisheng. Thanks xuhuisheng. I've not tried it yet, but you guys definitely left a trail of things to try.
If you install ROCm-3.9 or ROCm-3.10 with gfx803, TensorFlow or PyTorch will crash as soon as it starts running.
The error info is as follows:
OS: Ubuntu-20.04
CPU: Xeon 2620v3
GPU: RX580 8G (Polaris10) CHIP ID: 0x67df
Python: 3.8.5
Tensorflow-rocm: 2.3.1
hip sample run ok.
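To check whether a machine reports the gfx803 target discussed in this issue, the output of rocminfo can be filtered for gfx names. This is a minimal sketch, assuming rocminfo is installed at /opt/rocm/bin/rocminfo (the usual default location, not stated in this issue); list_gfx_targets is a hypothetical helper name, not a ROCm tool.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct gfx target names found in the
# text read from standard input, one per line.
list_gfx_targets() {
    grep -o 'gfx[0-9a-f][0-9a-f]*' | sort -u
}

# Usage, commented out because it needs a ROCm installation:
#   /opt/rocm/bin/rocminfo | list_gfx_targets
```

If gfx803 appears in the result, the card is one of those affected by the crash described here.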
UPDATE 2020-11-05: The cause is that rocSPARSE is not compiled for gfx803. After compiling rocSPARSE with AMDGPU_TARGETS=gfx803 and reinstalling the custom rocSPARSE package, the problem was solved.
It is a bug in the rocSPARSE CMake config: AMDGPU_TARGETS is never used.
The pull request has been merged: ROCm/rocSPARSE#213
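The rebuild described in the update above could look roughly like the following sketch. AMDGPU_TARGETS=gfx803 comes from this issue. The compiler path /opt/rocm/bin/hipcc, the build/release directory layout, and the helper name rocsparse_configure_cmd are assumptions made for illustration. The build steps are commented out because they need the ROCm toolchain.

```shell
#!/bin/sh
# Hypothetical helper: print the CMake configure command for one GPU target.
# AMDGPU_TARGETS is the variable named in this issue; the hipcc path is an
# assumption about a default ROCm installation.
rocsparse_configure_cmd() {
    printf 'CXX=/opt/rocm/bin/hipcc cmake -DAMDGPU_TARGETS=%s ../..\n' "$1"
}

# Rebuild steps, left commented out because they need ROCm and sudo:
#   git clone https://github.com/ROCm/rocSPARSE.git
#   cd rocSPARSE && mkdir -p build/release && cd build/release
#   eval "$(rocsparse_configure_cmd gfx803)"
#   make -j"$(nproc)"
#   sudo make install
```

On sources that predate ROCm/rocSPARSE#213, this issue indicates that the variable is ignored, so the fix from that pull request has to be present for the setting to take effect.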
#1265 is still there.
UPDATE 2020-11-21: I wrote a doc with details on the gfx803 issues.
https://github.com/xuhuisheng/rocm-build/blob/develop/docs/gfx803.md