
ROCm-3.9, ROCm-3.10 crash with gfx803 #1269

Closed
xuhuisheng opened this issue Oct 29, 2020 · 44 comments · Fixed by ROCm/rocSPARSE#213

Comments

@xuhuisheng
Contributor

xuhuisheng commented Oct 29, 2020

If you install ROCm-3.9 or ROCm-3.10 with gfx803, TensorFlow or PyTorch crashes at the very beginning of a run.
The error is as follows:

work@0b7758c3094d:~/test/examples/mnist$ python3 main.py
/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")
Aborted (core dumped)

OS: Ubuntu-20.04
CPU: Xeon 2620v3
GPU: RX580 8G (Polaris10) CHIP ID: 0x67df
Python: 3.8.5
Tensorflow-rocm: 2.3.1

The HIP sample runs OK:

work@0b7758c3094d:~/test$ make
/opt/rocm/hip/bin/hipcc  square.cpp -o square.out
work@0b7758c3094d:~/test$ ./square.out
info: running on device Device 67df
info: allocate host mem (  7.63 MB)
info: allocate device mem (  7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!

UPDATE 2020-11-05: The cause is that rocSPARSE is not compiled for gfx803. After compiling rocSPARSE with AMDGPU_TARGETS=gfx803 and reinstalling the custom rocSPARSE package, the problem is solved.

It is a bug in the rocSPARSE CMake config: the AMDGPU_TARGETS variable was never used.

The pull request has been merged: ROCm/rocSPARSE#213

#1265 is still there.

UPDATE 2020-11-21: I wrote a doc with details on the gfx803 issues:
https://github.com/xuhuisheng/rocm-build/blob/develop/docs/gfx803.md
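
For anyone who wants to check and apply the fix locally, a minimal sketch based on the steps above and later in this thread (the repository URL, tag name, and library path are assumptions; adjust them to your setup, and apply the ROCm/rocSPARSE#213 change first if the tag you build does not already contain it):

# Rough check of which GPU targets the installed rocSPARSE library carries:
strings /opt/rocm/rocsparse/lib/librocsparse.so | grep -o 'gfx[0-9]*' | sort -u

# Rebuild rocSPARSE for gfx803 and reinstall:
git clone -b rocm-3.9.0 https://github.com/ROCmSoftwarePlatform/rocSPARSE.git
cd rocSPARSE
mkdir -p build/release && cd build/release
CXX=/opt/rocm/bin/hipcc cmake -DAMDGPU_TARGETS=gfx803 ../..
make -j$(nproc)
sudo make install

If gfx803 is missing from the first command's output, that matches the hipErrorNoBinaryForGpu abort above.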

@AsimPoptani

I am getting this too ...

OS: Ubuntu 18.04 LTS
CPU: i7 6700k
GPU: RX480
Tf-R 2.3.1

@Grench6

Grench6 commented Oct 30, 2020

Same here... (btw, there is a typo in the word "Coudn't")

/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!") Aborted (core dumped)

OS: Ubuntu-20.04.1 LTS
CPU: Intel i3-6100
GPU: RX580
Python: 3.8.5
Tensorflow-rocm: 2.3.1

I followed the guide AMD provided https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html twice, both times in a fresh Ubuntu installation.

rocminfo and clinfo seem to be working properly; I will attach the command output below:
rocminfo.pdf
clinfo.pdf

I noticed that each time you exit a Python interactive session where TensorFlow was imported, it throws the exact same error:
python_import_TF.pdf

I also tried the guide https://www.videogames.ai/Install-ROCM-Machine-Learning-AMD-GPU along with the video (which is more complete) https://www.youtube.com/watch?v=fkSRkAoMS4g without any success. (It fails the same way when you try to run the tf_cnn_benchmarks.py script)
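
For a quicker smoke test than the full benchmark, a generic TensorFlow device query (not taken from this thread) should hit the same initialization path and abort with the same hipErrorNoBinaryForGpu message on an affected install:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If this prints the GPU instead of aborting, the installed ROCm libraries do contain gfx803 code objects.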

@angimenez

I am getting this too ...

OS: Ubuntu 18.04 LTS
CPU: Ryzen 7 2700x
GPU: RX580
Tf-R 2.3.2

@Djip007

Djip007 commented Oct 31, 2020

Same here:
OS: CentOS 8.2
CPU: Ryzen 9 3900x
GPU: RX480
Tf-R 2.3.2

I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
/data/jenkins_workspace/centos_pipeline_job_8.1_rel-3.9/rocm-rel-3.9/rocm-3.9-17-20201021/8.1/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!"

@Djip007

Djip007 commented Oct 31, 2020

Using the latest Docker image, I have the same error:

I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

@VegetaDTX

VegetaDTX commented Nov 1, 2020

Hello there, good folks of GitHub! I have the exact same problem.
I spent half of the day setting up the environment for GPU-powered TensorFlow projects, and I get this error at the end that I just can't seem to find a solution to :(

OS: Ubuntu 20.04.1 LTS
CPU: AMD® Ryzen 5 1600x six-core processor × 12
GPU: Radeon RX 570 Series (POLARIS10, DRM 3.40.0, 5.4.0-52-generic, LLVM 10.0.0)
Tensorflow-rocm version: 2.3.2

I hate to sound negative, but things like these seriously make me want to give up techy things once and for all and just go become a professional shepherd...

@rkothako

rkothako commented Nov 2, 2020

Hi @xuhuisheng and others,
Thanks for the issue. Let me check on this.

@xuhuisheng
Contributor Author

@rkothako Thank you for replying. Please also check issue #1265: ROCm-3.7 and ROCm-3.8 cannot run correctly on gfx803, while ROCm-3.9 crashes completely on gfx803.

@VegetaDTX

VegetaDTX commented Nov 2, 2020

@rkothako, thanks for your response. I confirm that, just like @Grench6, I have tried running tf_cnn_benchmarks.py and got the same error there.

@AsimPoptani

@rkothako is there any way we can help you further to solve these issues?

@AsimPoptani

@rkothako any updates?

@xuhuisheng
Contributor Author

@AsimPoptani My advice is to downgrade to ROCm-3.5.1 for gfx803. There are other issues with ROCm-3.7 and ROCm-3.8 on gfx803; please refer to #1265.
The response from AMD won't come quickly.

@AsimPoptani

AsimPoptani commented Nov 3, 2020

How does one downgrade? @xuhuisheng

@rkothako

rkothako commented Nov 3, 2020

Hi @AsimPoptani
Please use the repository at http://repo.radeon.com/rocm/apt/3.5.1/ to install ROCm 3.5.1.

@AsimPoptani

This is what I did:

sudo apt autoremove rocm-opencl rocm-dkms rocm-dev rocm-utils
wget -q -O - http://repo.radeon.com/rocm/apt/3.5.1/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.5.1/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-dkms=3.5.*

However I get:

Selected version '3.5.1-34' (repo.radeon.com:3.5.1/Ubuntu 16.04 [amd64]) for 'rocm-dkms'
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
rocm-dkms : Depends: rocm-dev but it is not going to be installed

@rkothako

rkothako commented Nov 3, 2020

Hi @AsimPoptani
On a clean machine (or one from which ROCm has been properly uninstalled), follow the steps below to install ROCm 3.5.1.

wget -q -O - http://repo.radeon.com/rocm/apt/3.5.1/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.5.1/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-dkms
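
If a previous partial install is what causes the unmet-dependency error above, a possible cleanup before re-running these steps (the package list is extrapolated from the autoremove command earlier in this thread; adjust it to what is actually installed):

sudo apt autoremove --purge rocm-opencl rocm-dkms rocm-dev rocm-utils rocm-libs
sudo rm -f /etc/apt/sources.list.d/rocm.list   # drop any old ROCm repo entry
sudo apt update

After that, the wget/echo/apt install sequence above can be run again.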

@VegetaDTX

VegetaDTX commented Nov 3, 2020

@rkothako I will try this soon hopefully and will let you know if I succeeded.

@AsimPoptani

Hi @rkothako, I tried that... However, I get this:

>>> tf.Variable('x',1)
2020-11-03 17:07:32.586903: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-03 17:07:32.624439: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3999980000 Hz
2020-11-03 17:07:32.625269: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb07a7352b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-03 17:07:32.625311: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-11-03 17:07:32.629109: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb07a7cb590 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
2020-11-03 17:07:32.629154: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], AMDGPU ISA version: gfx803
2020-11-03 17:07:32.629339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] ROCm AMD GPU ISA: gfx803 coreClock: 1.279GHz coreCount: 36 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 0B/s
2020-11-03 17:07:32.629406: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocblas.so
2020-11-03 17:07:32.629447: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libMIOpen.so
2020-11-03 17:07:32.629485: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocfft.so
2020-11-03 17:07:32.629523: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocrand.so
2020-11-03 17:07:32.629635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-03 17:07:33.418980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-03 17:07:33.419003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-11-03 17:07:33.419007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-11-03 17:07:33.419120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7399 MB memory) -> physical GPU (device: 0, name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], pci bus id: 0000:01:00.0)
[1] 108049 segmentation fault (core dumped) python3

@Grench6

Grench6 commented Nov 4, 2020

@VegetaDTX success?

ntrost57 pushed a commit to ROCm/rocSPARSE that referenced this issue Nov 5, 2020

AMDGPU_TARGETS marked as cache string.
After including Dependencies.cmake, AMDGPU_TARGETS always gets the cached value gfx900;gfx906;gfx908, which means a user-specified AMDGPU_TARGETS is never used.
This caused ROCm 3.9 to crash on gfx803. ROCm/ROCm#1269
@xuhuisheng
Contributor Author

The rocSPARSE pull request has been merged; I verified it locally.
I will close this issue and hope the fix is released soon.

@angimenez

Thank you @xuhuisheng
I've downloaded the rocsparse repository and compiled it (including the change you've made), but TensorFlow still doesn't work.
Commands:
sudo rm -rf /opt/rocm/rocsparse
mkdir -p build/release; cd build/release
CXX=/opt/rocm/bin/hipcc cmake -DBUILD_CLIENTS_TESTS=ON ../..
make
sudo make install
...
The installation was successful, but I still get the same error in TensorFlow.
Are the steps I followed correct for compiling rocsparse? Thanks a lot

@xuhuisheng
Contributor Author

xuhuisheng commented Nov 6, 2020

@angimenez
Please use the rocm-3.9.0 tag or the rocm-3.10.x branch; the develop branch is a work in progress.
I will check the latest code later.
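
For an existing clone, switching off develop might look like this (tag and branch names as given above):

cd rocSPARSE
git fetch --all --tags
git checkout rocm-3.9.0    # or: git checkout rocm-3.10.x

Then rebuild and reinstall as before.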

@angimenez

@xuhuisheng
Now it worked for me, thank you very much

@xuhuisheng
Contributor Author

xuhuisheng commented Nov 6, 2020

@angimenez
Found a comment about gfx803 on the rocSPARSE develop branch: ROCm/rocSPARSE@f8791e9#commitcomment-43334853
We should wait for AMD to fix it.

Update: fixed in ROCm/rocSPARSE@7de1594

@VegetaDTX

@Grench6

@VegetaDTX success?

I apologize for the delayed reply, but I was too busy with other stuff, and it's also quite inconvenient for me to try it on Ubuntu because I have a dual boot and most of my other ML work is not on Ubuntu. So far I haven't had luck, but I haven't tried the latest advice from @xuhuisheng yet.
When I get enough time to try it, I'll post my experience here.

@Grench6

Grench6 commented Nov 11, 2020

@VegetaDTX No problem, I have already tested it, and downgrading works as expected 😃
I wrote a mini-guide here on how to downgrade and install ROCm and tensorflow-rocm, and test it with a benchmark.

@VegetaDTX

@Grench6 Thanks so much for the guide! I am so glad it works. I'll try it as soon as I get some time. I really need it for some of my projects!

@AsimPoptani

AsimPoptani commented Nov 11, 2020

@Grench6 I followed your guide; unfortunately, no success :(

Here is what I got :

gdb python3                                                                                           SIGSEGV(11) ↵  10109  22:38:24
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(No debugging symbols found in python3)
(gdb) run test
Starting program: /usr/bin/python3 test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 8629]
[New Thread 0x7fff9e77c700 (LWP 8630)]
[New Thread 0x7fff9df7b700 (LWP 8631)]
[New Thread 0x7fff9b77a700 (LWP 8632)]
[New Thread 0x7fff96f79700 (LWP 8633)]
[New Thread 0x7fff94778700 (LWP 8634)]
[New Thread 0x7fff91f77700 (LWP 8635)]
[New Thread 0x7fff8f776700 (LWP 8636)]
[New Thread 0x7fff8be58700 (LWP 8637)]
2020-11-11 22:38:47.726848: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
[New Thread 0x7fff8b259700 (LWP 8640)]
2020-11-11 22:38:48.176494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMD GPU ISA: gfx803
coreClock: 1.279GHz coreCount: 36 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 0B/s
2020-11-11 22:38:48.442703: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocblas.so
2020-11-11 22:38:48.458048: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libMIOpen.so
2020-11-11 22:38:48.488252: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocfft.so
2020-11-11 22:38:48.490385: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocrand.so
2020-11-11 22:38:48.490462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-11 22:38:48.490649: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[New Thread 0x7fff838eb700 (LWP 8643)]
[New Thread 0x7fff830ea700 (LWP 8644)]
[New Thread 0x7fff828e9700 (LWP 8645)]
[New Thread 0x7fff820e8700 (LWP 8646)]
[New Thread 0x7fff818e7700 (LWP 8647)]
[New Thread 0x7fff810e6700 (LWP 8648)]
[New Thread 0x7fff808e5700 (LWP 8649)]
[New Thread 0x7fff37d67700 (LWP 8650)]
[New Thread 0x7fff37566700 (LWP 8651)]
[New Thread 0x7fff36d65700 (LWP 8652)]
[New Thread 0x7fff36564700 (LWP 8653)]
2020-11-11 22:38:48.495101: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3999980000 Hz
[Thread 0x7fff36d65700 (LWP 8652) exited]
[New Thread 0x7fff36d65700 (LWP 8654)]
[New Thread 0x7fff35d63700 (LWP 8655)]
[New Thread 0x7fff35562700 (LWP 8656)]
[New Thread 0x7fff34d61700 (LWP 8657)]
[New Thread 0x7ffefffff700 (LWP 8658)]
[New Thread 0x7ffeff7fe700 (LWP 8659)]
[New Thread 0x7ffefeffd700 (LWP 8660)]
[New Thread 0x7ffefe7fc700 (LWP 8661)]
2020-11-11 22:38:48.495740: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fff8807ea10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-11 22:38:48.495751: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
[New Thread 0x7ffefdffb700 (LWP 8662)]
[New Thread 0x7ffefd7fa700 (LWP 8663)]
[New Thread 0x7ffefcff9700 (LWP 8664)]
[New Thread 0x7ffedbfff700 (LWP 8665)]
2020-11-11 22:38:48.496950: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fff3446bc10 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
[Thread 0x7ffefcff9700 (LWP 8664) exited]
2020-11-11 22:38:48.496960: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], AMDGPU ISA version: gfx803
2020-11-11 22:38:48.497021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMD GPU ISA: gfx803
coreClock: 1.279GHz coreCount: 36 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 0B/s
2020-11-11 22:38:48.497048: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocblas.so
2020-11-11 22:38:48.497058: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libMIOpen.so
2020-11-11 22:38:48.497066: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocfft.so
2020-11-11 22:38:48.497077: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocrand.so
2020-11-11 22:38:48.497111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-11 22:38:49.278623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 22:38:49.278649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2020-11-11 22:38:49.278654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2020-11-11 22:38:49.278779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7399 MB memory) -> physical GPU (device: 0, name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], pci bus id: 0000:01:00.0)
[New Thread 0x7ffefcff9700 (LWP 8666)]
[Thread 0x7ffefcff9700 (LWP 8666) exited]
[New Thread 0x7fff800e4700 (LWP 8667)]
[New Thread 0x7fff800a0700 (LWP 8668)]
[New Thread 0x7ffefcff9700 (LWP 8669)]
[New Thread 0x7ffeda5c5700 (LWP 8670)]
[New Thread 0x7ffed9dc4700 (LWP 8671)]
[New Thread 0x7ffed95c3700 (LWP 8672)]
[New Thread 0x7ffed8dc2700 (LWP 8673)]
[New Thread 0x7ffebbfff700 (LWP 8674)]
[New Thread 0x7ffeb37fe700 (LWP 8675)]
[New Thread 0x7ffebb7fe700 (LWP 8676)]
[New Thread 0x7ffebaffd700 (LWP 8677)]
[New Thread 0x7ffeba7fc700 (LWP 8678)]
[New Thread 0x7fff8005c700 (LWP 8679)]
[New Thread 0x7fff34460700 (LWP 8680)]
--Type <RET> for more, q to quit, c to continue without paging--

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fffc155bfb0 in device::WaveLimiterManager::getWavesPerSH(device::VirtualDevice const*) const () from /usr/lib/libamdhip64.so.3

@Grench6

Grench6 commented Nov 12, 2020

@AsimPoptani Were you able to run the benchmark? Or at least do the 5 + 2 operation with TF-rocm?

@rkothako

Hi @AsimPoptani, it looks like you are missing something.
Please share the step-by-step procedure you followed.

@cmal

cmal commented Nov 12, 2020

I am getting this too ...
OS: Ubuntu-20.04
CPU: i7 4970
GPU: RX580 8G
Python: 3.6.12

@Djip007

Djip007 commented Nov 14, 2020

/data/jenkins_workspace/centos_pipeline_job_8.1_rel-3.9/rocm-rel-3.9/rocm-3.9-19-20201111/8.1/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

I get the same error with the latest 3.9.1...

@rkothako

As this ticket is already closed, please open a new ticket with all the detailed steps to reproduce, and we can discuss it there.
Thank you.

@mathmax12

mathmax12 commented Nov 20, 2020

@xuhuisheng

UPDATE 2020-11-05: The cause is that rocSPARSE is not compiled for gfx803. After compiling rocSPARSE with AMDGPU_TARGETS=gfx803 and reinstalling the custom rocSPARSE package, the problem is solved.

It is a bug in the rocSPARSE CMake config: the AMDGPU_TARGETS variable was never used.

The pull request has been merged: ROCmSoftwarePlatform/rocSPARSE#213

#1265 is still there.

I got a similar issue in one of the PyTorch Docker images from https://hub.docker.com/r/rocm/pytorch/tags
Could you please show the command you used to compile rocSPARSE?
Thanks

@xuhuisheng
Contributor Author

@mathmax12
Since the issue has been closed, I wrote a doc with details on the gfx803 issue:
https://github.com/xuhuisheng/rocm-build/blob/develop/docs/gfx803.md

@mathmax12

mathmax12 commented Nov 21, 2020

Thanks a lot for that.

I tried to change the two CMake files according to the patch.
rocSPARSE branch: rocm3.9x
/library/CMakeLists.txt

# Target compile options
foreach(target ${AMDGPU_TARGETS})
  target_compile_options(rocsparse PRIVATE --amdgpu-target=${target}:xnack-)
endforeach()

After running ./install.sh -di I got this:


CMake Error at /opt/cmake-3.18.1-Linux-x86_64/share/cmake-3.18/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
  Could not find compiler set in environment variable CXX:

  hipcc.

Call Stack (most recent call first):
  CMakeLists.txt:54 (project)


CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_Fortran_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!

Did I miss something?

@xuhuisheng
Contributor Author

@mathmax12
You have to install rocm-dev and rocm-libs first.
Then CMake will find CXX=/opt/rocm/bin/hipcc automatically.

Or run CXX=/opt/rocm/bin/hipcc ./install.sh -di to specify the hipcc path.
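
Put together, the prerequisite step might look like this on Ubuntu (assuming the ROCm apt repository is already configured; package names as given above):

sudo apt update
sudo apt install rocm-dev rocm-libs
# then, from the rocSPARSE checkout:
CXX=/opt/rocm/bin/hipcc ./install.sh -di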

@borgarpa

borgarpa commented Nov 25, 2020

@VegetaDTX No problem, I have already tested it, and downgrading works as expected
I wrote a mini-guide here on how to downgrade and install ROCm and tensorflow-rocm, and test it with a benchmark.

Hey! Thanks for your guide.

I got the following results after running:

/opt/rocm/bin/rocminfo
/opt/rocm/opencl/bin/clinfo

rocm_clinfo.txt
rocm_rocminfo.txt

I followed it step by step, but I couldn't get it working... When I run the benchmark python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8 --model=resnet50 I get this error: rocm_OOM.txt, which is quite weird since an 8 GB GPU should be able to handle a ResNet50.
Furthermore, when I run rocm-smi, I get this weird result:

/opt/rocm/bin/rocm-smi:816: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if clocktype is 'freq':
/opt/rocm/bin/rocm-smi:901: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if component is 'driver':
/opt/rocm/bin/rocm-smi:923: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if (retiredType is 'all' or \
/opt/rocm/bin/rocm-smi:924: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'retired' and pgType is 'R' or \
/opt/rocm/bin/rocm-smi:924: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'retired' and pgType is 'R' or \
/opt/rocm/bin/rocm-smi:925: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'pending' and pgType is 'P' or \
/opt/rocm/bin/rocm-smi:925: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'pending' and pgType is 'P' or \
/opt/rocm/bin/rocm-smi:926: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'unreservable' and pgType is 'F'):
/opt/rocm/bin/rocm-smi:926: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'unreservable' and pgType is 'F'):
/opt/rocm/bin/rocm-smi:1501: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if component is 'driver':
/opt/rocm/bin/rocm-smi:1938: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ptype is 'R':
/opt/rocm/bin/rocm-smi:1940: SyntaxWarning: "is" with a literal. Did you mean "=="?
  elif ptype is 'P':
/opt/rocm/bin/rocm-smi:2395: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if clkType is 'sclk':
/opt/rocm/bin/rocm-smi:2397: SyntaxWarning: "is" with a literal. Did you mean "=="?
  elif clkType is 'mclk':


========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr   SCLK     MCLK     Fan     Perf  PwrCap  VRAM%  GPU%  
0    34.0c  41.028W  1233Mhz  1000Mhz  23.92%  auto  120.0W   93%   0%    
================================================================================
==============================End of ROCm SMI Log ==============================

Any idea why this might be?

EDIT: I sorted the OOM problem out by following the solutions posted in tensorflow/tensorflow/issues/40751. However, the weird rocm-smi behaviour still remains.

Besides, the reported GPU memory bandwidth seems to be ridiculously small... coreClock: 1.268GHz coreCount: 32 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: -1B/s. Is that normal?
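
For reference, one widely used mitigation for TensorFlow GPU OOM is letting TensorFlow grow its GPU memory allocation on demand instead of reserving almost all VRAM up front; this may or may not be the exact fix from the linked issue:

# Re-run the benchmark with on-demand VRAM allocation:
TF_FORCE_GPU_ALLOW_GROWTH=true python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8 --model=resnet50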

@Grench6

Grench6 commented Nov 25, 2020

@borgarpa The rocm-smi syntax warnings are normal (at least for me). I don't remember if that bandwidth is normal, tbh... maybe you should try this to test it out:

sudo apt-get install rocm-bandwidth-test
rocm-bandwidth-test

@borgarpa

@Grench6 Thanks for the tip. I ran the bandwidth test and this is the result:

........
          RocmBandwidthTest Version: 2.3.11

          Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)


          Device: 0,  AMD Ryzen 5 1600X Six-Core Processor
          Device: 1,  Ellesmere [Radeon RX 470/480/570/570X/580/580X/590],  1f:0.0

          Inter-Device Access

          D/D       0         1         

          0         1         0         

          1         1         1         


          Inter-Device Numa Distance

          D/D       0         1         

          0         0         N/A       

          1         20        0         


          Unidirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         11.238147   

          1         7.108366    25.104243   


          Bdirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         14.147676   

          1         14.147676   N/A

Is it normal that the CPU cannot access the GPU in the Inter-Device Access test?

@ROCmSupport

Hi All,
As this ticket is already closed, we recommend not continuing the discussion here.
Please file any other issue as a separate ticket.
Thanks for understanding.

@Djip007

Djip007 commented Dec 2, 2020

I know this is closed; I'm only reporting the current status of the patch.
With CentOS 8 + ROCm 3.10:

/data/jenkins_workspace/centos_pipeline_job_8.1_rel-3.10/rocm-rel-3.10/rocm-3.10-27-20201120/8.1/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

So the patch is still needed for this version ;)

xuhuisheng changed the title from "ROCm-3.9 crash with gfx803" to "ROCm-3.9, ROCm-3.10 crash with gfx803" on Dec 10, 2020
@staticdev

@xuhuisheng your link in the description is broken; the correct one is https://github.com/xuhuisheng/rocm-build/blob/master/docs/gfx803.md

@da3dsoul

Updated links to info from xuhuisheng. Thanks xuhuisheng. I've not tried it yet, but you guys definitely left a trail of things to try.
https://github.com/xuhuisheng/rocm-build/tree/master/gfx803
https://github.com/xuhuisheng/rocm-gfx803
