
ROCm-3.9, ROCm-3.10 crash with gfx803 #1269

Closed
xuhuisheng opened this issue Oct 29, 2020 · 44 comments · Fixed by ROCm/rocSPARSE#213

Comments

@xuhuisheng
Contributor

xuhuisheng commented Oct 29, 2020

If you install ROCm-3.9 or ROCm-3.10 with gfx803, TensorFlow or PyTorch crashes at the very beginning of a run.
The error is as follows:

work@0b7758c3094d:~/test/examples/mnist$ python3 main.py
/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")
Aborted (core dumped)

OS: Ubuntu-20.04
CPU: Xeon 2620v3
GPU: RX580 8G (Polaris10) CHIP ID: 0x67df
Python: 3.8.5
Tensorflow-rocm: 2.3.1

The HIP sample runs OK:

work@0b7758c3094d:~/test$ make
/opt/rocm/hip/bin/hipcc  square.cpp -o square.out
work@0b7758c3094d:~/test$ ./square.out
info: running on device Device 67df
info: allocate host mem (  7.63 MB)
info: allocate device mem (  7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!

UPDATE 2020-11-05: The cause is that rocSPARSE is not compiled for gfx803. After compiling rocSPARSE with AMDGPU_TARGETS=gfx803 and reinstalling the custom rocSPARSE package, the problem is solved.

It is a bug in the rocSPARSE CMake config: the AMDGPU_TARGETS variable was never used.

The pull request has been merged: ROCm/rocSPARSE#213

#1265 is still there.

UPDATE 2020-11-21: I wrote a doc with details on the gfx803 issues:
https://github.com/xuhuisheng/rocm-build/blob/develop/docs/gfx803.md
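
For anyone who wants to check and apply the fix locally, a minimal sketch based on the steps above and later in this thread (the repository URL, tag name, and library path are assumptions; adjust them to your setup, and apply the ROCm/rocSPARSE#213 change first if the tag you build does not already contain it):

# Rough check of which GPU targets the installed rocSPARSE library carries:
strings /opt/rocm/rocsparse/lib/librocsparse.so | grep -o 'gfx[0-9]*' | sort -u

# Rebuild rocSPARSE for gfx803 and reinstall:
git clone -b rocm-3.9.0 https://github.com/ROCmSoftwarePlatform/rocSPARSE.git
cd rocSPARSE
mkdir -p build/release && cd build/release
CXX=/opt/rocm/bin/hipcc cmake -DAMDGPU_TARGETS=gfx803 ../..
make -j$(nproc)
sudo make install

If gfx803 is missing from the first command's output, that matches the hipErrorNoBinaryForGpu abort above.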

@AsimPoptani

I am getting this too ...

OS: Ubuntu 18.04 LTS
CPU: i7 6700k
GPU: RX480
Tf-R 2.3.1

@Grench6

Grench6 commented Oct 30, 2020

Same here... (btw, there is a typo in the word "Coudn't")

/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!") Aborted (core dumped)

OS: Ubuntu-20.04.1 LTS
CPU: Intel i3-6100
GPU: RX580
Python: 3.8.5
Tensorflow-rocm: 2.3.1

I followed the guide AMD provided https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html twice, both times in a fresh Ubuntu installation.

rocminfo and clinfo seem to be working properly; I will attach the command output below:
rocminfo.pdf
clinfo.pdf

I noticed that each time you exit a Python interactive session where TensorFlow was imported, it throws the exact same error:
python_import_TF.pdf

I also tried the guide https://www.videogames.ai/Install-ROCM-Machine-Learning-AMD-GPU along with the video (which is more complete) https://www.youtube.com/watch?v=fkSRkAoMS4g without any success. (It fails the same way when you try to run the tf_cnn_benchmarks.py script)
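
For a quicker smoke test than the full benchmark, a generic TensorFlow device query (not taken from this thread) should hit the same initialization path and abort with the same hipErrorNoBinaryForGpu message on an affected install:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If this prints the GPU instead of aborting, the installed ROCm libraries do contain gfx803 code objects.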

@angimenez

I am getting this too ...

OS: Ubuntu 18.04 LTS
CPU: Ryzen 7 2700x
GPU: RX580
Tf-R 2.3.2

@Djip007

Djip007 commented Oct 31, 2020

Same here:
OS: CentOS 8.2
CPU: Ryzen 9 3900x
GPU: RX480
Tf-R 2.3.2

I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
/data/jenkins_workspace/centos_pipeline_job_8.1_rel-3.9/rocm-rel-3.9/rocm-3.9-17-20201021/8.1/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!"

@Djip007

Djip007 commented Oct 31, 2020

Using the latest Docker image, I have the same error:

I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

@VegetaDTX

VegetaDTX commented Nov 1, 2020

Hello there, good folks of GitHub! I have the exact same problem.
I spent half of the day setting up the environment for GPU-powered TensorFlow projects, and I get this error at the end that I just can't seem to find a solution to :(

OS: Ubuntu 20.04.1 LTS
CPU: AMD® Ryzen 5 1600x six-core processor × 12
GPU: Radeon RX 570 Series (POLARIS10, DRM 3.40.0, 5.4.0-52-generic, LLVM 10.0.0)
Tensorflow-rocm version: 2.3.2

I hate to sound negative, but things like these seriously make me want to give up techy things once and for all and just go become a professional shepherd...

@rkothako

rkothako commented Nov 2, 2020

Hi @xuhuisheng and others,
Thanks for the issue. Let me check on this.

@xuhuisheng
Contributor Author

@rkothako Thank you for replying. Please also check issue #1265: ROCm-3.7 and ROCm-3.8 cannot run correctly on gfx803, while ROCm-3.9 crashes completely on gfx803.

@VegetaDTX

VegetaDTX commented Nov 2, 2020

@rkothako, thanks for your response. I confirm that, just like @Grench6, I have tried running tf_cnn_benchmarks.py and got the same error there.

@AsimPoptani

@rkothako is there any way we can help you further to solve these issues?

@AsimPoptani

@rkothako any updates?

@xuhuisheng
Contributor Author

@AsimPoptani My advice is to downgrade to ROCm-3.5.1 for gfx803. There are other issues with ROCm-3.7 and ROCm-3.8 on gfx803; please refer to #1265.
The response from AMD won't come quickly.

@AsimPoptani

AsimPoptani commented Nov 3, 2020

How does one downgrade? @xuhuisheng

@rkothako

rkothako commented Nov 3, 2020

Hi @AsimPoptani
Please use the repository at http://repo.radeon.com/rocm/apt/3.5.1/ to install ROCm 3.5.1.

@AsimPoptani

This is what I did:

sudo apt autoremove rocm-opencl rocm-dkms rocm-dev rocm-utils
wget -q -O - http://repo.radeon.com/rocm/apt/3.5.1/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.5.1/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-dkms=3.5.*

However I get:

Selected version '3.5.1-34' (repo.radeon.com:3.5.1/Ubuntu 16.04 [amd64]) for 'rocm-dkms'
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
rocm-dkms : Depends: rocm-dev but it is not going to be installed

@rkothako

rkothako commented Nov 3, 2020

Hi @AsimPoptani
On a clean machine (or one from which ROCm has been properly uninstalled), follow the steps below to install ROCm 3.5.1.

wget -q -O - http://repo.radeon.com/rocm/apt/3.5.1/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.5.1/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-dkms
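
If a previous partial install is what causes the unmet-dependency error above, a possible cleanup before re-running these steps (the package list is extrapolated from the autoremove command earlier in this thread; adjust it to what is actually installed):

sudo apt autoremove --purge rocm-opencl rocm-dkms rocm-dev rocm-utils rocm-libs
sudo rm -f /etc/apt/sources.list.d/rocm.list   # drop any old ROCm repo entry
sudo apt update

After that, the wget/echo/apt install sequence above can be run again.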

@VegetaDTX

VegetaDTX commented Nov 3, 2020

@rkothako I will try this soon hopefully and will let you know if I succeeded.

@AsimPoptani

Hi @rkothako, I tried that... However, I get this:

>>> tf.Variable('x',1)
2020-11-03 17:07:32.586903: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-03 17:07:32.624439: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3999980000 Hz
2020-11-03 17:07:32.625269: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb07a7352b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-03 17:07:32.625311: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-11-03 17:07:32.629109: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb07a7cb590 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
2020-11-03 17:07:32.629154: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], AMDGPU ISA version: gfx803
2020-11-03 17:07:32.629339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] ROCm AMD GPU ISA: gfx803 coreClock: 1.279GHz coreCount: 36 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 0B/s
2020-11-03 17:07:32.629406: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocblas.so
2020-11-03 17:07:32.629447: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libMIOpen.so
2020-11-03 17:07:32.629485: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocfft.so
2020-11-03 17:07:32.629523: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocrand.so
2020-11-03 17:07:32.629635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-03 17:07:33.418980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-03 17:07:33.419003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-11-03 17:07:33.419007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-11-03 17:07:33.419120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7399 MB memory) -> physical GPU (device: 0, name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], pci bus id: 0000:01:00.0)
[1] 108049 segmentation fault (core dumped) python3

@Grench6

Grench6 commented Nov 4, 2020

@VegetaDTX success?

ntrost57 pushed a commit to ROCm/rocSPARSE that referenced this issue Nov 5, 2020

AMDGPU_TARGETS marked as cache string.
After including Dependencies.cmake, AMDGPU_TARGETS always gets the cached value gfx900;gfx906;gfx908, which means a user-specified AMDGPU_TARGETS is never used.
This caused ROCm 3.9 to crash on gfx803. ROCm/ROCm#1269
@xuhuisheng
Contributor Author

The rocSPARSE pull request has been merged; I verified it locally.
I will close this issue and hope the fix is released soon.

@angimenez

Thank you @xuhuisheng
I've downloaded the rocsparse repository and compiled it (including the change you've made), but TensorFlow still doesn't work.
Commands:
sudo rm -rf /opt/rocm/rocsparse
mkdir -p build/release; cd build/release
CXX=/opt/rocm/bin/hipcc cmake -DBUILD_CLIENTS_TESTS=ON ../..
make
sudo make install
...
The installation was successful, but I still get the same error in TensorFlow.
Are the steps I followed correct for compiling rocsparse? Thanks a lot

@xuhuisheng
Contributor Author

xuhuisheng commented Nov 6, 2020

@angimenez
Please use the rocm-3.9.0 tag or the rocm-3.10.x branch; the develop branch is a work in progress.
I will check the latest code later.
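
For an existing clone, switching off develop might look like this (tag and branch names as given above):

cd rocSPARSE
git fetch --all --tags
git checkout rocm-3.9.0    # or: git checkout rocm-3.10.x

Then rebuild and reinstall as before.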

@angimenez

@xuhuisheng
Now it worked for me, thank you very much

@xuhuisheng
Contributor Author

xuhuisheng commented Nov 6, 2020

@angimenez
Found a comment about gfx803 on the rocSPARSE develop branch: ROCm/rocSPARSE@f8791e9#commitcomment-43334853
We should wait for AMD to fix it.

Update: fixed in ROCm/rocSPARSE@7de1594

@VegetaDTX

@Grench6

@VegetaDTX success?

I apologize for the delayed reply, but I was too busy with other stuff, and it's also quite inconvenient for me to try it on Ubuntu because I have a dual boot and most of my other ML work is not on Ubuntu. So far I haven't had luck, but I haven't tried the latest advice from @xuhuisheng yet.
When I get enough time to try it, I'll post my experience here.

@Grench6

Grench6 commented Nov 11, 2020

@VegetaDTX No problem, I have already tested it, and downgrading works as expected 😃
I wrote a mini-guide here on how to downgrade and install ROCm and tensorflow-rocm, and test it with a benchmark.

@VegetaDTX

@Grench6 Thanks so much for the guide! I am so glad it works. I'll try it as soon as I get some time. I really need it for some of my projects!

@AsimPoptani

AsimPoptani commented Nov 11, 2020

@Grench6 I followed your guide; unfortunately, no success :(

Here is what I got :

gdb python3                                                                                           SIGSEGV(11) ↵  10109  22:38:24
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(No debugging symbols found in python3)
(gdb) run test
Starting program: /usr/bin/python3 test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 8629]
[New Thread 0x7fff9e77c700 (LWP 8630)]
[New Thread 0x7fff9df7b700 (LWP 8631)]
[New Thread 0x7fff9b77a700 (LWP 8632)]
[New Thread 0x7fff96f79700 (LWP 8633)]
[New Thread 0x7fff94778700 (LWP 8634)]
[New Thread 0x7fff91f77700 (LWP 8635)]
[New Thread 0x7fff8f776700 (LWP 8636)]
[New Thread 0x7fff8be58700 (LWP 8637)]
2020-11-11 22:38:47.726848: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
[New Thread 0x7fff8b259700 (LWP 8640)]
2020-11-11 22:38:48.176494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMD GPU ISA: gfx803
coreClock: 1.279GHz coreCount: 36 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 0B/s
2020-11-11 22:38:48.442703: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocblas.so
2020-11-11 22:38:48.458048: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libMIOpen.so
2020-11-11 22:38:48.488252: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocfft.so
2020-11-11 22:38:48.490385: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocrand.so
2020-11-11 22:38:48.490462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-11 22:38:48.490649: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[New Thread 0x7fff838eb700 (LWP 8643)]
[New Thread 0x7fff830ea700 (LWP 8644)]
[New Thread 0x7fff828e9700 (LWP 8645)]
[New Thread 0x7fff820e8700 (LWP 8646)]
[New Thread 0x7fff818e7700 (LWP 8647)]
[New Thread 0x7fff810e6700 (LWP 8648)]
[New Thread 0x7fff808e5700 (LWP 8649)]
[New Thread 0x7fff37d67700 (LWP 8650)]
[New Thread 0x7fff37566700 (LWP 8651)]
[New Thread 0x7fff36d65700 (LWP 8652)]
[New Thread 0x7fff36564700 (LWP 8653)]
2020-11-11 22:38:48.495101: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3999980000 Hz
[Thread 0x7fff36d65700 (LWP 8652) exited]
[New Thread 0x7fff36d65700 (LWP 8654)]
[New Thread 0x7fff35d63700 (LWP 8655)]
[New Thread 0x7fff35562700 (LWP 8656)]
[New Thread 0x7fff34d61700 (LWP 8657)]
[New Thread 0x7ffefffff700 (LWP 8658)]
[New Thread 0x7ffeff7fe700 (LWP 8659)]
[New Thread 0x7ffefeffd700 (LWP 8660)]
[New Thread 0x7ffefe7fc700 (LWP 8661)]
2020-11-11 22:38:48.495740: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fff8807ea10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-11 22:38:48.495751: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
[New Thread 0x7ffefdffb700 (LWP 8662)]
[New Thread 0x7ffefd7fa700 (LWP 8663)]
[New Thread 0x7ffefcff9700 (LWP 8664)]
[New Thread 0x7ffedbfff700 (LWP 8665)]
2020-11-11 22:38:48.496950: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fff3446bc10 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
[Thread 0x7ffefcff9700 (LWP 8664) exited]
2020-11-11 22:38:48.496960: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], AMDGPU ISA version: gfx803
2020-11-11 22:38:48.497021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMD GPU ISA: gfx803
coreClock: 1.279GHz coreCount: 36 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 0B/s
2020-11-11 22:38:48.497048: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocblas.so
2020-11-11 22:38:48.497058: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libMIOpen.so
2020-11-11 22:38:48.497066: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocfft.so
2020-11-11 22:38:48.497077: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocrand.so
2020-11-11 22:38:48.497111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-11 22:38:49.278623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-11 22:38:49.278649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2020-11-11 22:38:49.278654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2020-11-11 22:38:49.278779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7399 MB memory) -> physical GPU (device: 0, name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], pci bus id: 0000:01:00.0)
[New Thread 0x7ffefcff9700 (LWP 8666)]
[Thread 0x7ffefcff9700 (LWP 8666) exited]
[New Thread 0x7fff800e4700 (LWP 8667)]
[New Thread 0x7fff800a0700 (LWP 8668)]
[New Thread 0x7ffefcff9700 (LWP 8669)]
[New Thread 0x7ffeda5c5700 (LWP 8670)]
[New Thread 0x7ffed9dc4700 (LWP 8671)]
[New Thread 0x7ffed95c3700 (LWP 8672)]
[New Thread 0x7ffed8dc2700 (LWP 8673)]
[New Thread 0x7ffebbfff700 (LWP 8674)]
[New Thread 0x7ffeb37fe700 (LWP 8675)]
[New Thread 0x7ffebb7fe700 (LWP 8676)]
[New Thread 0x7ffebaffd700 (LWP 8677)]
[New Thread 0x7ffeba7fc700 (LWP 8678)]
[New Thread 0x7fff8005c700 (LWP 8679)]
[New Thread 0x7fff34460700 (LWP 8680)]
--Type <RET> for more, q to quit, c to continue without paging--

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fffc155bfb0 in device::WaveLimiterManager::getWavesPerSH(device::VirtualDevice const*) const () from /usr/lib/libamdhip64.so.3

@Grench6

Grench6 commented Nov 12, 2020

@AsimPoptani Were you able to run the benchmark? Or at least do the 5 + 2 operation with TF-rocm?

@rkothako

Hi @AsimPoptani, it looks like you are missing something.
Please share the step-by-step procedure you followed.

@cmal

cmal commented Nov 12, 2020

I am getting this too ...
OS: Ubuntu-20.04
CPU: i7 4970
GPU: RX580 8G
Python: 3.6.12

@Djip007

Djip007 commented Nov 14, 2020

/data/jenkins_workspace/centos_pipeline_job_8.1_rel-3.9/rocm-rel-3.9/rocm-3.9-19-20201111/8.1/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

I get the same error with the latest 3.9.1...

@rkothako

As this ticket is already closed, please open a new ticket with all the detailed steps to reproduce, and we can discuss it there.
Thank you.

@mathmax12

mathmax12 commented Nov 20, 2020

@xuhuisheng

UPDATE 2020-11-05: The cause is that rocSPARSE is not compiled for gfx803. After compiling rocSPARSE with AMDGPU_TARGETS=gfx803 and reinstalling the custom rocSPARSE package, the problem is solved.

It is a bug in the rocSPARSE CMake config: the AMDGPU_TARGETS variable was never used.

The pull request has been merged: ROCmSoftwarePlatform/rocSPARSE#213

#1265 is still there.

I got a similar issue in one of the PyTorch Docker images from https://hub.docker.com/r/rocm/pytorch/tags
Could you please show the command you used to compile rocSPARSE?
Thanks

@xuhuisheng
Contributor Author

@mathmax12
Since the issue has been closed, I wrote a doc with details on the gfx803 issue:
https://github.com/xuhuisheng/rocm-build/blob/develop/docs/gfx803.md

@mathmax12

mathmax12 commented Nov 21, 2020

Thanks a lot for that.

I tried to change the two CMake files according to the patch.
rocSPARSE branch: rocm3.9x
/library/CMakeLists.txt

# Target compile options
foreach(target ${AMDGPU_TARGETS})
  target_compile_options(rocsparse PRIVATE --amdgpu-target=${target}:xnack-)
endforeach()

After running ./install.sh -di I got this:


CMake Error at /opt/cmake-3.18.1-Linux-x86_64/share/cmake-3.18/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
  Could not find compiler set in environment variable CXX:

  hipcc.

Call Stack (most recent call first):
  CMakeLists.txt:54 (project)


CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_Fortran_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!

Did I miss something?

@xuhuisheng
Contributor Author

@mathmax12
You have to install rocm-dev and rocm-libs first.
Then CMake will find CXX=/opt/rocm/bin/hipcc automatically.

Or run CXX=/opt/rocm/bin/hipcc ./install.sh -di to specify the hipcc path.
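
Put together, the prerequisite step might look like this on Ubuntu (assuming the ROCm apt repository is already configured; package names as given above):

sudo apt update
sudo apt install rocm-dev rocm-libs
# then, from the rocSPARSE checkout:
CXX=/opt/rocm/bin/hipcc ./install.sh -di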

@borgarpa

borgarpa commented Nov 25, 2020

@VegetaDTX No problem, I have already tested it, and downgrading works as expected
I wrote a mini-guide here on how to downgrade and install ROCm and tensorflow-rocm, and test it with a benchmark.

Hey! Thanks for your guide.

I got the following results after running:

/opt/rocm/bin/rocminfo
/opt/rocm/opencl/bin/clinfo

rocm_clinfo.txt
rocm_rocminfo.txt

I followed it step by step, but I couldn't get it working... When I run the benchmark python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8 --model=resnet50 I get this error: rocm_OOM.txt, which is quite weird since an 8 GB GPU should be able to handle a ResNet50.
Furthermore, when I run rocm-smi, I get this weird result:

/opt/rocm/bin/rocm-smi:816: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if clocktype is 'freq':
/opt/rocm/bin/rocm-smi:901: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if component is 'driver':
/opt/rocm/bin/rocm-smi:923: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if (retiredType is 'all' or \
/opt/rocm/bin/rocm-smi:924: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'retired' and pgType is 'R' or \
/opt/rocm/bin/rocm-smi:924: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'retired' and pgType is 'R' or \
/opt/rocm/bin/rocm-smi:925: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'pending' and pgType is 'P' or \
/opt/rocm/bin/rocm-smi:925: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'pending' and pgType is 'P' or \
/opt/rocm/bin/rocm-smi:926: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'unreservable' and pgType is 'F'):
/opt/rocm/bin/rocm-smi:926: SyntaxWarning: "is" with a literal. Did you mean "=="?
  retiredType is 'unreservable' and pgType is 'F'):
/opt/rocm/bin/rocm-smi:1501: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if component is 'driver':
/opt/rocm/bin/rocm-smi:1938: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ptype is 'R':
/opt/rocm/bin/rocm-smi:1940: SyntaxWarning: "is" with a literal. Did you mean "=="?
  elif ptype is 'P':
/opt/rocm/bin/rocm-smi:2395: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if clkType is 'sclk':
/opt/rocm/bin/rocm-smi:2397: SyntaxWarning: "is" with a literal. Did you mean "=="?
  elif clkType is 'mclk':


========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr   SCLK     MCLK     Fan     Perf  PwrCap  VRAM%  GPU%  
0    34.0c  41.028W  1233Mhz  1000Mhz  23.92%  auto  120.0W   93%   0%    
================================================================================
==============================End of ROCm SMI Log ==============================

Any idea why this might be?

EDIT: I sorted the OOM problem out by following the solutions posted in tensorflow/tensorflow/issues/40751. However, the weird rocm-smi behaviour still remains.

Besides, the reported GPU memory bandwidth seems to be ridiculously small... coreClock: 1.268GHz coreCount: 32 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: -1B/s. Is that normal?
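
For reference, one widely used mitigation for TensorFlow GPU OOM is letting TensorFlow grow its GPU memory allocation on demand instead of reserving almost all VRAM up front; this may or may not be the exact fix from the linked issue:

# Re-run the benchmark with on-demand VRAM allocation:
TF_FORCE_GPU_ALLOW_GROWTH=true python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8 --model=resnet50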

@Grench6

Grench6 commented Nov 25, 2020

@borgarpa The rocm-smi syntax warnings are normal (at least for me). I don't remember if that bandwidth is normal, tbh... maybe you should try this to test it out:

sudo apt-get install rocm-bandwidth-test
rocm-bandwidth-test

@borgarpa

@Grench6 Thanks for the tip. I ran the bandwidth test and this is the result:

........
          RocmBandwidthTest Version: 2.3.11

          Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)


          Device: 0,  AMD Ryzen 5 1600X Six-Core Processor
          Device: 1,  Ellesmere [Radeon RX 470/480/570/570X/580/580X/590],  1f:0.0

          Inter-Device Access

          D/D       0         1         

          0         1         0         

          1         1         1         


          Inter-Device Numa Distance

          D/D       0         1         

          0         0         N/A       

          1         20        0         


          Unidirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         11.238147   

          1         7.108366    25.104243   


          Bdirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         14.147676   

          1         14.147676   N/A

Is it normal that the CPU cannot access the GPU in the Inter-Device Access test?

@ROCmSupport

Hi All,
As this ticket is already closed, we recommend not continuing the discussion here.
Please file any other issue as a separate ticket.
Thanks for understanding.

@Djip007

Djip007 commented Dec 2, 2020

I know this is closed; I'm only reporting the current status of the patch.
With CentOS 8 + ROCm 3.10:

/data/jenkins_workspace/centos_pipeline_job_8.1_rel-3.10/rocm-rel-3.10/rocm-3.10-27-20201120/8.1/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

So the patch is still needed for this version ;)

xuhuisheng changed the title from "ROCm-3.9 crash with gfx803" to "ROCm-3.9, ROCm-3.10 crash with gfx803" on Dec 10, 2020
@staticdev

@xuhuisheng your link in the description is broken; the correct one is https://github.com/xuhuisheng/rocm-build/blob/master/docs/gfx803.md

@da3dsoul

Updated links to info from xuhuisheng. Thanks xuhuisheng. I've not tried it yet, but you guys definitely left a trail of things to try.
https://github.com/xuhuisheng/rocm-build/tree/master/gfx803
https://github.com/xuhuisheng/rocm-gfx803
