
Add GFX803 to TF-ROCm Continuous Integration #479

Closed
Bengt opened this issue May 30, 2019 · 10 comments
Labels
gfx803: issue specific to gfx803 GPUs

Comments

@Bengt

Bengt commented May 30, 2019

Describe the feature and the current behavior/state.

Currently:

[...] GFX803 is not included in the TF-ROCm CI systems [...].
#431 (comment)

Feature:

Add at least one variant of GFX803 GPUs to the TF-ROCm CI systems.
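
A concrete starting point could be a smoke test at the beginning of each gfx803 CI run that verifies TensorFlow-ROCm registers the GPU at all. The snippet below is only a sketch of such a check, not an existing TF-ROCm script:

```python
# Hypothetical gfx803 CI smoke test (a sketch, not an actual TF-ROCm script):
# verify that TensorFlow-ROCm sees a GPU before running the full test suite.
import tensorflow as tf

if __name__ == "__main__":
    # Returns True when the ROCm runtime exposes a usable GPU device.
    assert tf.test.is_gpu_available(), "No ROCm GPU visible to TensorFlow"
    print("GPU device:", tf.test.gpu_device_name())
```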

Who will benefit with this feature?

There have already been uncaught regressions, which are still present in the master branch:

https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues?q=is%3Aissue+is%3Aopen+label%3Agfx803

At the time of writing, these regressions account for over one third (7 out of 20) of all issues in this repository.

These could be caught by regression tests, like this one:

Yes, this is a good [regression] test case[...].
#432 (comment)
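
For illustration, a convergence regression test in the spirit of #432 could look like the sketch below; the model, epoch count, and accuracy threshold are my own assumptions, not the actual test case from that issue:

```python
# Sketch of a convergence regression test (Keras MNIST CNN, as in issue #432).
# Model size, epoch count, and threshold are assumptions for illustration.
import tensorflow as tf

def test_mnist_cnn_converges():
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train[:10000].reshape(-1, 28, 28, 1).astype("float32") / 255.0
    y_train = y_train[:10000]

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train, epochs=1, batch_size=128, verbose=0)

    # A healthy backend reaches well above chance accuracy after one epoch;
    # the gfx803 regression reported in #432 stays stuck near 10%.
    acc = history.history.get("accuracy", history.history.get("acc"))[-1]
    assert acc > 0.5, "MNIST CNN failed to converge: accuracy %.3f" % acc

if __name__ == "__main__":
    test_mnist_cnn_converges()
```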

Everyone using a gfx803 GPU would benefit from not having these, and possibly many other, regressions in the future.

The gfx803 chips have a large installed base, and their support is therefore important to AMD's reputation, so AMD would benefit, too.

As listed in the LLVM sources, there are three chips under the gfx803 target, namely polaris10, polaris11, and fiji. Because they are only node shrinks, this target also covers polaris20, polaris21, and polaris30:

https://github.com/llvm-mirror/llvm/blob/f26b156fd2f58f49d3190a45c07e25c15b0bc0ae/lib/Support/TargetParser.cpp#L91

This means (unless I am still missing some) the following graphics cards are affected:

  • Fiji
    • Fiji XT
      • Radeon Instinct MI8
      • Radeon R9 Fury X
      • Radeon R9 Fury
      • Radeon R9 Nano
    • Capsaicin XT
      • FirePro S9300x2
      • Radeon Pro Duo 2016
  • Polaris 30
    • Radeon RX 590
  • Polaris 20
    • Radeon Pro 580
    • Radeon RX 580
    • Radeon Pro 575
    • Radeon Pro 570
    • Radeon RX 570
  • Polaris 10
    • Radeon Instinct MI6
    • Radeon Pro Duo 2017
    • Radeon Pro WX 7100
    • Radeon Pro WX 7100 Mobile
    • Radeon RX 480
    • Radeon Pro WX 5100
    • Radeon RX 470
  • Polaris 21
    • Radeon Pro 560X
    • Radeon Pro 560
    • Radeon Pro 555X
    • Radeon Pro 555
  • Polaris 11
    • Radeon Pro WX 4100
    • Radeon Pro WX 4170 Mobile
    • Radeon Pro WX 4150 Mobile
    • Radeon Pro WX 4130 Mobile
    • Radeon RX 560D
    • Radeon RX 460

Note that these GPUs range from the mobile parts and the low-end 460, through mid-range GPUs like the 470/480, up to the then-high-end Fury X, and from the workstation Pro Duos to the server-grade FirePro / Instinct cards. So users in virtually every dGPU market segment would benefit.
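
For anyone unsure whether their card falls under this target, the ROCm runtime can report the gfx ISA directly. The snippet below is a sketch that assumes the rocm_agent_enumerator tool from a standard ROCm install under /opt/rocm:

```python
# Sketch: check whether the local machine reports a gfx803 agent.
# Assumes rocm_agent_enumerator from a standard ROCm install; adjust the path
# if ROCm is installed elsewhere.
import subprocess

def has_gfx803():
    out = subprocess.run(
        ["/opt/rocm/bin/rocm_agent_enumerator"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The tool prints one gfx ISA name per detected agent (plus gfx000 for the CPU).
    return "gfx803" in out.split()

if __name__ == "__main__":
    print("gfx803 GPU detected:", has_gfx803())
```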

Using https://gpu.userbenchmark.com/ as a rough estimate of popularity, 5 of the top 10 AMD GPUs and 9 of the top 20 are affected. Therefore, many users who already have compatible AMD hardware are bound to have a frustrating experience when they try to use their officially supported GPUs with TensorFlow-ROCm and then run into the unfixed regressions.

@gaetanbahl

I have an R9 Fury. I can confirm that I am affected by bugs/regressions and may need to switch to Nvidia cards because of this.

@Bengt

Bengt commented Jun 5, 2019 via email

@sunway513

Hi @Bengt, first of all, thank you for the suggestions. We appreciate your effort to summarize the GFX803 impact, and we understand your concerns.
Due to limited resources, the QA and CI coverage of GFX803 boards is not as comprehensive as for the GFX900 (Vega10) and GFX906 (Vega20) targets.
I will convey your message to the team and see if there is anything we can do to improve it.

Regarding the quoted issues, we have pushed out a set of OpenCL fixes for GFX803 targets in the ROCm 2.5 release; we believe the following two issues should have been fixed:
#301
#302
Please try it out with ROCm 2.5 and let us know your feedback.

@Canadauni

Canadauni commented Jun 9, 2019

It looks like the memory allocation issues you've listed have been solved for me. Running a CNN no longer locks up my display as it did in ROCm 2.4. However, running the MNIST CNN example from Keras still shows a failure to converge, as described in #432.

@sunway513

Hi @Canadauni, thank you for confirming that the memory allocation issues have been fixed!
We have been tracking issue #432 in our internal ticket system and will update the issue thread when there is progress.

@sunway513 added the gfx803 (issue specific to gfx803 GPUs) label Jun 10, 2019
@thegatsbylofiexperience

On this note, I'd also like to say that I recently bought an RX 580 to start doing some deep learning on. Is the “New Era of Open GPU Computing” just filled with promises of things working while nothing actually does?

It might seem like good business sense to put more resources into the higher-end cards (from a management perspective that makes complete sense). From a buyer's perspective, it's the opposite: we start with the cheapest card to see if something works and then move up the chain when/if it does. If it doesn't, we move on.

The fact of the matter is that, right now, I regret my purchase decision. Would I upgrade to a Vega GPU in the future based on my current experience? The answer is no.

I am sorry to lecture on this point, but there is a business case here for better support. I hope you pass this on.

@dagamayank

@dbouius-AMD

@gaetanbahl

On this note, I'd also like to say that I recently bought an RX 580 to start doing some deep learning on. Is the “New Era of Open GPU Computing” just filled with promises of things working while nothing actually does?

It might seem like good business sense to put more resources into the higher-end cards (from a management perspective that makes complete sense). From a buyer's perspective, it's the opposite: we start with the cheapest card to see if something works and then move up the chain when/if it does. If it doesn't, we move on.

The fact of the matter is that, right now, I regret my purchase decision. Would I upgrade to a Vega GPU in the future based on my current experience? The answer is no.

I am sorry to lecture on this point, but there is a business case here for better support. I hope you pass this on.

Yup, that sums it up nicely. I would have bought a Radeon VII if my Fury was supported correctly. I went with a 2080 instead.

I still have my Fury and can still help with GFX803 support testing if needed, though.

@Bengt

Bengt commented Jun 5, 2020

Linus Torvalds recently switched to an RX 580, which underlines the relevance of supporting it:

https://t3n.de/news/threadripper-linus-torvalds-arbeitsrechner-hardware-1287213/

@ROCmSupport

Thanks for reaching out.
gfx8 is not a supported configuration anymore.
We are not officially supporting gfx8 devices with ROCm; please refer to the supported hardware section of the ROCm docs: https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support
