test failures and crashes on 580 #92
Running on master still has failing tests, but way fewer:
The matrix multiplication still crashes.
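For reference, a minimal sketch of the kind of call that hits it for me (just a plain Float32 GEMM on ROCArrays; the sizes are arbitrary):

```julia
using AMDGPU

# `*` on ROCArrays should dispatch to rocBLAS when it is available,
# and fall back to a generic GPU kernel otherwise.
A = ROCArray(rand(Float32, 32, 32))
B = ROCArray(rand(Float32, 32, 32))
C = A * B
Array(C)  # copy the result back to the host
```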
Here is the manifest:
Seems like it might be a crash in rocBLAS, but I'm not sure since I don't regularly run AMDGPU with it enabled (because it sucks to build). Do you have rocBLAS installed?
I do not think so. I checked with
I checked a couple of times with and without rocblas (by running
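(For anyone else checking: one way to see from Julia whether a system rocBLAS can be located at all. The directories below are just the common ROCm install locations and may differ on other setups.)

```julia
using Libdl

# Returns the resolved library name, or "" if librocblas cannot be found
# on the default loader path or in the listed directories.
find_library(["librocblas"], ["/opt/rocm/lib", "/opt/rocm/rocblas/lib"])
```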
I attempted various debug and serialization flags, as suggested in ROCm/tensorflow-upstream#302 and in https://rocmdocs.amd.com/en/latest/Other_Solutions/Other-Solutions.html, but I did not get any debug info out. Here is my attempt with the entirety of its console output:
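For concreteness, the flags I mean are along these lines (a sketch rather than my exact invocation; the variable names come from the linked ROCm page, and whether AMDGPU.jl's HSA path reads them at all is unclear to me):

```julia
# Set before loading AMDGPU so the runtime sees them.
# Note: these mostly target HCC/HIP, which may be why they have no visible effect here.
ENV["HCC_SERIALIZE_KERNEL"] = "3"   # serialize kernel launches (wait before and after)
ENV["HCC_SERIALIZE_COPY"]   = "3"   # serialize copies
ENV["HSAKMT_DEBUG_LEVEL"]   = "7"   # verbose thunk-level logging
using AMDGPU
```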
I've also noticed that HSAKMT environment variables don't work with AMDGPU.jl. We don't do any stderr capture to my knowledge. Do note that those variables apply to HCC, HIP, and MIOpen, none of which we use in any significant capacity (except for HIP, for device sync, which is not done automatically).
All of this was on rocm 4. I also tried installing , but I ended up downgrading to rocm 3.5.1. Now there are test failures for the current release of AMDGPU.jl:
And here are the tests on the current master branch, which do a bit better but still have errors:
Am I correct in assuming that if I want to use the 580 with AMDGPU.jl, I have to freeze rocm at version 3.5.1 and just hope for "best effort", without any guarantees, given that the device seems to be going out of support in rocm? Should I freeze the AMDGPU.jl version too? Should I expect future versions of AMDGPU.jl to lower the level of support for the 580? Is there a more "official" support table listing the hardware versions, rocm versions, and AMDGPU.jl versions that are tested/supported?
Sigh... now there is a separate problem (on rocm 3.5.1 and AMDGPU#master): matrix multiplication simply gives wrong answers (no crash, just incorrect results):
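A minimal sketch of the kind of check that now fails for me (comparing a matrix-vector product against the CPU result; the sizes are arbitrary):

```julia
using AMDGPU, Test

A = rand(Float32, 64, 64)
x = rand(Float32, 64)

dA = ROCArray(A)
dx = ROCArray(x)

# No error is thrown, but the values that come back do not match the CPU result.
@test Array(dA * dx) ≈ A * x
```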
If you guys have any suggestions where to look for the source of these issues (or whether I should downgrade/upgrade to other versions), let me know. Either way, thanks for your effort in putting this library together! Some community-sourced table of "this hardware ran successfully for me" would be really useful.
I tested this on my Vega system, and I also get a memory access fault. I'll run this under my newly-working debugger in the next day or two. Btw, our CI was running on an RX480 for the longest time, but I had to remove the card because HIP started killing the build process due to not being able to find code for the GPU (stupid problem, I should reproduce it and patch it upstream). I'll probably put the RX480 in another machine and add it to the CI queue so that we ensure that we still have working support.
Is there a way to donate to the CI effort? (Money or compute time, especially if I can get my 580 to do CI for you; I am a competent enough sysadmin to run a Docker container on this computer that is accessible to your CI jobs.) It is in my selfish interest to get a 580 with a configuration similar to mine (Ubuntu, with the same drivers and rocm version) into the CI ;) By the way, as a new user I was definitely very confused about which rocm version I should be using. What version of rocm is used by the CI?
We currently use Buildkite to host CI, which runs under docker-compose, so it's pretty nicely isolated. I'll talk to the JuliaGPU devs and see what they think. Also, the ROCm config is not fixed to a particular version, which is something I would like to fix by providing ROCm libraries as JLLs, but that's complicated by such a config not working on my musl system 😄 It's on the roadmap, though.
While I wait for a response on the CI question, I found that the issue does not turn into a regular device error when running with
Regarding CI: because adding buildkite agents requires sharing our global secret key with the agent's owner, we can't reasonably accept outside CI. However, I plan to set up an RX480 runner and ensure that we run it for all PRs, so that older cards keep working as much as possible. We'll also potentially be getting access to a lot of newer (but still Vega arch) AMD GPUs soon, so hopefully we can use some of them for CI.
In terms of donations from the community, I would appreciate any bug reports, code contributions, or ideas for improvements you and others might have. That's more valuable to me than CI by a long shot 🙂
Sounds great! If this starts working, I will certainly be active in giving feedback. I do have a bunch of projects that would use bitwise operations on integer types, so hopefully I will be able to stress-test that side of the project.
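For example, the sort of workload I have in mind is broadcasted bitwise kernels over integer arrays, roughly like this (just a sketch):

```julia
using AMDGPU

a = ROCArray(collect(UInt32, 0:1023))
b = ROCArray(fill(0x0f0f0f0f, 1024))   # Vector{UInt32} on the device

# The fused broadcast compiles to a single GPU kernel.
c = (a .& b) .⊻ (a .>> 3)
Array(c)
```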
My understanding is that the 580 is going out of support, but for what it's worth, here is a test run and a console session with failures.
Is there any expectation for these tests to ever pass on 580?
Let me know how I can help fix these issues (if possible). I have zero knowledge of the low-level implementation of the GPU support.
A failed attempt at matrix-vector multiplication
The test summary
rocminfo
clinfo