[BF16] ConvHipImplicitGemmBwdDataV1R1: Memory access faults. #309
Comments
@atamazov Thanks for reporting the issue. I was able to reproduce it on develop. One thing I notice here: it skipped all solvers that need a workspace, including the bwddatav1r1 igemm solver.
It raises a couple of questions
I just ran with -F 2. It tells me that the segfault comes from the SetTensor kernel, which initializes the workspace to 0 before the bwdv1r1 kernel runs.
This is similar to what I am currently debugging for my PR #305. I suspect something is going wrong in creating the workspace buffer after the new invoker re-architecture. Nothing has changed on Radeon VII pertaining to the bwdv1r1 kernel except the move to the invoker design. Please point me to the invoker design doc or any PR describing its design.
@atamazov I tried on rocm3.1 using the rocm/miopen-private:rocm3.1-tf1.15-dev-modified_clamp_device docker, which uses hcc. It works fine:
./bin/MIOpenDriver convbfp16 -x 3 -y 3 -W 54 -H 54 -c 64 -n 8 -k 64 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -V 1 -s 0 -F 2 -w 2 -t 1 -i 6
@TejashShah You can try
Yes, it fixes the issue in rocm3.5, which points to a hipclang compiler issue. One more thing: I had suspected the SetTensor kernel as the cause of the segfault because it was the last entry in the earlier log. However, the lowered optimization level applies only to HIP kernels, not to OCL kernels such as SetTensor. This leads me to conclude that the actual segfault came from the bwdv1r1 kernel and that, for some reason, the log was running behind the actual kernel execution. So, until the compiler fix arrives, #310 could be a temporary workaround.
This is protection implemented in Invokers.
@TejashShah Could you please check on ROCm 3.3 with a number of BF16 configs and confirm that it is free of this problem? Thanks.
Most likely the program is terminated before the relevant log line is printed (console output is buffered). This happens very often in case of fatal errors.
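To illustrate the buffering point above, here is a minimal C++ sketch, not MIOpen's actual logging code. The names `log_line_flushed`, `log_line_buffered`, and `CountingBuf` are hypothetical and exist only for this example. It shows that a line written without a flush can remain in the stream buffer, while `std::flush` forces the buffer to sync, so the line is much more likely to be visible if the process dies right afterwards.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (an assumption, not an MIOpen API): write one log line
// and flush immediately so it is handed to the underlying buffer's sync().
inline void log_line_flushed(std::ostream& os, const std::string& msg)
{
    os << msg << '\n' << std::flush;
}

// Same line without a flush; it may sit in the stream buffer until a later
// flush, and can be lost if the process is killed before that happens.
inline void log_line_buffered(std::ostream& os, const std::string& msg)
{
    os << msg << '\n';
}

// Demonstration-only stream buffer that counts how often a sync is requested.
class CountingBuf : public std::stringbuf
{
public:
    int syncs = 0;

protected:
    int sync() override
    {
        ++syncs;
        return std::stringbuf::sync();
    }
};
```

With a `std::ostream` built on `CountingBuf`, calling `log_line_buffered` leaves `syncs` unchanged, while `log_line_flushed` increments it. When chasing a crash like this one, flushing after each log line (or logging to an unbuffered stream such as `std::cerr`) makes the last printed line a more reliable indicator of where the failure occurred.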
It passed.
@TejashShah Worth another JIRA ticket for hip-clang
@asroy Yes, I am trying to pinpoint the particular instruction causing it. Apparently, rocm-debug-agent doesn't work in rocm3.5.
@daniellowell Right now we have a workaround. I recommend checking whether the problem is fixed in a recent compiler and, if so, disabling the workaround. Could you please make an assignment? AFAICS http://ontrack-internal.amd.com/browse/SWDEV-243048 is not resolved yet.
I tried rocm3.9. These cases still fail with -O3 but pass with -O1. However, they no longer hit a memory access fault; instead, they produce wrong results with -O3. This error can be captured by: which is testing the same as the MIOpenDriver cmd in the problem description. ===== output with -O1: ===== output with -O3:
Tried rocm/miopen-private:compute-rocm-dkms-staging-4309. The issue is gone.
@asroy Great! We can switch the workaround OFF and then close the issue as soon as we see that the issue is gone in some ROCm release.
As per http://ontrack-internal.amd.com/browse/SWDEV-264644, the compiler fix is in mainline.
The fix is expected in the 4.2 release. For validation, refer to the SWDEV ticket for a docker image with the release candidate.
Radeon VII, vanilla ROCm 3.5: