Mismatch in ConvHipImplicitGemmV4R1Fwd
#2038
Comments
@carlushuang could you help take a look?
@JehandadKhan could you clarify the reproduction steps and environment?
@carlushuang I tested on an MI100 system.
@JehandadKhan this solver targets non-xdlops kernels, so its performance will not be good. For MI100/MI200 there are alternative solvers like cc @zjing14
@carlushuang @JehandadKhan: our CI is failing consistently on the following: http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/develop/964/pipeline
@carlushuang @junliume I do not remember BF16 precision problems with this solver. We need to find the root reason of the issue before trying to fix or work around it. It could be, for example, a bug in the compiler. What is the ROCm version?
Please look at #936. Maybe this is a kind of verification bug.
@JehandadKhan Is it so that the I recommend renaming this ticket to "smoke_solver_ConvHipImplicitGemmV4R1 test is failing with BF16".
@carlushuang @JehandadKhan @junliume
As we can see from #936, we have had verification problems with this solver for a long time. The solver originated from https://github.com/AMDComputeLibraries/MLOpen/pull/2132, and it seems that nobody has time to maintain it. Therefore I agree with @carlushuang and would vote for disabling/removing ConvHipImplicitGemmV4R1Fwd, but we need to make sure that performance remains at the same level. 🟡 For now, I will prepare a W/A that disables ConvHipImplicitGemmV4R1Fwd for BF16 on xDLOPs targets. /cc @asroy
The solver is applicable for MI200 (please check ConvHipImplicitGemmV4R1Fwd::IsApplicable() to see). Maybe you have some environment setting that prevents this solver from running.
Now I see the logs and know the symptom and can explain the root reason.
Symptom:
The
The reason of failure
Specifically, the kernel produced during the
The root reason
I think it matches the root reason of #936. It could be one of these two:
🟡 According to the analysis above, it is highly likely that #2041 won't unblock the CI. I am going to prepare another W/A that disables tuning for ConvHipImplicitGemmV4R1Fwd during its smoke test.
This comment was marked as off-topic.
@junliume Oh no, this is a totally different issue. Let's discuss it separately.
@atamazov Okay. Thanks! So when MIOpen failed to compile
@junliume This is a different issue. Let's hide the comments about the warning to avoid confusion.
@junliume What I see is:
No build warnings, just a validation error.
This comment was marked as off-topic.
As far as I see it is the
Maybe you and @JehandadKhan are observing some different problem. Unfortunately, the topmost description is missing the name of the specific test that fails in that case.
@atamazov sorry for confusing a separate issue with this one.
@junliume Thanks for the logs. The instability is due to the randomization of tuning configs introduced in #1997. In your logs, this passes:
This fails:
Note that both logs end with
This is because:
@junliume Some clarification about suspected reason (B) listed at #2038 (comment), "The validation procedure used in our tests often produces false positives and needs to be improved." The order of computations performed by the kernel under test and by the reference data generator is important and affects RMS. This is especially important for the shortened data types, like FP16. When the computation orders become too different, the RMS may exceed the tolerance limit we have set, even if the kernel under test does all the necessary operations.
There are validation algorithms that do not depend on the order of computations, but it would take a huge amount of work and time to replace the existing verification algorithms (which is why it has not been done yet). So far I recommend the following:
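The effect of computation order on a reduced-precision result can be illustrated with a minimal sketch. This is not MIOpen's verification code: the helper names `to_fp16`, `fp16_sum`, and `rms_error`, the normalization used in the error metric, and the `TOLERANCE` value are all assumptions made for illustration. The sketch sums the same five terms in two orders, rounding to FP16 after every addition.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_sum(values):
    """Accumulate left to right, rounding to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def rms_error(result, reference):
    """RMS difference normalized by the largest reference magnitude.

    Hypothetical metric for illustration; not the formula MIOpen uses.
    """
    n = len(reference)
    sq = sum((a - b) ** 2 for a, b in zip(result, reference))
    return math.sqrt(sq / n) / max(abs(r) for r in reference)


TOLERANCE = 1e-3  # hypothetical limit, chosen only for this example

terms = [2048.0, 1.0, 1.0, 1.0, 1.0]

# Large term first: each "+ 1.0" is lost, because FP16 values in
# [2048, 4096) are spaced 2 apart and 2049 rounds back to 2048.
large_first = fp16_sum(terms)

# Small terms first: they accumulate to 4.0 before meeting 2048.0.
small_first = fp16_sum(reversed(terms))

exact = sum(terms)  # double precision reference

print(large_first, small_first, exact)  # 2048.0 2052.0 2052.0
print(rms_error([large_first], [small_first]) > TOLERANCE)  # True
```

Both orders perform exactly the same additions, yet the results differ by 4, which is enough to exceed the tolerance assumed here. A kernel and a reference generator that accumulate in different orders can disagree in the same way without either one being wrong.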
@atamazov Can we close this issue?
No, because the workaround still exists in our code.
MIOpen develop is failing due to an issue in one of the static implicit GEMM kernels. Steps to reproduce:
Following is the current output
Interestingly the issue has started to appear in our CI since commit:
b4e0a67333ee4bbcbbec1203a0260feff2882cfb
However, I have verified that the issue exists even in prior commits such as f1196f80d251bbeaf0eb6146c7e783fc1c61bd31
All tests were done on MI100.
This issue is currently blocking new PRs from being merged into develop.