Mismatch in ConvHipImplicitGemmV4R1Fwd #2038

Open
JehandadKhan opened this issue Mar 21, 2023 · 26 comments

@JehandadKhan
Contributor

JehandadKhan commented Mar 21, 2023

MIOpen develop is failing due to an issue in one of the static implicit GEMM kernels. Steps to reproduce:

# clone MIOpen develop
mkdir build; cd build
CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS='-Werror'  cmake -DMIOPEN_TEST_FLAGS=' --disable-verification-cache ' -DCMAKE_BUILD_TYPE=release -DBUILD_DEV=On -DMIOPEN_USE_MLIR=ON -DMIOPEN_GPU_SYNC=Off  -DMIOPEN_TEST_BFLOAT16=On -DCMAKE_PREFIX_PATH=/opt/rocm    ..
make -j 32 test_conv2d
 ./bin/test_conv2d --bfloat16 --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 256 32 27 27 --weights 128 32 1 1 --batch_size 256 --input_channels 32 --output_channels 128 --spatial_dim_elements 27 27 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --deterministic 0 --tensor_vect 0 --vector_length 1  --int8_vectorize 0

The following is the current output:

./bin/test_conv2d --bfloat16 --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 256 32 27 27 --weights 128 32 1 1 --batch_size 256 --input_channels 32 --output_channels 128 --spatial_dim_elements 27 27 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --deterministic 0 --tensor_vect 0 --vector_length 1 --output_type int32 --int8_vectorize 0 
FAILED: 0.740628
Iteration: 0
Forward convolution: ConvHipImplicitGemmV4R1Fwd
Input tensor: 256, 32, 27, 27
Weights tensor: 128, 32, 1, 1
Output tensor: 256, 128, 27, 27
Filter: conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1}, 
Max diff: 1.32204e+10
Mismatch at 0: -126 != 4.42919e+09

Interestingly, the issue started to appear in our CI with commit b4e0a67333ee4bbcbbec1203a0260feff2882cfb. However, I have verified that the issue exists even in prior commits, such as f1196f80d251bbeaf0eb6146c7e783fc1c61bd31.

All tests were done on MI100.

This issue is currently blocking new PRs from being merged into develop.

@junliume
Collaborator

@carlushuang could you help take a look?

@carlushuang
Contributor

carlushuang commented Mar 22, 2023

This is a non-xdlops kernel targeting the NCHW layout, but I can't reproduce this failure on an MI200 machine. Was it tested on MI100 or Vega?

@junliume
Collaborator

@JehandadKhan could you clarify the reproduction steps and environment?

@JehandadKhan
Contributor Author

@carlushuang I tested on an MI100 system.

@carlushuang
Contributor

carlushuang commented Mar 23, 2023

@JehandadKhan this solver targets non-xdlops kernels, so its performance will not be good. For MI100/MI200 there are alternative solvers, such as ConvHIPImplicitGemmForwardV4R4XDLops and other asm solvers, that support this case. So if this one has a computation bug, I think we have to disable it. What do you think?

cc @zjing14


@atamazov
Contributor

@carlushuang @junliume I do not remember BF16 precision problems with this solver. We need to find the root cause of the issue before trying to fix or work around it. It could be, for example, a bug in the compiler. What is the ROCm version?

So if this one has a computation bug, I think we have to disable it.

Please look at #936. Maybe this is a kind of verification bug.

@atamazov
Contributor

@JehandadKhan So it is the smoke_solver_ConvHipImplicitGemmV4R1 test that is failing with BF16? This is interesting... What has changed? Maybe tuning?

I recommend renaming this ticket to "smoke_solver_ConvHipImplicitGemmV4R1 test is failing with BF16".

@atamazov
Contributor

atamazov commented Mar 23, 2023

@carlushuang @JehandadKhan @junliume

...For MI100/MI200 there are alternative solvers, such as ConvHIPImplicitGemmForwardV4R4XDLops and other asm solvers, that support this case. So if this one has a computation bug, I think we have to disable it. What do you think?

As we can see from #936, we have had verification problems with this solver for a long time. The solver originates from https://github.com/AMDComputeLibraries/MLOpen/pull/2132, and it seems that nobody has time to maintain it.

Therefore I agree with @carlushuang and would vote for disabling/removing ConvHipImplicitGemmV4R1Fwd, but we need to make sure that performance remains at the same level.

🟡 For now, I will prepare a W/A that disables ConvHipImplicitGemmV4R1Fwd for BF16 on xDLOPs targets.
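For illustration, such a W/A could take roughly the following shape. This is a minimal compilable sketch with stand-in types; the names below (HasXdlops, is_bf16, device_name) are illustrative, not MIOpen's actual API:

#include <string>

// Stand-ins for MIOpen's real types; all names are illustrative only.
struct ProblemDescription { bool is_bf16; };
struct ExecutionContext  { std::string device_name; };

// gfx908 (MI100) and gfx90a (MI200) are the xDLOPs-capable targets here.
static bool HasXdlops(const ExecutionContext& ctx)
{
    return ctx.device_name == "gfx908" || ctx.device_name == "gfx90a";
}

// Shape of the proposed W/A: reject BF16 problems on xDLOPs targets,
// so that the faster XDLops/asm solvers cover them instead.
bool IsApplicableWithWA(const ExecutionContext& ctx, const ProblemDescription& problem)
{
    if(problem.is_bf16 && HasXdlops(ctx))
        return false;
    return true; // ...the solver's existing applicability checks would follow here.
}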

/CC @asroy

@atamazov
Contributor

@carlushuang

...I can't reproduce this failure on an MI200 machine...

The solver is applicable to MI200 (see ConvHipImplicitGemmV4R1Fwd::IsApplicable()). Maybe you have some environment setting that prevents this solver from running.

@atamazov
Contributor

atamazov commented Mar 23, 2023

Now that I have seen the logs, I know the symptom and can explain the root cause.

Symptom:

The smoke_solver_ConvHipImplicitGemmV4R1 test fails. The test checks tuning, among other things. It performs only 5 rounds of tuning (to save time) and then uses the resulting tuning config (PerformanceConfig) to run the solver. The output then fails validation.

The reason for the failure:

  • (1) The precision of ConvHipImplicitGemmV4R1Fwd depends too much on the tuning data (PerformanceConfig). With some specific PerformanceConfig values it produces kernels that do not pass validation.
  • (2) #1997 (Limit Tuning by Time) changes the tuning process: the order of tuning configs before and after that PR is different. Moreover, before #1997 the order of tuning configs was stable; now it is randomized (each time tuning is run, the order is different), as sketched below. Therefore all the solver tests that use MIOPEN_DEBUG_TUNING_ITERATIONS_MAX now produce different tuning configs. As a result, these tests now validate different binary kernels.

Specifically, the kernel produced during the smoke_solver_ConvHipImplicitGemmV4R1 test accidentally turned out to be one that does not pass the correctness check.
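The interaction can be sketched as follows. This is illustrative code, not MIOpen's implementation: the point is that a shuffled container plus a fixed iteration cap yields a different config subset, and therefore a different winning kernel to validate, on every run.

#include <algorithm>
#include <cstdlib>
#include <random>
#include <vector>

// Illustrative sketch: pick the configs that a capped tuning run will try.
std::vector<int> ConfigsToTune(std::vector<int> all_configs)
{
    // After #1997 the container is traversed in randomized order.
    std::shuffle(all_configs.begin(), all_configs.end(),
                 std::mt19937{std::random_device{}()});

    // MIOPEN_DEBUG_TUNING_ITERATIONS_MAX caps how many configs are tried;
    // the smoke tests set it to 5 to save time.
    std::size_t limit = all_configs.size();
    if(const char* cap = std::getenv("MIOPEN_DEBUG_TUNING_ITERATIONS_MAX"))
        limit = std::min<std::size_t>(limit, std::strtoul(cap, nullptr, 10));

    all_configs.resize(limit); // a different subset -- and winner -- each run
    return all_configs;
}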

The root cause:

I think it matches the root cause of #936. It could be one of these two:

  • (A) The ConvHipImplicitGemmV4R1Fwd solver is invalid. Specifically, it does not comply with the requirement for IsValidPerformanceConfig() listed at Solver/Solution framework #866 (comment), which says that ""Valid" means that ... Result of execution would be numerically correct." In other words, all the PerformanceConfigs used in the tuning process should produce kernels that pass validation (see the sketch after this list).
  • (B) The validation procedure used in our tests often produces false positives and needs to be improved.
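To make requirement (A) concrete, here is a hedged sketch of a generic-search loop (stand-in code, not MIOpen's actual classes). It compares only kernel times and never re-checks numerical correctness, which is exactly why every config accepted by IsValidPerformanceConfig() must already be numerically correct:

#include <limits>
#include <vector>

struct PerformanceConfig { int block_size; }; // stand-in for the real tuning parameters

// Per the #866 contract, accepting a config here asserts that the kernel
// it produces is numerically correct, not merely compilable.
static bool IsValidPerformanceConfig(const PerformanceConfig& c)
{
    return c.block_size > 0 && c.block_size <= 256;
}

static double MeasureKernelTime(const PerformanceConfig&) { return 1.0; } // stub

PerformanceConfig GenericSearch(const std::vector<PerformanceConfig>& all)
{
    PerformanceConfig best{};
    double best_time = std::numeric_limits<double>::max();
    for(const auto& c : all)
    {
        if(!IsValidPerformanceConfig(c))
            continue;                          // invalid configs are never run
        const double t = MeasureKernelTime(c); // only speed is compared;
        if(t < best_time)                      // the output is not re-validated
        {
            best_time = t;
            best = c;
        }
    }
    return best;
}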

@atamazov
Contributor

🟡 According to the analysis above, it is highly likely that #2041 won't unblock the CI. I am going to prepare another W/A that disables tuning for ConvHipImplicitGemmV4R1Fwd during its smoke test.

@junliume

This comment was marked as off-topic.

@atamazov
Contributor

atamazov commented Mar 23, 2023

@junliume Because #1997 reorders the tuning configs in the container randomly. Now the first 5 configs are different on every run.

@atamazov
Contributor

@junliume Oh no, this is a totally different issue. Let's discuss it separately.

@junliume
Collaborator

@junliume Because #1997 reorders the tuning configs in the container randomly. Now the first 5 configs are different on every run.

@atamazov Okay, thanks! So when MIOpen fails to compile ConvHIPImplicitGemmForwardV4R4XDLops, it moves on to ConvHipImplicitGemmV4R1Fwd and then hits these numerical issues?
BTW, could you review whether 277e35c is the right place to suppress these hipRTC warnings? Or might there be other places?

@atamazov
Contributor

@junliume This is a different issue. Let's hide the comments about the warnings to avoid mixing things up.

@atamazov
Contributor

@junliume What I see is:

[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GetAllConfigs] ConvHipImplicitGemmV4R1Fwd: Searching the best solution among 33...
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 0/0/5 0.23334, best within recent 1: 0.23334 #0 16,32,8,2,2,2,4,4,2,4,8,1,16,1,4,32, ETA:0 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 1/0/5 0.23334, best within recent 1: 0.750709 #1 16,32,4,2,4,4,2,4,2,4,4,1,16,1,4,16, ETA:53.7877 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 2/0/5 0.23334, best within recent 1: 0.285917 #2 16,128,4,2,4,4,4,4,4,4,4,2,16,2,2,128, ETA:28.3322 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 3/0/5 0.114943, best within recent 1: 0.114943 #3 16,128,16,2,4,4,4,4,4,4,16,1,16,1,2,128, ETA:15.7516 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GenericSearch] Done: 5/0/5, best #3 0.114943 16,128,16,2,4,4,4,4,4,4,16,1,16,1,2,128
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GenericSearch] ...Score: 2.44989 (default time 0.281597)
[2023-03-23T18:53:48.490Z] FAILED: 0.535381
[2023-03-23T18:53:48.490Z] Max diff: 255
[2023-03-23T18:53:48.490Z] Mismatch at 0: 30 != 129

No build warnings, just a validation error.

@junliume

This comment was marked as off-topic.

@atamazov
Contributor

why ConvHIPImplicitGemmForwardV4R4XDLops is not picked when it should be. This is about the compilation warnings and should be mentioned in a separate issue (thus hidden here);

As far as I can see, it is the smoke_solver_ConvHipImplicitGemmV4R1 test that fails in our CI. The ConvHIPImplicitGemmForwardV4R4XDLops solver shouldn't be picked in that test.

Maybe you and @JehandadKhan are observing a different problem. Unfortunately, the description at the top does not name the specific test that fails in that case.

@junliume
Collaborator

@atamazov sorry for the confusion; I mixed a separate issue in with this one.
I am running ctest -R smoke_solver_ConvHipImplicitGemmV4R1 and seeing a run-to-run issue (it fails about half the time). The passing and failing logs are attached:
failing_mark.log
passing_mark.log

@atamazov
Contributor

atamazov commented Mar 24, 2023

@junliume Thanks for the logs. The instability is due to the randomization of tuning configs introduced in #1997. In your logs, this config passes:

ConvHipImplicitGemmV4R1Fwd:16,64,8,2,2,2,4,4,4,4,8,2,16,1,4,64

This fails:

ConvHipImplicitGemmV4R1Fwd:16,64,16,2,2,2,4,4,4,4,16,1,16,1,4,64

Note that both logs end with

smoke_solver_ConvHipImplicitGemmV4R1 ...***Failed  Error regular expression found in output. Regex=[(FAILED)|(Error)|(failed)]

This is because:

@atamazov
Contributor

atamazov commented Mar 27, 2023

@junliume Some clarification about suspected reason (B) listed at #2038 (comment), "The validation procedure used in our tests often produces false positives and needs to be improved." The order of computations performed by the kernel under test and by the reference data generator matters and affects the RMS. This is especially important for shortened data types, like FP16. When the computation orders become too different, the RMS may exceed the tolerance limit we have set, even if the kernel under test performs all the necessary operations correctly.

There are validation algorithms that do not depend on the order of computations, but it would take a huge amount of work and time to replace the existing verification algorithms (which is why it hasn't been done yet).
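A minimal standalone example of the effect (plain float here; the precision loss is far more pronounced for FP16/BF16 accumulation):

#include <cstdio>

int main()
{
    // Big value first: each +1.0f falls below float's rounding granularity
    // at 1e8 and is lost entirely.
    float forward = 1.0e8f;
    for(int i = 0; i < 1000; ++i)
        forward += 1.0f;

    // Small values first: they accumulate exactly, then the big value is added.
    float small = 0.0f;
    for(int i = 0; i < 1000; ++i)
        small += 1.0f;
    const float reordered = small + 1.0e8f;

    // Prints: forward = 100000000.0, reordered = 100001000.0
    std::printf("forward = %.1f, reordered = %.1f\n", forward, reordered);
}

Both orders compute "the same" sum, yet they disagree by 1000; this is the kind of divergence the RMS check can flag as a mismatch.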

So far I recommend the following:

@JehandadKhan
Contributor Author

@atamazov Can we close this issue?

@atamazov
Contributor

No, because the workaround still exists in our code.
