Mismatch in ConvHipImplicitGemmV4R1Fwd #2038

Open
JehandadKhan opened this issue Mar 21, 2023 · 26 comments

@JehandadKhan
Contributor

JehandadKhan commented Mar 21, 2023

MIOpen develop is failing due to an issue in one of the static implicit GEMM kernels. Steps to reproduce:

# clone MIOpen develop
mkdir build; cd build
CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS='-Werror'  cmake -DMIOPEN_TEST_FLAGS=' --disable-verification-cache ' -DCMAKE_BUILD_TYPE=release -DBUILD_DEV=On -DMIOPEN_USE_MLIR=ON -DMIOPEN_GPU_SYNC=Off  -DMIOPEN_TEST_BFLOAT16=On -DCMAKE_PREFIX_PATH=/opt/rocm    ..
make -j 32 test_conv2d
 ./bin/test_conv2d --bfloat16 --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 256 32 27 27 --weights 128 32 1 1 --batch_size 256 --input_channels 32 --output_channels 128 --spatial_dim_elements 27 27 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --deterministic 0 --tensor_vect 0 --vector_length 1  --int8_vectorize 0

The following is the current output:

./bin/test_conv2d --bfloat16 --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 256 32 27 27 --weights 128 32 1 1 --batch_size 256 --input_channels 32 --output_channels 128 --spatial_dim_elements 27 27 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --deterministic 0 --tensor_vect 0 --vector_length 1 --output_type int32 --int8_vectorize 0 
FAILED: 0.740628
Iteration: 0
Forward convolution: ConvHipImplicitGemmV4R1Fwd
Input tensor: 256, 32, 27, 27
Weights tensor: 128, 32, 1, 1
Output tensor: 256, 128, 27, 27
Filter: conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1}, 
Max diff: 1.32204e+10
Mismatch at 0: -126 != 4.42919e+09

Interestingly, the issue started to appear in our CI with commit b4e0a67333ee4bbcbbec1203a0260feff2882cfb. However, I have verified that the issue exists even in prior commits, such as f1196f80d251bbeaf0eb6146c7e783fc1c61bd31.

All tests were done on MI100.

This issue is currently blocking new PRs from being merged into develop.

@junliume
Collaborator

@carlushuang could you help take a look?

@carlushuang
Contributor

carlushuang commented Mar 22, 2023

This is a non-xdlops kernel targeting the NCHW layout, but I can't reproduce this failure on an MI200 machine. Was it tested on MI100 or Vega?

@junliume
Collaborator

@JehandadKhan could you clarify the reproduction steps and environment?

@JehandadKhan
Contributor Author

@carlushuang I tested on an MI100 system.

@carlushuang
Contributor

carlushuang commented Mar 23, 2023

@JehandadKhan this solver targets non-xdlops kernels, so its performance will not be good. For MI100/MI200 there are alternative solvers, such as ConvHIPImplicitGemmForwardV4R4XDLops and other asm solvers, that support this case. So if this one has a computation bug, I think we have to disable it. What do you think?

cc @zjing14


@atamazov
Contributor

@carlushuang @junliume I do not remember BF16 precision problems with this solver. We need to find the root cause of the issue before trying to fix or work around it. It could be, for example, a bug in the compiler. What is the ROCm version?

So if this one has a computation bug, I think we have to disable it.

Please look at #936. Maybe this is a kind of verification bug.

@atamazov
Contributor

@JehandadKhan So it is the smoke_solver_ConvHipImplicitGemmV4R1 test that is failing with BF16? This is interesting... What has changed? Maybe tuning?

I recommend renaming this ticket to "smoke_solver_ConvHipImplicitGemmV4R1 test is failing with BF16".

@atamazov
Contributor

atamazov commented Mar 23, 2023

@carlushuang @JehandadKhan @junliume

...For MI100/MI200 there are alternative solvers, such as ConvHIPImplicitGemmForwardV4R4XDLops and other asm solvers, that support this case. So if this one has a computation bug, I think we have to disable it. What do you think?

As we can see from #936, we have had verification problems with this solver for a long time. The solver originates from https://github.com/AMDComputeLibraries/MLOpen/pull/2132, and it seems that nobody has time to maintain it.

Therefore I agree with @carlushuang and would vote for disabling/removing ConvHipImplicitGemmV4R1Fwd, but we need to make sure that performance remains at the same level.

🟡 For now, I will prepare a W/A that disables ConvHipImplicitGemmV4R1Fwd for BF16 on xDLOPs targets.
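For illustration, such a W/A could take roughly the following shape. This is a minimal compilable sketch with stand-in types; the names below (HasXdlops, is_bf16, device_name) are illustrative, not MIOpen's actual API:

#include <string>

// Stand-ins for MIOpen's real types; all names are illustrative only.
struct ProblemDescription { bool is_bf16; };
struct ExecutionContext  { std::string device_name; };

// gfx908 (MI100) and gfx90a (MI200) are the xDLOPs-capable targets here.
static bool HasXdlops(const ExecutionContext& ctx)
{
    return ctx.device_name == "gfx908" || ctx.device_name == "gfx90a";
}

// Shape of the proposed W/A: reject BF16 problems on xDLOPs targets,
// so that the faster XDLops/asm solvers cover them instead.
bool IsApplicableWithWA(const ExecutionContext& ctx, const ProblemDescription& problem)
{
    if(problem.is_bf16 && HasXdlops(ctx))
        return false;
    return true; // ...the solver's existing applicability checks would follow here.
}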

/CC @asroy

@atamazov
Contributor

@carlushuang

...I can't reproduce this failure on an MI200 machine...

The solver is applicable to MI200 (see ConvHipImplicitGemmV4R1Fwd::IsApplicable()). Maybe you have some environment setting that prevents this solver from running.

@atamazov
Contributor

atamazov commented Mar 23, 2023

Now that I have seen the logs, I know the symptom and can explain the root cause.

Symptom:

The smoke_solver_ConvHipImplicitGemmV4R1 test fails. The test checks tuning, among other things. It performs only 5 rounds of tuning (to save time) and then uses the resulting tuning config (PerformanceConfig) to run the solver. The output then fails validation.

The reason for the failure:

  • (1) The precision of ConvHipImplicitGemmV4R1Fwd depends too much on the tuning data (PerformanceConfig). With some specific PerformanceConfig values it produces kernels that do not pass validation.
  • (2) #1997 (Limit Tuning by Time) changes the tuning process: the order of tuning configs before and after that PR is different. Moreover, before #1997 the order of tuning configs was stable; now it is randomized (each time tuning is run, the order is different), as sketched below. Therefore all the solver tests that use MIOPEN_DEBUG_TUNING_ITERATIONS_MAX now produce different tuning configs. As a result, these tests now validate different binary kernels.

Specifically, the kernel produced during the smoke_solver_ConvHipImplicitGemmV4R1 test accidentally turned out to be one that does not pass the correctness check.
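The interaction can be sketched as follows. This is illustrative code, not MIOpen's implementation: the point is that a shuffled container plus a fixed iteration cap yields a different config subset, and therefore a different winning kernel to validate, on every run.

#include <algorithm>
#include <cstdlib>
#include <random>
#include <vector>

// Illustrative sketch: pick the configs that a capped tuning run will try.
std::vector<int> ConfigsToTune(std::vector<int> all_configs)
{
    // After #1997 the container is traversed in randomized order.
    std::shuffle(all_configs.begin(), all_configs.end(),
                 std::mt19937{std::random_device{}()});

    // MIOPEN_DEBUG_TUNING_ITERATIONS_MAX caps how many configs are tried;
    // the smoke tests set it to 5 to save time.
    std::size_t limit = all_configs.size();
    if(const char* cap = std::getenv("MIOPEN_DEBUG_TUNING_ITERATIONS_MAX"))
        limit = std::min<std::size_t>(limit, std::strtoul(cap, nullptr, 10));

    all_configs.resize(limit); // a different subset -- and winner -- each run
    return all_configs;
}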

The root cause:

I think it matches the root cause of #936. It could be one of these two:

  • (A) The ConvHipImplicitGemmV4R1Fwd solver is invalid. Specifically, it does not comply with the requirement for IsValidPerformanceConfig() listed at Solver/Solution framework #866 (comment), which says that ""Valid" means that ... Result of execution would be numerically correct." In other words, all the PerformanceConfigs used in the tuning process should produce kernels that pass validation (see the sketch after this list).
  • (B) The validation procedure used in our tests often produces false positives and needs to be improved.
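To make requirement (A) concrete, here is a hedged sketch of a generic-search loop (stand-in code, not MIOpen's actual classes). It compares only kernel times and never re-checks numerical correctness, which is exactly why every config accepted by IsValidPerformanceConfig() must already be numerically correct:

#include <limits>
#include <vector>

struct PerformanceConfig { int block_size; }; // stand-in for the real tuning parameters

// Per the #866 contract, accepting a config here asserts that the kernel
// it produces is numerically correct, not merely compilable.
static bool IsValidPerformanceConfig(const PerformanceConfig& c)
{
    return c.block_size > 0 && c.block_size <= 256;
}

static double MeasureKernelTime(const PerformanceConfig&) { return 1.0; } // stub

PerformanceConfig GenericSearch(const std::vector<PerformanceConfig>& all)
{
    PerformanceConfig best{};
    double best_time = std::numeric_limits<double>::max();
    for(const auto& c : all)
    {
        if(!IsValidPerformanceConfig(c))
            continue;                          // invalid configs are never run
        const double t = MeasureKernelTime(c); // only speed is compared;
        if(t < best_time)                      // the output is not re-validated
        {
            best_time = t;
            best = c;
        }
    }
    return best;
}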

@atamazov
Contributor

🟡 According to the analysis above, it is highly likely that #2041 won't unblock the CI. I am going to prepare another W/A that disables tuning for ConvHipImplicitGemmV4R1Fwd during its smoke test.

@junliume

This comment was marked as off-topic.

@atamazov
Contributor

atamazov commented Mar 23, 2023

@junliume Because #1997 reorders the tuning configs in the container randomly. Now the first 5 configs are different on every run.

@atamazov
Contributor

@junliume Oh no, this is a totally different issue. Let's discuss it separately.

@junliume
Collaborator

@junliume Because #1997 reorders the tuning configs in the container randomly. Now the first 5 configs are different on every run.

@atamazov Okay, thanks! So when MIOpen fails to compile ConvHIPImplicitGemmForwardV4R4XDLops, it moves on to ConvHipImplicitGemmV4R1Fwd and then hits these numerical issues?
BTW, could you review whether 277e35c is the right place to suppress these hipRTC warnings? Or might there be other places?

@atamazov
Contributor

@junliume This is a different issue. Let's hide the comments about the warnings to avoid mixing things up.

@atamazov
Contributor

@junliume What I see is:

[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GetAllConfigs] ConvHipImplicitGemmV4R1Fwd: Searching the best solution among 33...
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 0/0/5 0.23334, best within recent 1: 0.23334 #0 16,32,8,2,2,2,4,4,2,4,8,1,16,1,4,32, ETA:0 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 1/0/5 0.23334, best within recent 1: 0.750709 #1 16,32,4,2,4,4,2,4,2,4,4,1,16,1,4,16, ETA:53.7877 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 2/0/5 0.23334, best within recent 1: 0.285917 #2 16,128,4,2,4,4,4,4,4,4,4,2,16,2,2,128, ETA:28.3322 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 3/0/5 0.114943, best within recent 1: 0.114943 #3 16,128,16,2,4,4,4,4,4,4,16,1,16,1,2,128, ETA:15.7516 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GenericSearch] Done: 5/0/5, best #3 0.114943 16,128,16,2,4,4,4,4,4,4,16,1,16,1,2,128
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GenericSearch] ...Score: 2.44989 (default time 0.281597)
[2023-03-23T18:53:48.490Z] FAILED: 0.535381
[2023-03-23T18:53:48.490Z] Max diff: 255
[2023-03-23T18:53:48.490Z] Mismatch at 0: 30 != 129

No build warnings, just a validation error.

@junliume

This comment was marked as off-topic.

@atamazov
Contributor

why ConvHIPImplicitGemmForwardV4R4XDLops is not picked when it should be. This is about the compilation warnings and should be mentioned in a separate issue (thus hidden here);

As far as I can see, it is the smoke_solver_ConvHipImplicitGemmV4R1 test that fails in our CI. The ConvHIPImplicitGemmForwardV4R4XDLops solver shouldn't be picked in that test.

Maybe you and @JehandadKhan are observing a different problem. Unfortunately, the description at the top does not name the specific test that fails in that case.

@junliume
Collaborator

@atamazov sorry for the confusion; I mixed a separate issue in with this one.
I am running ctest -R smoke_solver_ConvHipImplicitGemmV4R1 and seeing a run-to-run issue (it fails about half the time). The passing and failing logs are attached:
failing_mark.log
passing_mark.log

@atamazov
Contributor

atamazov commented Mar 24, 2023

@junliume Thanks for the logs. The instability is due to the randomization of tuning configs introduced in #1997. In your logs, this config passes:

ConvHipImplicitGemmV4R1Fwd:16,64,8,2,2,2,4,4,4,4,8,2,16,1,4,64

This fails:

ConvHipImplicitGemmV4R1Fwd:16,64,16,2,2,2,4,4,4,4,16,1,16,1,4,64

Note that both logs end with

smoke_solver_ConvHipImplicitGemmV4R1 ...***Failed  Error regular expression found in output. Regex=[(FAILED)|(Error)|(failed)]

This is because:

@atamazov
Contributor

atamazov commented Mar 27, 2023

@junliume Some clarification about suspected reason (B) listed at #2038 (comment), "The validation procedure used in our tests often produces false positives and needs to be improved." The order of computations performed by the kernel under test and by the reference data generator matters and affects the RMS. This is especially important for shortened data types, like FP16. When the computation orders become too different, the RMS may exceed the tolerance limit we have set, even if the kernel under test performs all the necessary operations correctly.

There are validation algorithms that do not depend on the order of computations, but it would take a huge amount of work and time to replace the existing verification algorithms (which is why it hasn't been done yet).
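A minimal standalone example of the effect (plain float here; the precision loss is far more pronounced for FP16/BF16 accumulation):

#include <cstdio>

int main()
{
    // Big value first: each +1.0f falls below float's rounding granularity
    // at 1e8 and is lost entirely.
    float forward = 1.0e8f;
    for(int i = 0; i < 1000; ++i)
        forward += 1.0f;

    // Small values first: they accumulate exactly, then the big value is added.
    float small = 0.0f;
    for(int i = 0; i < 1000; ++i)
        small += 1.0f;
    const float reordered = small + 1.0e8f;

    // Prints: forward = 100000000.0, reordered = 100001000.0
    std::printf("forward = %.1f, reordered = %.1f\n", forward, reordered);
}

Both orders compute "the same" sum, yet they disagree by 1000; this is the kind of divergence the RMS check can flag as a mismatch.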

So far I recommend the following:

@JehandadKhan
Contributor Author

@atamazov Can we close this issue?

@atamazov
Contributor

No, because the workaround still exists in our code.
