Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with #2266 and fix within mainline #2270

Closed
junliume opened this issue Jul 21, 2023 · 14 comments
Closed

Issue with #2266 and fix within mainline #2270

junliume opened this issue Jul 21, 2023 · 14 comments

Comments

@junliume
Copy link
Contributor

junliume commented Jul 21, 2023

[Issue]
I explicitly asked for and am responsible for the implementation of #2266.

After some discussion we decided to find the root cause instead of hiding it.

Let's see how to fix it properly. Thanks!

[Content]
more on #2266

I think we all agree that the problem is caused by a broken solver

No, after talking to @shurale-nkn I think that it is caused by incorrect usage of our internal APIs from outside of MIOpen.

this change should stay as an ugly but reasonable "hotfix"

This is not a fix for two reasons.

  1. There is no error in GenericSearch. It correctly reports that it's been provided incorrect data for whatever reason. As a method of error reporting, a log line is printed and an exception is thrown. The exception is handled by find and the solver is skipped as invalid for this case.
  2. This doesn't fix error. It propagates error further to GetSolution. Is returning a default-constructed performance config when solver says no config is valid considered a valid practice? I don't think so. The only way to know that such practice is acceptable is to somehow fetch info from the solver itself. As of now it just assumes that this way of handling is correct (which it is likely not).

@junliume My advice is to revert the PR and to find what causes the error. I think that it is incorrect usage by MIGraphX, but I am not sure. Maybe, there is some undocumented UB in our internals (which is not good), but that's kind of why they are internals and not public API. I am 100% sure that the only change that may be made to GenericSearch from the report is to add

if (n_runs_total == 0)
    MIOPEN_THROW(...);

at the begfining of the function. But the only effect it would have is skipping 0 config space size warnings before the error.

Originally posted by @DrizztDoUrden in #2266 (comment)

@junliume
Copy link
Contributor Author

@DrizztDoUrden @atamazov @shurale-nkn @dmikushin
Indeed it seems there are consequences if we tolerate n_runs_total == 0 without throwing errors:
Now the run fails with:

[2023-07-23T11:45:17.673Z] MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' static_kernel_gridwise_convolution_forward_implicit_gemm_v4r4_xdlops_nchw_kcyx_nkhw.cpp: HIPRTC_ERROR_COMPILATION (6)
[2023-07-23T11:45:17.673Z] MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: static_kernel_gridwise_convolution_forward_implicit_gemm_v4r4_xdlops_nchw_kcyx_nkhw.cpp
[2023-07-23T11:45:17.673Z] MIOpen(HIP): Warning [BuildHip] In file included from /tmp/comgr-8fd95e/input/static_kernel_gridwise_convolution_forward_implicit_gemm_v4r4_xdlops_nchw_kcyx_nkhw.cpp:2:
[2023-07-23T11:45:17.673Z] In file included from /tmp/comgr-8fd95e/include/static_kernel_gridwise_convolution_forward_implicit_gemm_v4r4_xdlops_nchw_kcyx_nkhw.hpp:8:
[2023-07-23T11:45:17.673Z] In file included from /tmp/comgr-8fd95e/include/static_kernel_gridwise_gemm_xdlops_fp16_bfp16.hpp:10:
[2023-07-23T11:45:17.673Z] In file included from /tmp/comgr-8fd95e/include/static_kernel_blockwise_gemm_xdlops.hpp:6:
[2023-07-23T11:45:17.673Z] /tmp/comgr-8fd95e/include/static_kernel_xdlops_gemm.hpp:1018:43: error: function 'GetXdlopsInfo<_Float16, 4, 4>' with deduced return type cannot be used before it is defined
[2023-07-23T11:45:17.673Z]     static constexpr index_t MRepeats   = GetXdlopsInfo().MRepeats;
[2023-07-23T11:45:17.673Z]                                           ^

We should not have these tuning parameter GetXdlopsInfo<_Float16, 4, 4> to begin with, it could either:

  1. MIOpen host/API did not filter out these invalid parameters before passing it to compiler to compiler?
  2. MIOpen is forced to bypass "inapplicable()" check.

Could you please provide some opinions? Thanks!

@atamazov
Copy link
Contributor

atamazov commented Jul 24, 2023

@junliume

  1. MIOpen host/API did not filter out these invalid parameters before passing it to compiler to compiler?

MIOpen performs validation, but it uses Solver's functionality for this. If a Solver has issues (for example, IsValidPerformanceConfig is incorrect), then wrong performanceConfig may sneak onto a compilation path. Note that there are numerous potential reasons of compilation failures (e.g. error in the kernel source, error in the compiler, and different errors in the solver).

  1. MIOpen is forced to bypass "inapplicable()" check.

Perhaps this is possible via the Find 2.0 API? @DrizztDoUrden, am I right?

@atamazov
Copy link
Contributor

[Informative] Please consider this as a continuation of #2253 (which is not really fixed yet).

@atamazov
Copy link
Contributor

atamazov commented Jul 24, 2023

🔴 In order to find the root cause we need clear reproduction instructions.

@junliume
Copy link
Contributor Author

I am working on a minimal reproduce example.

@junliume
Copy link
Contributor Author

@atamazov the log is attached.
pt_resnet50v1.5_pruned0.4_1.2_M_float.log

However, when I tried to run:

MIOPEN_FIND_ENFORCE=3 ./bin/MIOpenDriver convfp16 -n 1 -c 256 -H 56 -W 56 -k 20 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 0

I do not get similar output as shown in the log.

root@zt-dh170-16:/opt/rocm# MIOPEN_FIND_ENFORCE=3 ./bin/MIOpenDriver convfp16 -n 1 -c 256 -H 56 -W 56 -k 20 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 0
MIOpenDriver convfp16 -n 1 -c 256 -H 56 -W 56 -k 20 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 0
Forward Conv solutions available: 2
- id: 88 algo: 0, time: 11.1111 ms, ws: 0, name: GemmFwd1x1_0_1
- id: 85 algo: 1, time: 1000 ms, ws: 0, name: ConvDirectNaiveConvFwd
MIOpen Forward Conv. Algorithm: 0, Solution: 88/GemmFwd1x1_0_1
GPU Kernel Time Forward Conv. Elapsed: 0.029036 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 1, 256, 56, 56, 1, 1, 20,  32112640, 1615872, 125440, 1106, 60, 0.029036
Forward Convolution Verifies OK on CPU reference (0.000258885)

in the attached log:

MIOpen(HIP): Command [LogCmdFindConvolution] ./bin/MIOpenDriver convfp16 -n 1 -c 256 -H 56 -W 56 -k 20 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 0
MIOpen(HIP): Warning [SearchImpl] Searching the best solution in the 4 dim space. Please, be patient...
MIOpen(HIP): Warning [SearchImpl] Runs left: 53, min time so far: 0.03984, curr time: 0.03984 1,64,1,1,0,1,4,4,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 48, min time so far: 0.02816, curr time: 0.06512 1,64,1,1,0,4,4,8,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 43, min time so far: 0.02752, curr time: 0.06832 1,64,1,1,0,4,8,4,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 38, min time so far: 0.02752, curr time: 0.03504 1,64,1,1,0,2,16,8,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 33, min time so far: 0.02752, curr time: 0.04992 1,128,1,1,0,2,4,4,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 28, min time so far: 0.02752, curr time: 0.0288 1,128,1,1,0,1,8,8,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 23, min time so far: 0.02752, curr time: 0.038721 1,128,1,1,0,1,16,4,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 18, min time so far: 0.02752, curr time: 0.065441 1,128,1,1,0,4,16,8,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 13, min time so far: 0.02752, curr time: 0.09616 1,256,1,1,0,4,4,4,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 8, min time so far: 0.02752, curr time: 0.039681 1,256,1,1,0,2,8,8,0
MIOpen(HIP): Warning [SearchImpl] Runs left: 3, min time so far: 0.02752, curr time: 0.048321 1,256,1,1,0,2,16,4,0
MIOpen(HIP): Warning [SearchImpl] Default run, min time so far: 0.02752, default time: 0.04912 16,64,32,32,0,2,16,4,1
MIOpen(HIP): Warning [SearchImpl] ...Score: 1.78488
MIOpen(HIP): Error [GetSolution] invalid performance parameter

@junliume
Copy link
Contributor Author

The original reproducer is

MIOPEN_FIND_ENFORCE=3 migraphx-driver perf resnet50_model.onnx --enable-offload-copy 

So I suspect if it can be reproduced by MIOpenDriver or not at all.

@DrizztDoUrden
Copy link
Contributor

MIGrpahX uses find 2.0 API. There can be a bug in the API implementation.
I'll switch to this and try to debug it. It may reproduce by test_conv_2d_find2, unless it's either incorrect usage (I haven't studied 100% of MIGraphX source related to find 2.0, so I can't vouch for whether it is good or not) or our test missises some options. Which it kind of does, we only try to cover net configs there. That is not the only info the library gets from user.
Also, the MIGraphX command has resnet50_model.onnx. I am not familiar with these formats and MIGraphX caching, but if solution is loaded, IsApplicable is likely to be skipped. Maybe we should attach MIOpen version to solution and force testing IsApplicable for such cases. It would be better to attach specific solver versions or even IsApplicable version, but it would be hard to enforce developers actually incrementing them on every meaningful change.

@junliume
Copy link
Contributor Author

junliume commented Jul 25, 2023

@DrizztDoUrden yes I guess "IsApplicable is likely to be skipped" is the problem.

Here is another reproducible example: when "ConvHipImplicitGemmForwardV4R4Xdlops" solver is used explicitly by passing "-S 64" to the driver:

./bin/MIOpenDriver convfp16 -n 4 -c 88 -H 20 -W 20 -k 336 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64

Warning: Solution id (64) is not reported by the library. Trying it anyway...
MIOpen Forward Conv. Algorithm: -1, Solution: 64/ConvHipImplicitGemmForwardV4R4Xdlops
GPU Kernel Time Forward Conv. Elapsed: 0.013387 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 4, 88, 20, 20, 1, 1, 336,  94617600, 340736, 1075200, 7068, 106, 0.013387
Forward Convolution FAILED: 0.183908 > 0.0082

@junliume
Copy link
Contributor Author

Is there an env var where we can use FIND 2.0 path? @DrizztDoUrden

@junliume
Copy link
Contributor Author

Another issue with #2266 (again my mistake :( ) is

when n_runs_tota==0, we should not execute the lines below:
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/7264cb80f5a75d2edc57ea8f3557af10a10ea2ff/src/include/miopen/generic_search.hpp#L556-L565

@DrizztDoUrden
Copy link
Contributor

Is there an env var where we can use FIND 2.0 path? @DrizztDoUrden

I don't understand the question, find 2.0 in develop calls find 1.0 internally. There no find 2.0 specific files or paths.
If you want to execute the config from command line - the only way is to use test_conv*_find2

@DrizztDoUrden
Copy link
Contributor

@DrizztDoUrden yes I guess "IsApplicable is likely to be skipped" is the problem.

Here is another reproducible example: when "ConvHipImplicitGemmForwardV4R4Xdlops" solver is used explicitly by passing "-S 64" to the driver:

./bin/MIOpenDriver convfp16 -n 4 -c 88 -H 20 -W 20 -k 336 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -S 64

Warning: Solution id (64) is not reported by the library. Trying it anyway...
MIOpen Forward Conv. Algorithm: -1, Solution: 64/ConvHipImplicitGemmForwardV4R4Xdlops
GPU Kernel Time Forward Conv. Elapsed: 0.013387 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 4, 88, 20, 20, 1, 1, 336,  94617600, 340736, 1075200, 7068, 106, 0.013387
Forward Convolution FAILED: 0.183908 > 0.0082

This looks like a bug in ConvHipImplicitGemmForwardV4R4Xdlops. It has produced a wrong result via immediate mode.

@atamazov
Copy link
Contributor

@junliume The problems in the core tuning algorithm are resolved now, and I don't see what else needs to be done to resolve this issue. Could you please review this ticket, and close it or denote unresolved sub-issues. Thank you.

@junliume junliume closed this as completed Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants