Do not fail the generic search if n_runs_total is zero; turns warnings into infos #2266

dmikushin · 2023-07-19T20:55:10Z

This PR proposes a fix for #2253

…s into infos

junliume

Visual exam looks good to me, will let CI run through once.

atamazov · 2023-07-20T07:27:40Z

This is conceptually wrong and must be reverted.

The tunable solver must have >1 performance configs available.
If there is only 1 performance configs, then solver is not tunable.
If there is no performance configs, then the solver is not applicable.

The solver must be fixed instead of camouflaging the actual problem.

atamazov · 2023-07-20T07:39:00Z

If we need some urgent patch like this one, then it should be formalized as a workaround at least.

dmikushin · 2023-07-20T09:00:39Z

Hi @atamazov , your point is that this function should not be called under these conditions, and the fix should have been done earlier. If so, then the throw should be at the beginning of the function, not at the end as it is now. That is, if the arguments are not valid, you need to crash the function right at the very beginning. And now the crash is at the end, and this gives many the impression that something has not grown together in the function itself. That, in fact, is what led both Jun and me to this patch.

atamazov · 2023-07-20T09:58:03Z

src/include/miopen/generic_search.hpp

@@ -212,7 +212,7 @@ class HeartBeat
                n_recent != 0u ? (static_cast<float>(n_total - n_recent) *
                                  (elapsed_cumulative / static_cast<float>(n_recent)) / 1000.0f)
                               : 0.0f; // paraniod
-            MIOPEN_LOG_W(n_recent << '/' << n_failed << '/' << n_total << ' ' << total_best
+            MIOPEN_LOG_I(n_recent << '/' << n_failed << '/' << n_total << ' ' << total_best


The tuning process should be visible with the default logging level (4). Please revert this and other similar changes.

atamazov · 2023-07-20T10:10:09Z

Hi @atamazov , your point is that this function should not be called under these conditions, and the fix should have been done earlier.

Incorrect. What I said is that the Solver has issues.

If so, then the throw should be at the beginning of the function, not at the end as it is now. That is, if the arguments are not valid, you need to crash the function right at the very beginning.

This proposal is cosmetic, but I am not against it provided that the proper fix will be provided first.

DrizztDoUrden · 2023-07-20T11:39:58Z

I am okay with checking at the beginning, but this PR removes error reporting entirely.
The solver is invalid and should be fixed, but that is out of the scope of generic search: error reported by generic search is not necessary caused by generic search.

dmikushin · 2023-07-20T12:35:12Z

I think we all agree that the problem is caused by a broken solver (and a broken test, which lets it pass into production). But more generically it is caused by not checking the input/output state of functions at entry, where such checks should be performed. Our codebase is large, errors and inconsistencies are normal to happen, and we should stop assuming all code can be fully trusted and adopt zero-trust design. But so far this point of view is not supported by anyone. As a result, this PR becomes a necessary practice, and until the solver is not fixed, this change should stay as an ugly but reasonable "hotfix".

DrizztDoUrden · 2023-07-20T13:12:44Z

I think we all agree that the problem is caused by a broken solver

No, after talking to @shurale-nkn I think that it is caused by incorrect usage of our internal APIs from outside of MIOpen.

this change should stay as an ugly but reasonable "hotfix"

This is not a fix for two reasons.

There is no error in GenericSearch. It correctly reports that it's been provided incorrect data for whatever reason. As a method of error reporting, a log line is printed and an exception is thrown. The exception is handled by find and the solver is skipped as invalid for this case.
This doesn't fix error. It propagates error further to GetSolution. Is returning a default-constructed performance config when solver says no config is valid considered a valid practice? I don't think so. The only way to know that such practice is acceptable is to somehow fetch info from the solver itself. As of now it just assumes that this way of handling is correct (which it is likely not).

@junliume My advice is to revert the PR and to find what causes the error. I think that it is incorrect usage by MIGraphX, but I am not sure. Maybe, there is some undocumented UB in our internals (which is not good), but that's kind of why they are internals and not public API. I am 100% sure that the only change that may be made to GenericSearch from the report is to add

if (n_runs_total == 0)
    MIOPEN_THROW(...);

at the begfining of the function. But the only effect it would have is skipping 0 config space size warnings before the error.

DrizztDoUrden · 2023-07-20T13:14:50Z

I am sorry that I was unable to review the PR yesterday, but the time window was 11 PM to 1 AM CEST.

dmikushin · 2023-07-20T13:19:10Z

```c++
if (n_runs_total == 0)
    MIOPEN_THROW(...);
at the begfining of the function. But the only effect it would have is skipping 0 config space size warnings before the error.

@junliume but is your goal to avoid the throw?

junliume · 2023-07-20T15:07:33Z

@dmikushin @atamazov @DrizztDoUrden I have asked for this WA as a quick patch.
Internal user is claiming for the benefit of end-user experience, they are complaining about excessive warning and error messages in their log. Now I realize that these warning levels are there for purpose.

The root cause is in the solvers, can we leave it as a WA for 5.6.1, and investigate complete fix in mainline?

DrizztDoUrden · 2023-07-20T15:12:47Z

It's either solver or usage. Considering that in SWDEV three solvers fail in a row I would bet on usage.

DrizztDoUrden · 2023-07-21T11:39:25Z

The root cause is in the solvers, can we leave it as a WA for 5.6.1, and investigate complete fix in mainline?

If it fixes the problem of excessive logging and doesn't cause gpu/cpu segfault, than it can be left as is, I guess, but this basically leads to undefined behaviour as solvers are not forced to check if performance config from GenericSearch is valid and default initialized config is not guaranteed to be valid.

atamazov · 2023-07-21T20:34:07Z

@junliume

The root cause is in the solvers, can we leave it as a WA for 5.6.1, and investigate complete fix in mainline?

Absolutely.

… warnings into infos (#2266)" This reverts commit 6795a81.

* [FIX SW 396203] check launch kernel grid size not beyond 32bit integer (#2263) * check 32bit launch size * Revert "Do not fail the generic search if n_runs_total is zero; turns warnings into infos (#2266)" This reverts commit 6795a81. * Patch half.hpp file location reorg (#2275) * [Tuning][MI100][MI210][MI250] Gold18 (#2264) * gold18 db update, remove detectron2 configs to allow miopen heuristic * remove invalid performance configs --------- Co-authored-by: carlushuang <[email protected]> Co-authored-by: Jun Liu <[email protected]>

Do not fail the generic search if n_runs_total is zero; turns warning…

9857c75

…s into infos

dmikushin requested a review from junliume July 19, 2023 20:55

junliume requested review from DrizztDoUrden and averinevg July 19, 2023 21:00

junliume approved these changes Jul 19, 2023

View reviewed changes

junliume added this to the ROCm 5.6 milestone Jul 19, 2023

junliume added the urgency_high label Jul 19, 2023

junliume merged commit 6795a81 into develop Jul 19, 2023

atamazov mentioned this pull request Jul 20, 2023

[Exhaustive Tuning] Search failed Error message when no solution is available #2253

Closed

atamazov reviewed Jul 20, 2023

View reviewed changes

junliume mentioned this pull request Jul 21, 2023

Issue with #2266 and fix within mainline #2270

Closed

junliume added a commit that referenced this pull request Jul 25, 2023

Revert "Do not fail the generic search if n_runs_total is zero; turns…

7264cb8

… warnings into infos (#2266)" This reverts commit 6795a81.

junliume deleted the issue-2253 branch August 28, 2023 23:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not fail the generic search if n_runs_total is zero; turns warnings into infos #2266

Do not fail the generic search if n_runs_total is zero; turns warnings into infos #2266

dmikushin commented Jul 19, 2023

junliume left a comment

atamazov commented Jul 20, 2023

atamazov commented Jul 20, 2023

dmikushin commented Jul 20, 2023

atamazov Jul 20, 2023

atamazov commented Jul 20, 2023

DrizztDoUrden commented Jul 20, 2023 •

edited

Loading

dmikushin commented Jul 20, 2023 •

edited

Loading

DrizztDoUrden commented Jul 20, 2023 •

edited

Loading

DrizztDoUrden commented Jul 20, 2023 •

edited

Loading

dmikushin commented Jul 20, 2023

junliume commented Jul 20, 2023

DrizztDoUrden commented Jul 20, 2023

DrizztDoUrden commented Jul 21, 2023 •

edited

Loading

atamazov commented Jul 21, 2023

Do not fail the generic search if n_runs_total is zero; turns warnings into infos #2266

Do not fail the generic search if n_runs_total is zero; turns warnings into infos #2266

Conversation

dmikushin commented Jul 19, 2023

junliume left a comment

Choose a reason for hiding this comment

atamazov commented Jul 20, 2023

atamazov commented Jul 20, 2023

dmikushin commented Jul 20, 2023

atamazov Jul 20, 2023

Choose a reason for hiding this comment

atamazov commented Jul 20, 2023

DrizztDoUrden commented Jul 20, 2023 • edited Loading

dmikushin commented Jul 20, 2023 • edited Loading

DrizztDoUrden commented Jul 20, 2023 • edited Loading

DrizztDoUrden commented Jul 20, 2023 • edited Loading

dmikushin commented Jul 20, 2023

junliume commented Jul 20, 2023

DrizztDoUrden commented Jul 20, 2023

DrizztDoUrden commented Jul 21, 2023 • edited Loading

atamazov commented Jul 21, 2023

DrizztDoUrden commented Jul 20, 2023 •

edited

Loading

dmikushin commented Jul 20, 2023 •

edited

Loading

DrizztDoUrden commented Jul 20, 2023 •

edited

Loading

DrizztDoUrden commented Jul 20, 2023 •

edited

Loading

DrizztDoUrden commented Jul 21, 2023 •

edited

Loading