Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not fail the generic search if n_runs_total is zero; turns warnings into infos #2266

Merged
merged 1 commit into from
Jul 19, 2023

Conversation

dmikushin
Copy link
Contributor

This PR proposes a fix for #2253

@dmikushin dmikushin requested a review from junliume July 19, 2023 20:55
Copy link
Contributor

@junliume junliume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Visual exam looks good to me, will let CI run through once.

@junliume junliume added this to the ROCm 5.6 milestone Jul 19, 2023
@junliume junliume merged commit 6795a81 into develop Jul 19, 2023
@atamazov
Copy link
Contributor

This is conceptually wrong and must be reverted.

  • The tunable solver must have >1 performance configs available.
  • If there is only 1 performance configs, then solver is not tunable.
  • If there is no performance configs, then the solver is not applicable.

The solver must be fixed instead of camouflaging the actual problem.

@atamazov
Copy link
Contributor

If we need some urgent patch like this one, then it should be formalized as a workaround at least.

@dmikushin
Copy link
Contributor Author

Hi @atamazov , your point is that this function should not be called under these conditions, and the fix should have been done earlier. If so, then the throw should be at the beginning of the function, not at the end as it is now. That is, if the arguments are not valid, you need to crash the function right at the very beginning. And now the crash is at the end, and this gives many the impression that something has not grown together in the function itself. That, in fact, is what led both Jun and me to this patch.

@@ -212,7 +212,7 @@ class HeartBeat
n_recent != 0u ? (static_cast<float>(n_total - n_recent) *
(elapsed_cumulative / static_cast<float>(n_recent)) / 1000.0f)
: 0.0f; // paraniod
MIOPEN_LOG_W(n_recent << '/' << n_failed << '/' << n_total << ' ' << total_best
MIOPEN_LOG_I(n_recent << '/' << n_failed << '/' << n_total << ' ' << total_best
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tuning process should be visible with the default logging level (4). Please revert this and other similar changes.

@atamazov
Copy link
Contributor

Hi @atamazov , your point is that this function should not be called under these conditions, and the fix should have been done earlier.

Incorrect. What I said is that the Solver has issues.

If so, then the throw should be at the beginning of the function, not at the end as it is now. That is, if the arguments are not valid, you need to crash the function right at the very beginning.

This proposal is cosmetic, but I am not against it provided that the proper fix will be provided first.

@DrizztDoUrden
Copy link
Contributor

DrizztDoUrden commented Jul 20, 2023

I am okay with checking at the beginning, but this PR removes error reporting entirely.
The solver is invalid and should be fixed, but that is out of the scope of generic search: error reported by generic search is not necessary caused by generic search.

@dmikushin
Copy link
Contributor Author

dmikushin commented Jul 20, 2023

I think we all agree that the problem is caused by a broken solver (and a broken test, which lets it pass into production). But more generically it is caused by not checking the input/output state of functions at entry, where such checks should be performed. Our codebase is large, errors and inconsistencies are normal to happen, and we should stop assuming all code can be fully trusted and adopt zero-trust design. But so far this point of view is not supported by anyone. As a result, this PR becomes a necessary practice, and until the solver is not fixed, this change should stay as an ugly but reasonable "hotfix".

@DrizztDoUrden
Copy link
Contributor

DrizztDoUrden commented Jul 20, 2023

I think we all agree that the problem is caused by a broken solver

No, after talking to @shurale-nkn I think that it is caused by incorrect usage of our internal APIs from outside of MIOpen.

this change should stay as an ugly but reasonable "hotfix"

This is not a fix for two reasons.

  1. There is no error in GenericSearch. It correctly reports that it's been provided incorrect data for whatever reason. As a method of error reporting, a log line is printed and an exception is thrown. The exception is handled by find and the solver is skipped as invalid for this case.
  2. This doesn't fix error. It propagates error further to GetSolution. Is returning a default-constructed performance config when solver says no config is valid considered a valid practice? I don't think so. The only way to know that such practice is acceptable is to somehow fetch info from the solver itself. As of now it just assumes that this way of handling is correct (which it is likely not).

@junliume My advice is to revert the PR and to find what causes the error. I think that it is incorrect usage by MIGraphX, but I am not sure. Maybe, there is some undocumented UB in our internals (which is not good), but that's kind of why they are internals and not public API. I am 100% sure that the only change that may be made to GenericSearch from the report is to add

if (n_runs_total == 0)
    MIOPEN_THROW(...);

at the begfining of the function. But the only effect it would have is skipping 0 config space size warnings before the error.

@DrizztDoUrden
Copy link
Contributor

DrizztDoUrden commented Jul 20, 2023

I am sorry that I was unable to review the PR yesterday, but the time window was 11 PM to 1 AM CEST.

@dmikushin
Copy link
Contributor Author

```c++
if (n_runs_total == 0)
    MIOPEN_THROW(...);

at the begfining of the function. But the only effect it would have is skipping 0 config space size warnings before the error.

@junliume but is your goal to avoid the throw?

@junliume
Copy link
Contributor

@dmikushin @atamazov @DrizztDoUrden I have asked for this WA as a quick patch.
Internal user is claiming for the benefit of end-user experience, they are complaining about excessive warning and error messages in their log. Now I realize that these warning levels are there for purpose.

The root cause is in the solvers, can we leave it as a WA for 5.6.1, and investigate complete fix in mainline?

@DrizztDoUrden
Copy link
Contributor

It's either solver or usage. Considering that in SWDEV three solvers fail in a row I would bet on usage.

@DrizztDoUrden
Copy link
Contributor

DrizztDoUrden commented Jul 21, 2023

The root cause is in the solvers, can we leave it as a WA for 5.6.1, and investigate complete fix in mainline?

If it fixes the problem of excessive logging and doesn't cause gpu/cpu segfault, than it can be left as is, I guess, but this basically leads to undefined behaviour as solvers are not forced to check if performance config from GenericSearch is valid and default initialized config is not guaranteed to be valid.

@atamazov
Copy link
Contributor

@junliume

The root cause is in the solvers, can we leave it as a WA for 5.6.1, and investigate complete fix in mainline?

Absolutely.

junliume added a commit that referenced this pull request Jul 25, 2023
@junliume junliume deleted the issue-2253 branch August 28, 2023 23:40
cderb added a commit that referenced this pull request Sep 18, 2023
* [FIX SW 396203] check launch kernel grid size not beyond 32bit integer (#2263)

* check 32bit launch size

* Revert "Do not fail the generic search if n_runs_total is zero; turns warnings into infos (#2266)"

This reverts commit 6795a81.

* Patch half.hpp file location reorg (#2275)

* [Tuning][MI100][MI210][MI250] Gold18 (#2264)

* gold18 db update, remove detectron2 configs to allow miopen heuristic

* remove invalid performance configs

---------

Co-authored-by: carlushuang <[email protected]>
Co-authored-by: Jun Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants