Do not fail the generic search if n_runs_total is zero; turns warnings into infos #2266
Visual exam looks good to me, will let CI run through once.
This is conceptually wrong and must be reverted.
The solver must be fixed instead of camouflaging the actual problem.
If we need an urgent patch like this one, then it should at least be formalized as a workaround.
Hi @atamazov, as I understand it, your point is that this function should not be called under these conditions, and that the fix should have been made earlier. If so, the throw belongs at the beginning of the function, not at the end where it is now: if the arguments are not valid, the function should fail at the very start. With the failure at the end, many readers get the impression that something went wrong inside the function itself. That is what led both Jun and me to this patch.
@@ -212,7 +212,7 @@ class HeartBeat
 n_recent != 0u ? (static_cast<float>(n_total - n_recent) *
 (elapsed_cumulative / static_cast<float>(n_recent)) / 1000.0f)
 : 0.0f; // paraniod
-MIOPEN_LOG_W(n_recent << '/' << n_failed << '/' << n_total << ' ' << total_best
+MIOPEN_LOG_I(n_recent << '/' << n_failed << '/' << n_total << ' ' << total_best
The tuning process should be visible with the default logging level (4). Please revert this and other similar changes.
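To illustrate why switching the macro changes what users see, here is a minimal sketch of threshold-based log filtering. The enum, its values other than 4, and the helper names are assumptions made for illustration; this is not MIOpen's logger.

```cpp
#include <sstream>
#include <string>

// Hypothetical severity scale (an assumption): a lower number is more severe.
// The value 4 for warnings follows the "default logging level (4)" in the comment above.
enum class LogLevel { Error = 3, Warning = 4, Info = 5 };

// A message is emitted only if its level does not exceed the active threshold.
inline bool IsEnabled(LogLevel message, LogLevel threshold)
{
    return static_cast<int>(message) <= static_cast<int>(threshold);
}

// Formats a heartbeat-style progress line, or returns an empty string if it is filtered out.
inline std::string Heartbeat(LogLevel message,
                             LogLevel threshold,
                             unsigned n_recent,
                             unsigned n_failed,
                             unsigned n_total)
{
    if(!IsEnabled(message, threshold))
        return {};
    std::ostringstream os;
    os << n_recent << '/' << n_failed << '/' << n_total;
    return os.str();
}
```

Under these assumptions, a progress line logged at the warning level passes a threshold of 4, while the same line logged at the info level is dropped, which is the visibility loss this comment objects to.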
Incorrect. What I said is that the Solver has issues.
This proposal is cosmetic, but I am not against it, provided that the proper fix is made first.
I am okay with checking at the beginning, but this PR removes error reporting entirely.
I think we all agree that the problem is caused by a broken solver (and a broken test, which lets it pass into production). But more generally it is caused by not checking the input/output state of functions at entry, where such checks should be performed. Our codebase is large, errors and inconsistencies are bound to happen, and we should stop assuming all code can be fully trusted and adopt a zero-trust design. But so far this point of view is not supported by anyone. As a result, this PR becomes a necessary practice, and until the solver is fixed, this change should stay as an ugly but reasonable "hotfix".
No, after talking to @shurale-nkn I think that it is caused by incorrect usage of our internal APIs from outside of MIOpen.
This is not a fix for two reasons.
@junliume My advice is to revert the PR and find what causes the error. I think it is incorrect usage by MIGraphX, but I am not sure. Maybe there is some undocumented UB in our internals (which is not good), but that's kind of why they are internals and not public API. I am 100% sure that the only change that may be made to GenericSearch based on the report is to add if (n_runs_total == 0) MIOPEN_THROW(...); at the beginning of the function. But the only effect it would have is skipping the warnings about a config space of size 0 before the error.
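The entry check discussed here can be sketched in a self-contained form. All names below are hypothetical stand-ins, and std::runtime_error takes the place of MIOPEN_THROW so that the example compiles on its own; it is not the actual GenericSearch code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a performance config; not MIOpen's real type.
struct ConfigSketch
{
    int score;
};

// Hypothetical stand-in for the search routine.
inline ConfigSketch SearchSketch(const std::vector<ConfigSketch>& candidates)
{
    const std::size_t n_runs_total = candidates.size();

    // Precondition checked at entry: an empty search space is reported
    // before any work is done, not after the loop has run.
    if(n_runs_total == 0)
        throw std::runtime_error("search failed: n_runs_total is zero");

    ConfigSketch best = candidates.front();
    for(const auto& c : candidates)
    {
        if(c.score > best.score) // placeholder for "this kernel ran faster"
            best = c;
    }
    return best;
}
```

With the check at the top, the error is still raised for an empty search space; only the point at which it is raised moves.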
I am sorry that I was unable to review the PR yesterday, but the time window was 11 PM to 1 AM CEST.
@junliume but is your goal to avoid the throw?
@dmikushin @atamazov @DrizztDoUrden I have asked for this WA as a quick patch. The root cause is in the solvers. Can we leave it as a WA for 5.6.1 and investigate a complete fix in mainline?
It's either the solver or the usage. Considering that in SWDEV three solvers fail in a row, I would bet on usage.
If it fixes the problem of excessive logging and doesn't cause a GPU/CPU segfault, then it can be left as is, I guess. But this basically leads to undefined behaviour, because solvers are not forced to check whether the performance config from GenericSearch is valid, and a default-initialized config is not guaranteed to be valid.
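The risk described here can be shown with a small sketch. The config type, its validity rule, and the consumer function are all assumptions made for illustration, not MIOpen code.

```cpp
#include <stdexcept>

// Hypothetical performance config. Its default-constructed state is
// deliberately not a usable configuration.
struct PerfConfigSketch
{
    int block_size = 0; // 0 means "never set by a search"

    // Assumed validity rule, for illustration only.
    bool IsValid() const { return block_size > 0 && block_size % 64 == 0; }
};

// Hypothetical consumer: rejects an invalid config instead of using it.
// Without this check, block_size == 0 would cause a division by zero below.
inline int GridSizeSketch(const PerfConfigSketch& cfg, int work_items)
{
    if(!cfg.IsValid())
        throw std::invalid_argument("performance config is not valid");
    return (work_items + cfg.block_size - 1) / cfg.block_size; // ceiling division
}
```

In this sketch the consumer guards itself. If a search silently returns a default-constructed config and the consumer has no such guard, the failure surfaces later and further from its cause.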
Absolutely.
* [FIX SW 396203] check launch kernel grid size not beyond 32bit integer (#2263)
* check 32bit launch size
* Revert "Do not fail the generic search if n_runs_total is zero; turns warnings into infos (#2266)"
This reverts commit 6795a81.
* Patch half.hpp file location reorg (#2275)
* [Tuning][MI100][MI210][MI250] Gold18 (#2264)
* gold18 db update, remove detectron2 configs to allow miopen heuristic
* remove invalid performance configs
---------
Co-authored-by: carlushuang <[email protected]>
Co-authored-by: Jun Liu <[email protected]>
This PR proposes a fix for #2253