Display failed shoot constraints #1144
Comments
/priority/2 because we frequently suffer in operations from clusters with broken web hooks and want to improve the situation for ops
Can you please help me understand how we need to improve here? Until now, I assumed that this is sufficient, as these are minor issues preventing a user from performing an action (hibernation / maintenance). Please explain to me what prominent means for you.
Well, the problem is that end users do not do anything about these web hooks, so we often end up with manual effort for such clusters (their own web hooks break our operations). So, you are right, I cannot tell you why, and it's just an assumption that the way we display that information today is not "triggering them enough" to repair their web hooks. I cannot tell you from a UX perspective what would work better. Red, explicit text that maintenance will fail? So Tim and I are asking whether you have an idea. The problem is that forbidding these web hooks (gardener/gardener#3244) isn't trivial either if they and we start fighting/racing (and it's not possible to have a web hook for web hook configurations either). :-( So, if you have ideas on how to make it more prominent (or other ideas), that would be great.
Ok, got it. So it is all about making this more prominent. The solution might be to make users aware that if they do not take care of this situation, the cluster will turn into an error state at some point. Maybe we need a more prominent indicator along with a better explanation.
@grolu Yes.
Oh, it's triggered and then it fails. We had a ton of failed clusters again just now with the VPN reversal. End user web hooks couldn't be called because the VPN was being updated, and we couldn't update it because their web hooks blocked our pods in the kube-system namespace. Those were many catch-22 manual intervention issues of the uglier kind. But otherwise, too, our stuff may not get through, and then end users open tickets and we start scratching our heads until we notice that their web hooks are breaking us.
Yes.
Hmm... interesting idea. Maybe we should indeed mark such clusters as in error and basically say: that's it, you can maintain your clusters yourselves now, but for us it's hands-off. WDYT @dguendisch @rfranzke @timebertt ? Instead of a weak condition that everybody ignores or a racing controller "patching" web hooks, how about marking it as failed (we didn't include that as an option in the discussion, did we)? With enough support/explanation in the Dashboard, people might then react. Even if they ask questions about what this is, that's much easier and faster to answer than debugging strange "bugs".
You mean to have a kind of "fail-fast" and not try to replace components that could fail because of faulty webhooks (and then materialize with other symptoms)?
What does "marking it as failed" mean technically?
And what does this mean? Do you want Gardener/GRM to not even try reconciling the system components if such offending webhooks were found?
@dguendisch No, I meant (and possibly @grolu as well) to mark the cluster as failed. @rfranzke This was what I was thinking:
That will make it very clear (different text, but today is April 1st). WDYT?
I explained that badly, I'll try to make it clearer: I understood that vlerenc meant to put the cluster into
I am more in favor of just making the constraints more visible / similarly visible like for a "failed shoot" in the dashboard. Setting such clusters to
@rfranzke OK, so you are saying that the risk of not reconciling clusters with problematic web hooks for longer is possibly worse in terms of dev/ops effort? OK, that may well be, so let's try your approach first.
@rfranzke suggested "similarly visible like for a "failed shoot" in the dashboard". What do you say?
Hmm, maybe we can flag them as user issues, similar to what we do for errors that require user action. We did some improvements here in the past, so maybe we can follow a similar approach: ACTION REQUIRED along with some explanation of what the user needs to do. If I understand you right, the cluster should not show up as in error state. But we can show this Action Required alert box more prominently on the cluster detail page.
That sounds good @grolu. You mean this? How about treating
Yes, I'll build a PoC and post a screenshot here. However, it seems to me that a failed
Why do you think so? So far, the sole check executed for these constraints is indeed only the one for broken webhooks (ref). However, we might extend this in the future if necessary (no plans yet, though). Generally, my expectation is that any
Okay, maybe I saw some with status Unknown. So if I check for status False, it should always be caused by the user, right?
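To illustrate the check discussed above, here is a minimal sketch that keeps only constraints whose status is explicitly False and ignores Unknown. The constraint shape (`type`, `status`, `message`), the example type names, and the helper name `failedConstraints` are assumptions for illustration and are not taken from the Dashboard code.

```javascript
// Hypothetical helper: the constraint shape ({ type, status, message }) is an
// assumption modelled on Kubernetes-style conditions, not actual Dashboard code.
function failedConstraints (constraints = []) {
  // Only an explicit 'False' counts as failed. 'Unknown' says nothing about a
  // user-caused problem, so it is deliberately left out.
  return constraints.filter(({ status }) => status === 'False')
}

// Example data; the type names and messages are illustrative assumptions.
const exampleConstraints = [
  { type: 'HibernationPossible', status: 'True', message: 'ok' },
  { type: 'MaintenancePreconditionsSatisfied', status: 'False', message: 'problematic webhook found' },
  { type: 'SomeFutureConstraint', status: 'Unknown', message: '' }
]

const failed = failedConstraints(exampleConstraints)
console.log(failed.map(c => c.type))
```

Whether every False status is really user-caused depends on which checks Gardener runs for the constraint, as discussed in this thread.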
We could do something like this. Is this prominent enough? IDK... users already ignored the warning, but maybe a red error with a user error icon will help to make them aware. If we want this (or something similar), we need to talk about the texts as well as the implementation. But first let's clarify whether this is the direction we want to go. Don't get confused by the error message (Shoot cluster has been hibernated.) - I had no cluster with this error, so I faked it.
I was thinking more about two red spots, one on the left for the overall status and one on the right for the new condition/chip, i.e. more red/dramatic. Just imagine your MPS instead of the API/CP icons here: Wouldn't that also fit better? You don't use the "user error icon" for conditions/chips yet, right? If not, that is even more reason not to introduce something new in a place the user is not aware of today, but rather in the place where they have already learned that it's their responsibility to do something. P.S.: Is that some "chip-naming-automation" or can the chip also be just an "M"?
We already show the user icon in the chips in case an error code is assigned to a condition. I'm not sure when we show it, but just recently I saw the user icon next to a chip (maybe @rfranzke knows when this can happen; if I remember right, I saw this while the overall shoot status was not in error state, so just like with the new chip).
Now, also turning the shoot status into an error is exactly what we do not want, right - the shoot is not (yet) in an error state. We do not want to introduce these hacks, as they would not represent the shoot status as it is in the shoot resource. There may be other ways of making the user aware that there is something to do. I'll discuss this with @holgerkoser @petersutter and will get back to you.
Yes, there is an automatic naming in place for unknown conditions. We can overwrite the default naming by adding the condition to the list of known conditions and defining a proper name.
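As a rough sketch of how such a naming fallback could work: known conditions get an explicit short name, and anything else falls back to the capital letters of its type (which would yield "MPS" for a type like `MaintenancePreconditionsSatisfied`). The lookup table, the function name `chipLabel`, and the fallback rule are all assumptions for illustration; the actual Dashboard logic may differ.

```javascript
// Hypothetical list of known conditions with hand-picked short names.
// Both the structure and the entries are illustrative assumptions.
const knownConditions = {
  MaintenancePreconditionsSatisfied: { shortName: 'M', name: 'Maintenance Preconditions' }
}

// Returns the chip label for a condition type: the configured short name if
// the type is known, otherwise its capital letters, otherwise the type itself.
function chipLabel (type) {
  const known = knownConditions[type]
  if (known) {
    return known.shortName
  }
  const initials = type.match(/[A-Z]/g)
  return initials ? initials.join('') : type
}

console.log(chipLabel('MaintenancePreconditionsSatisfied'))
console.log(chipLabel('HibernationPossible'))
```

With this shape, renaming a chip only means adding an entry to the table, while new condition types still get a usable label without any code change.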
Well, OK, we can also start with your approach and see whether people react or not @grolu.
I will get back to you after we have discussed this internally.
Just one additional remark. In my PoC I "fake" this as a user error by adding it manually in the Dashboard like this:
  if (!this.isMaintenancePreconditionSatisfied) {
    return {
      ...this.maintenancePreconditionSatisfiedConstraint,
      codes: [
        'ERR_USER_WEBHOOK'
      ]
    }
  }
@rfranzke
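For readers who want to try the idea outside the component, the PoC snippet above can be restated as a standalone function. This is only a sketch: the function name `asUserError` and the constraint shape are assumptions, and `ERR_USER_WEBHOOK` is the code made up in the PoC, not an error code known to exist in Gardener.

```javascript
// Code used in the PoC above; it is attached client-side only and is an
// assumption, not an error code reported by Gardener.
const USER_WEBHOOK_CODE = 'ERR_USER_WEBHOOK'

// Returns a copy of the constraint tagged as a user error when its status is
// 'False'. Constraints in any other state are returned unchanged.
function asUserError (constraint) {
  if (!constraint || constraint.status !== 'False') {
    return constraint
  }
  // Use a Set so that an already present code is not duplicated.
  const codes = new Set([...(constraint.codes || []), USER_WEBHOOK_CODE])
  return { ...constraint, codes: [...codes] }
}

const original = { type: 'MaintenancePreconditionsSatisfied', status: 'False' }
const tagged = asUserError(original)
console.log(tagged.codes)
```

Returning a copy instead of mutating the input keeps the object that mirrors the shoot resource untouched, which matches the concern raised earlier about not misrepresenting the actual shoot status.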
Can you add the link to the best practices (https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#best-practices-and-warnings) so that the end-users have a chance to check what might be wrong? As for the error code, please open an issue at g/g.
Well, that's less obtrusive/pushy than the banner you suggested out-of-band. We can still give it a try, though. The point is, we want people to start caring.
Yes, we decided to go for this approach first, as in the issue we discussed the user also found this error message... We can still add the additional banner if the changes do not help to improve the situation. But I'm afraid that in that case a banner will not help much either. I think this should be clear enough.
What would you like to be added:
In the shoot status, gardener already publishes the so-called "constraints", e.g.
Also see documentation.
It would be good to prominently display failed constraints in the shoot's details page.
Why is this needed:
Oftentimes, problematic webhook configurations and similar issues might be the cause of other problems in the cluster (e.g. worker nodes not joining the cluster) that are visible in the dashboard, e.g. in the health checks.
/kind enhancement
/area ops-productivity