
Display failed shoot constraints #1144

Closed
timebertt opened this issue Dec 13, 2021 · 28 comments · Fixed by #1193
Labels
area/ops-productivity: Operator productivity related (how to improve operations)
component/dashboard: Gardener Dashboard
kind/enhancement: Enhancement, improvement, extension
priority/2: Priority (lower number equals higher priority)
status/closed: Issue is closed (either delivered or triaged)

Comments

@timebertt
Member

What would you like to be added:

In the shoot status, Gardener already publishes the so-called "constraints", e.g.:

status:
  constraints:
    - type: HibernationPossible
      status: 'False'
      lastTransitionTime: '2021-12-08T08:59:44Z'
      lastUpdateTime: '2021-12-08T02:56:17Z'
      reason: ProblematicWebhooks
      message: >-
        ValidatingWebhookConfiguration "opa-validating-webhook" is problematic:
        webhook "validating-webhook.openpolicyagent.org" with failurePolicy
        "Ignore" and 30s timeout might prevent worker nodes from properly
        joining the shoot cluster
    - type: MaintenancePreconditionsSatisfied
      status: 'False'
      lastTransitionTime: '2021-12-08T08:59:44Z'
      lastUpdateTime: '2021-12-08T02:56:17Z'
      reason: ProblematicWebhooks
      message: >-
        ValidatingWebhookConfiguration "opa-validating-webhook" is problematic:
        webhook "validating-webhook.openpolicyagent.org" with failurePolicy
        "Ignore" and 30s timeout might prevent worker nodes from properly
        joining the shoot cluster

Also see documentation.

It would be good to prominently display failed constraints in the shoot's details page.
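
For illustration, here is a minimal sketch in plain JavaScript of what "failed" means in terms of the resource above, i.e. which constraints a details page would need to highlight. The helper name and the abbreviated shoot object are hypothetical and not part of the actual dashboard code.

// Hypothetical helper: a constraint counts as "failed" when its status is the string 'False'.
function failedConstraints (shoot) {
  const constraints = (shoot.status && shoot.status.constraints) || []
  return constraints.filter(constraint => constraint.status === 'False')
}

// Example shoot, abbreviated from the status shown above.
const shoot = {
  status: {
    constraints: [
      { type: 'HibernationPossible', status: 'False', reason: 'ProblematicWebhooks', message: '...' },
      { type: 'MaintenancePreconditionsSatisfied', status: 'False', reason: 'ProblematicWebhooks', message: '...' }
    ]
  }
}

for (const { type, reason } of failedConstraints(shoot)) {
  console.log(`${type}: ${reason}`) // e.g. "HibernationPossible: ProblematicWebhooks"
}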

Why is this needed:

Oftentimes, problematic webhook configurations and the like are the cause of other problems in the cluster (e.g. worker nodes not joining the cluster) that are visible in the dashboard, e.g. in the health checks.

  • When operators start investigating such issues, it would be helpful to make them aware of the failed constraints early on, because it might speed up the investigation.
  • When users notice such issues, they might be able to help themselves by looking at the failed constraints' messages.

/kind enhancement
/area ops-productivity

@timebertt added the kind/enhancement and component/dashboard labels Dec 13, 2021
@gardener-robot added the area/ops-productivity label Dec 13, 2021
@vlerenc added the priority/2 label Mar 28, 2022
@vlerenc
Member

vlerenc commented Mar 28, 2022

/priority/2 because we frequently suffer in operations from clusters with broken webhooks and want to improve the situation for ops

@grolu
Contributor

grolu commented Mar 29, 2022

Can you please help me understand how we need to improve here?
We already show failed constraints as part of the shoot overview as well as on the status page.

(Screenshots: failed constraints as they are currently shown in the shoot overview and on the status page)

Until now, I assumed that this was sufficient, as these are minor issues preventing a user from performing an action (hibernation/maintenance). Please explain to me what "prominent" means for you.

@vlerenc
Member

vlerenc commented Apr 1, 2022

Well, the problem is that end users do not do anything about these webhooks, and so we often end up with manual effort for such clusters (their own webhooks break our operations). So, you are right, I cannot tell you why; it's just an assumption that the way we display that information today is not "triggering them enough" to repair their webhooks.

I cannot tell you from a UX perspective what would work better. Red, explicit text that maintenance will fail? Tim and I are asking whether you have an idea.

The problem is that forbidding these webhooks (gardener/gardener#3244) isn't trivial either if they and we start fighting/racing (and it's not possible to have a webhook for webhook configurations either). :-(

So, if you have ideas on how to make it more prominent (or other ideas), that would be great.

@grolu
Contributor

grolu commented Apr 1, 2022

Ok, got it. So it is all about making this more prominent.
Just for me to understand this right: the main problem here is that there are clusters where maintenance cannot be triggered due to some webhook configuration. These clusters will not receive updates (and other things that are taken care of during maintenance cannot be performed). Eventually those clusters will break and we have to deal with them, right?

So the solution might be to make the user aware that if they do not take care of this situation, the cluster will turn into an error state at some point. Maybe we need a more prominent indicator along with a better explanation.
Please correct me if I'm wrong. Otherwise we will look in this direction.

@vlerenc
Member

vlerenc commented Apr 1, 2022

Ok, got it. So it is all about making this more prominent.

@grolu Yes.

Just for me to understand this right: the main problem here is that there are clusters where maintenance cannot be triggered due to some webhook configuration.

Oh, it's triggered and then it fails. We had a ton of failed clusters again now with the VPN reversal. End-user webhooks couldn't be called because the VPN was being updated, and we couldn't update it because their webhooks blocked our pods in the kube-system namespace. Those were many catch-22 manual-intervention issues of the uglier kind. But also otherwise, our stuff may not get through, and then end users open tickets and we start scratching our heads until we notice their webhooks that break us.

Eventually those clusters will break and we have to deal with them, right?

Yes.

So the solution might be to make the user aware that if they do not take care of this situation, the cluster will turn into an error state at some point. Maybe we need a more prominent indicator along with a better explanation.
Please correct me if I'm wrong. Otherwise we will look in this direction.

Hmm... interesting idea. Maybe we should indeed mark such clusters as being in error and basically say: that's it, you can now maintain your clusters yourselves, but for us it's hands-off.

WDYT @dguendisch @rfranzke @timebertt? Instead of a weak condition that everybody ignores or a racing controller "patching" webhooks, how about marking it as failed (we didn't include that as an option in the discussion, did we)? With enough support/explanation in the Dashboard, people might then react. Even if they ask questions about what this is, that's much easier and faster to answer than debugging strange "bugs".

@dguendisch
Member

how about marking it as failed

You mean to have a kind of "fail-fast" and not try to replace components that could fail due to faulty webhooks (and then materialize with other symptoms)?
I like this idea 👍

@rfranzke
Member

rfranzke commented Apr 1, 2022

What does "marking it as failed" mean technically?

and not try to replace components that could fail due to faulty webhooks

And what does this mean? Do you want Gardener/GRM to not even try reconciling the system components if such offending webhooks were found?

@vlerenc
Member

vlerenc commented Apr 1, 2022

You mean to have a kind of "fail-fast" and not try to replace components that could fail due to faulty webhooks (and then materialize with other symptoms)?

@dguendisch No, I meant (and possibly @grolu as well) to mark the cluster as failed.

@rfranzke This was what I was thinking:

status:
  lastOperation:
    description: >-
      You have bad web hooks. Fix them. We won't reconcile your cluster anymore. You are on your own. Enough is enough!
    state: Failed
    type: Reconcile
  lastErrors:
    - description: >-
        You have bad web hooks. Fix them. We won't reconcile your cluster anymore. You are on your own. Enough is enough!
      codes:
        - ERR_BAD_WEB_HOOKS

That will make it very clear (different text, but today is April 1st). WDYT?

@dguendisch
Member

I explained that badly; I'll try to make it clearer: I understood that vlerenc meant to put the cluster into the Failed state as soon as offending webhooks are detected, hence stopping all further attempts to reconcile/maintain. This is what I'd consider a "fail-fast" approach: instead of trying to reconcile/maintain that cluster, fail immediately (by putting it in the Failed state), thereby preventing any potential follow-up issues that could arise when arbitrary steps of the reconciliation fail due to bad webhooks.

@rfranzke
Member

rfranzke commented Apr 4, 2022

I am more in favor of just making the constraints more visible, similarly visible to a "failed shoot" in the dashboard. Setting such clusters to Failed can quickly result in a large number of clusters that are not reconciled at all for a long time, which comes with its own challenges. If the clusters break because the user doesn't follow Kubernetes best practices and we can't help, then the user at least becomes aware and will probably check the dashboard for hints as to why. Not reconciling such clusters while keeping them alive for long will probably become problematic at some point.

@vlerenc
Member

vlerenc commented Apr 4, 2022

@rfranzke OK, so you are saying the risk of not reconciling clusters with problematic webhooks for longer is possibly worse in terms of dev/ops effort? OK, that may well be; then let's try your approach first.

@grolu:

Maybe we need a more prominent indicator along with a better explanation.

@rfranzke suggested making it "similarly visible to a "failed shoot" in the dashboard". What do you say?

@grolu
Contributor

grolu commented Apr 5, 2022

Hmm, maybe we can flag them as user issues, similar to what we do for errors that require user action. We made some improvements there in the past; maybe we can follow a similar approach here: ACTION REQUIRED along with some explanation of what the user needs to do. If I understand you right, the cluster should not show up as being in an error state, but we can show this Action Required alert box more prominently on the cluster details page.

@vlerenc
Member

vlerenc commented Apr 5, 2022

That sounds good @grolu. You mean this?

(Screenshot: the existing alert shown for errors that require user action)

How about treating status.constraints like status.conditions and giving them a chip, possibly only for MaintenancePreconditionsSatisfied and only if failed? I say that because I also treat constraints just like conditions in the robot self-diagnosis (see e.g. canary landscape test issue no. 1953):

(Screenshot: robot self-diagnosis output that treats constraints like conditions)
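
A rough sketch of that idea (the function and variable names below are hypothetical and do not reflect the dashboard's actual condition model): append the failed MaintenancePreconditionsSatisfied constraint to the list of conditions that the status chips are rendered from, so it gets a chip like any other condition.

// Hypothetical sketch: treat a failed MaintenancePreconditionsSatisfied
// constraint as one more condition so that it shows up as a chip.
function conditionsForChips (shoot) {
  const status = shoot.status || {}
  const conditions = [...(status.conditions || [])]
  const constraints = status.constraints || []
  const maintenanceConstraint = constraints.find(constraint => constraint.type === 'MaintenancePreconditionsSatisfied')
  if (maintenanceConstraint && maintenanceConstraint.status === 'False') {
    conditions.push(maintenanceConstraint)
  }
  return conditions
}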

@grolu
Contributor

grolu commented Apr 5, 2022

Yes, I'll build a PoC and post a screenshot here. However, it seems to me that a failed MaintenancePreconditionsSatisfied constraint does not necessarily need to be caused by the user; it can also be caused by components that are not running, etc. We need to clarify for which status/reason we should show a user an ACTION REQUIRED error. Maybe someone can provide a bit more background.

@vlerenc
Member

vlerenc commented Apr 5, 2022

@rfranzke Could you PTAL at @grolu's question? I was interpreting the constraint as: the only way this constraint can fail is (a) a webhook by the end user that (b) affects our namespace. It's not about anything other than webhooks. And it's always end users that have broken it, if it's broken. No?

@rfranzke
Member

rfranzke commented Apr 5, 2022

However, it seems to me that a failed MaintenancePreconditionsSatisfied constraint does not necessarily need to be caused by the user; it can also be caused by components that are not running, etc.

Why do you think so?

So far, the sole check executed for these constraints is indeed the one for broken webhooks (ref). However, we might extend this in the future if necessary (no plans yet, though). Generally, my expectation is that any False constraints are caused by end users/their configurations and require their attention.

@grolu
Contributor

grolu commented Apr 5, 2022

Okay, maybe I saw some with status Unknown. So if I check for status False, it should always be caused by the user, right?

@grolu
Contributor

grolu commented Apr 5, 2022

We could do something like this:

(Screenshot: proposed display of the failed constraint as a user error)

Is this prominent enough? IDK... users already ignored the warning, but maybe a red error with a user-error icon will help make them aware. If we want this (or something similar), we need to talk about the texts as well as the implementation. But first let's clarify whether this is the direction we want to go.

Don't get confused by the error message ("Shoot cluster has been hibernated."): I had no cluster with this error, so I faked it.

@vlerenc
Member

vlerenc commented Apr 6, 2022

I was thinking more about two red spots, one on the left for the overall status and one on the right for the new condition/chip, i.e. more red/dramatic. Just imagine your MPS instead of the API/CP icons here:

(Screenshot: a shoot with a red overall status icon and red condition chips)

Wouldn't that also fit better? You don't use the "user error icon" for condition chips yet, right? If not, that is even more reason not to introduce something new in a place the user is not aware of today, but rather in the place where they have already learned that it is their responsibility to do something.

P.S.: Is that some "chip-naming-automation" or can the chip also be just an "M"?

@grolu
Contributor

grolu commented Apr 6, 2022

We already show the user icon in the chips in case an error code is assigned to a condition. I'm not sure when exactly we show it, but just recently I saw the user icon next to a chip (maybe @rfranzke knows when this can happen; if I remember right, I saw this while the overall shoot status was not in an error state, so just like with the new chip).
If I understand you right, you want to put the overall status into user error and introduce the new chip. Of course it is feasible, but I don't like it for several reasons. Introducing the new chip was already kind of "hacky", as we only show it in case the constraint fails, mix conditions with constraints, and, worst of all, fake a new ERROR_CODE locally to make the Dashboard show it as a user error as we do for known user error codes (status.constraints[].codes or status.lastErrors[].codes).
This is what I meant by

we need to talk about the [...] implementation

Now, also turning the shoot status into an error is exactly what we do not want, right? The shoot is not (yet) in an error state. We do not want to introduce these hacks, as that would not represent the shoot status as it is in the shoot resource. There may be other ways of making the user aware that there is something to do. I'll discuss this with @holgerkoser @petersutter and get back to you.

P.S.: Is that some "chip-naming-automation" or can the chip also be just an "M"?

Yes, there is automatic naming in place for unknown conditions. We can override the default naming by adding the condition to the list of known conditions and defining a proper name.
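
Purely to illustrate such an override (the structure and field names below are hypothetical and do not necessarily match the dashboard's actual configuration): a known-conditions map could provide a defined short name instead of the automatically derived one.

// Hypothetical known-conditions entry overriding the automatic chip naming.
const knownConditions = {
  MaintenancePreconditionsSatisfied: {
    shortName: 'M',
    name: 'Maintenance Preconditions Satisfied'
  }
}

function chipLabel (conditionType) {
  const known = knownConditions[conditionType]
  // Fallback is illustrative only: derive a label from the capital letters, e.g. 'MPS'.
  return known ? known.shortName : conditionType.replace(/[a-z]/g, '')
}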

@vlerenc
Member

vlerenc commented Apr 6, 2022

Well, OK, we can also start with your approach and see whether people react or not, @grolu.

@grolu
Contributor

grolu commented Apr 6, 2022

I will get back to you after we have discussed this internally.

@grolu
Contributor

grolu commented Apr 8, 2022

Okay, so this is the current state. This time I configured a cluster so that it is actually in a failed-webhook state, so this is how it would actually look:

(Screenshots: the PoC with a cluster that is actually in a failed-webhook state)

@grolu
Contributor

grolu commented Apr 8, 2022

Just one additional remark: in my PoC I "fake" this as a user error by adding the code manually in the Dashboard like this:

// PoC hack: if the MaintenancePreconditionsSatisfied constraint is not satisfied,
// locally attach a made-up error code so the Dashboard renders it as a user error.
if (!this.isMaintenancePreconditionSatisfied) {
  return {
    ...this.maintenancePreconditionSatisfiedConstraint,
    codes: [
      'ERR_USER_WEBHOOK'
    ]
  }
}

@rfranzke
It would be cleaner and also more future-proof if you could add such an error code to the constraint in the shoot status. This would enable us to remove this hack. Moreover, only Gardener knows whether it is actually a webhook that is causing the constraint to fail; there could be other reasons, right?
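
For comparison, a hedged sketch (hypothetical function name, not actual dashboard code) of how the dashboard side could simplify once Gardener publishes such a code on the constraint itself: it would only need to look for a failed constraint that carries error codes, instead of injecting one.

// Hypothetical: if Gardener set codes on the constraint itself, the dashboard
// could consume the constraint as-is instead of faking a code locally.
function userErrorConstraint (shoot) {
  const constraints = (shoot.status && shoot.status.constraints) || []
  return constraints.find(constraint =>
    constraint.status === 'False' &&
    Array.isArray(constraint.codes) &&
    constraint.codes.length > 0
  )
}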

@rfranzke
Member

rfranzke commented Apr 8, 2022

Can you add the link to the best practices (https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#best-practices-and-warnings) so that the end-users have a chance to check what might be wrong?

As for the error code, please open an issue at g/g.

@grolu
Contributor

grolu commented Apr 12, 2022

We decided to further enhance the action required alert to make the user aware that they need to take action:
(Screenshot: the enhanced action required alert)

@vlerenc
Member

vlerenc commented Apr 12, 2022

Well, that's less obtrusive/intrusive than the banner you suggested out-of-band.

We can still give it a try, though. The point is, we want people to start caring.

@grolu
Contributor

grolu commented Apr 12, 2022

Yes, we decided to go for this approach first, as in the issue we discussed, the user did find this error message... We can still add the additional banner if the changes do not help to improve the situation. But I'm afraid that in that case a banner will not help much either. I think this should be clear enough.

@gardener-robot added the status/closed label May 10, 2022