Skip to content
This repository has been archived by the owner on Jul 26, 2022. It is now read-only.

healthchecks? #266

Closed
dudicoco opened this issue Jan 15, 2020 · 13 comments
Closed

healthchecks? #266

dudicoco opened this issue Jan 15, 2020 · 13 comments
Labels
core General to KES and not backend specific Stale

Comments

@dudicoco
Copy link

Hi,

Is there a way to perform health checks on the controller using kubernetes probes?

Thanks

@timmyers
Copy link

Something a bit more built-in would be ideal, but the following is a reasonable strategy:

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - '! wget -q -O- localhost:3001/metrics | grep status=\"error\"'
  initialDelaySeconds: 30,
  periodSeconds: 30,

Checks the prometheus metrics to see if there have been any sync errors.

@dudicoco
Copy link
Author

Thanks @timmyers.
Would we want to restart the pod if there are sync errors? Doesn't seem like it's the right healthcheck to perform.

@timmyers
Copy link

It's definitely not ideal and I can think of cases it wouldn't be what you want. For our set-up though, generally there are 0 sync errors unless something seems wrong with the pod and all of its communication is failing. A restart seems to fix it, so this might be better than no health check.

@aramse
Copy link

aramse commented Mar 18, 2020

Agreed with @timmyers -- we also have syncing stop occurring without producing any errors in the logs, and a kubectl rollout restart kubernetes-external-secrets magically fixes it. I've tried the following and so far so good, but @timmyers's solution seems usable too, albeit both are of course less ideal than something more built-in.

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - "cd ~ && wget -O es.json --no-check-certificate --header \"Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\" \"https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/apis/kubernetes-client.io/v1/namespaces/default/externalsecrets\""
  periodSeconds: 30

My backup plan may be to restart it every 5 min/run it as a CronJob 😄

@dudicoco
Copy link
Author

Thanks guys, i'll implement the probe suggested by @timmyers until a proper healthcheck is added.

@dudicoco
Copy link
Author

@timmyers @aramse do you have a suggestion for a readiness probe as well for a proper rolling update?

@dudicoco
Copy link
Author

dudicoco commented Mar 24, 2020

I've ended up removing the livenessProbe right away as it was restarting the container due to errors that are not related to the pod's communications. The container can function as expected and throw errors due to a variety of reasons such as lack of permissions to a specific secret.

@aramse
Copy link

aramse commented Mar 24, 2020

I would have readiness probe just be the same check as liveness in this case.

And that's unfortunate :( I have been pretty good with my probe so far (API call to get ExternalSecrets in the default namespace), but I'm still not sure if this accounts for all bad states that this controller can get into.

@dudicoco
Copy link
Author

@aramse so far I have only tried the solution offered by @timmyers.
Can you please elaborate on your solution for the probe?

@aramse
Copy link

aramse commented Mar 24, 2020

It's just a GET against the Kubernetes API to retrieve ExternalSecret resources from the default namespace. It's the same effect as kubectl get externalsecrets (but kubectl binary is not installed in the controller, so not a convenient option).

Again, not saying this is a comprehensive fix by any means, but in my case the controller gets into a state where it's unable to detect new ExternalSecret resources, and it's magically fixed by a restart, so this should hopefully cover that at least.

@moolen
Copy link
Member

moolen commented Dec 5, 2020

I didn't test wget liveness probe and i don't think it actually solves the problem here because the open tcp connection from the watch request is being dropped on a different network hop and checking if we can open a new connection does not detect that particular issue. The livenessprobe should just work ™️.

I'd propose to have a /healthz http endpoint that checks if 1) the reconciliation loop is running and 2) we're able to pull ExternalSecrets from the kube-apiserver.

@Flydiverny Flydiverny added the core General to KES and not backend specific label Jan 21, 2021
@github-actions
Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the Stale label Apr 21, 2021
@github-actions
Copy link

This issue was closed because it has been stalled for 30 days with no activity.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
core General to KES and not backend specific Stale
Projects
None yet
Development

No branches or pull requests

5 participants