healthchecks? #266

dudicoco · 2020-01-15T10:54:48Z

Hi,

Is there a way to perform health checks on the controller using kubernetes probes?

Thanks

timmyers · 2020-03-17T20:50:03Z

Something a bit more built-in would be ideal, but the following is a reasonable strategy:

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - '! wget -q -O- localhost:3001/metrics | grep status=\"error\"'
  initialDelaySeconds: 30,
  periodSeconds: 30,

Checks the prometheus metrics to see if there have been any sync errors.

dudicoco · 2020-03-18T08:29:30Z

Thanks @timmyers.
Would we want to restart the pod if there are sync errors? Doesn't seem like it's the right healthcheck to perform.

timmyers · 2020-03-18T13:29:45Z

It's definitely not ideal and I can think of cases it wouldn't be what you want. For our set-up though, generally there are 0 sync errors unless something seems wrong with the pod and all of its communication is failing. A restart seems to fix it, so this might be better than no health check.

aramse · 2020-03-18T13:47:10Z

Agreed with @timmyers -- we also have syncing stop occurring without producing any errors in the logs, and a kubectl rollout restart kubernetes-external-secrets magically fixes it. I've tried the following and so far so good, but @timmyers's solution seems usable too, albeit both are of course less ideal than something more built-in.

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - "cd ~ && wget -O es.json --no-check-certificate --header \"Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\" \"https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/apis/kubernetes-client.io/v1/namespaces/default/externalsecrets\""
  periodSeconds: 30

My backup plan may be to restart it every 5 min/run it as a CronJob 😄

dudicoco · 2020-03-24T13:14:22Z

Thanks guys, i'll implement the probe suggested by @timmyers until a proper healthcheck is added.

dudicoco · 2020-03-24T13:17:15Z

@timmyers @aramse do you have a suggestion for a readiness probe as well for a proper rolling update?

dudicoco · 2020-03-24T13:29:28Z

I've ended up removing the livenessProbe right away as it was restarting the container due to errors that are not related to the pod's communications. The container can function as expected and throw errors due to a variety of reasons such as lack of permissions to a specific secret.

aramse · 2020-03-24T14:55:03Z

I would have readiness probe just be the same check as liveness in this case.

And that's unfortunate :( I have been pretty good with my probe so far (API call to get ExternalSecrets in the default namespace), but I'm still not sure if this accounts for all bad states that this controller can get into.

dudicoco · 2020-03-24T15:27:31Z

@aramse so far I have only tried the solution offered by @timmyers.
Can you please elaborate on your solution for the probe?

aramse · 2020-03-24T16:30:07Z

It's just a GET against the Kubernetes API to retrieve ExternalSecret resources from the default namespace. It's the same effect as kubectl get externalsecrets (but kubectl binary is not installed in the controller, so not a convenient option).

Again, not saying this is a comprehensive fix by any means, but in my case the controller gets into a state where it's unable to detect new ExternalSecret resources, and it's magically fixed by a restart, so this should hopefully cover that at least.

moolen · 2020-12-05T12:30:41Z

I didn't test wget liveness probe and i don't think it actually solves the problem here because the open tcp connection from the watch request is being dropped on a different network hop and checking if we can open a new connection does not detect that particular issue. The livenessprobe should just work ™️.

I'd propose to have a /healthz http endpoint that checks if 1) the reconciliation loop is running and 2) we're able to pull ExternalSecrets from the kube-apiserver.

github-actions · 2021-04-21T02:26:15Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions · 2021-05-21T02:36:38Z

This issue was closed because it has been stalled for 30 days with no activity.

dudicoco mentioned this issue Dec 4, 2020

The operator hangs and stops polling/upserting secrets #362

Closed

Flydiverny added the core General to KES and not backend specific label Jan 21, 2021

github-actions bot added the Stale label Apr 21, 2021

github-actions bot closed this as completed May 21, 2021

gabegorelick mentioned this issue Nov 1, 2021

livenessProbe #863

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

healthchecks? #266

healthchecks? #266

dudicoco commented Jan 15, 2020

timmyers commented Mar 17, 2020

dudicoco commented Mar 18, 2020

timmyers commented Mar 18, 2020

aramse commented Mar 18, 2020 •

edited

Loading

dudicoco commented Mar 24, 2020

dudicoco commented Mar 24, 2020

dudicoco commented Mar 24, 2020 •

edited

Loading

aramse commented Mar 24, 2020

dudicoco commented Mar 24, 2020

aramse commented Mar 24, 2020 •

edited

Loading

moolen commented Dec 5, 2020

github-actions bot commented Apr 21, 2021

github-actions bot commented May 21, 2021

healthchecks? #266

healthchecks? #266

Comments

dudicoco commented Jan 15, 2020

timmyers commented Mar 17, 2020

dudicoco commented Mar 18, 2020

timmyers commented Mar 18, 2020

aramse commented Mar 18, 2020 • edited Loading

dudicoco commented Mar 24, 2020

dudicoco commented Mar 24, 2020

dudicoco commented Mar 24, 2020 • edited Loading

aramse commented Mar 24, 2020

dudicoco commented Mar 24, 2020

aramse commented Mar 24, 2020 • edited Loading

moolen commented Dec 5, 2020

github-actions bot commented Apr 21, 2021

github-actions bot commented May 21, 2021

aramse commented Mar 18, 2020 •

edited

Loading

dudicoco commented Mar 24, 2020 •

edited

Loading

aramse commented Mar 24, 2020 •

edited

Loading