This repository has been archived by the owner on Jul 26, 2022. It is now read-only.

The operator hangs and stops polling/upserting secrets #362

Closed
vasrem opened this issue Apr 28, 2020 · 24 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@vasrem
Contributor

vasrem commented Apr 28, 2020

After the operator has been running for some time, it suddenly stops processing. The sync_calls Prometheus metric stops accumulating, no new logs are produced, there are no errors in the logs, and no new or existing secrets are processed any more. To mitigate the issue, one has to restart the operator. Prometheus metrics for CPU, memory and network traffic are shown below.
CPU:
Screenshot from 2020-04-28 09-19-47
Memory:
Screenshot from 2020-04-28 09-21-36
Traffic:
Screenshot from 2020-04-28 09-21-25

  • Version: 3.1
  • Backend: AWS Secrets Manager
  • Kubernetes: 1.15
@Flydiverny added the bug label on Apr 28, 2020
@Flydiverny
Member

Anything interesting in the last logs produced before it stops? Does there seem to be any pattern, i.e. does it always stop at the same place? Not sure where to start digging for this one.

Seeing how all network traffic essentially dies, it seems reasonable that the watcher somehow disconnects and either dies or just hangs.

The next time you get this state can you try creating a new external secret or editing an existing one and see if that yields any reaction?

@vasrem
Contributor Author

vasrem commented Apr 28, 2020

Hey @Flydiverny, nothing interesting in the logs, as I already mentioned. Also, the last log entry is different every time and it's about upserting a random secret.

The next time you get this state can you try creating a new external secret or editing an existing one and see if that yields any reaction?

Tried adding an ExternalSecret CR and that's how I figured out it's broken. It didn't reconcile it at all.

@anarcher

anarcher commented May 6, 2020

Hi, I have the same problem. When the operator (3.1.0) has been running for a long time, it hangs and stops polling and upserting secrets, but I can't see any logs related to this situation ㅠ.ㅠ

@MasayaAoyama

Hi, we have the same problem with the GCP backend, so the bug is probably located in common code.

version: 3.1.0
backend: gcp

@2ffs2nns

same issue

Version: 3.1
Backend: AWS Secrets Manager
Kubernetes: 1.14/EKS

@jseriff

jseriff commented Jun 30, 2020

Seems widespread as we are also seeing it

Version: 4.0
Backend: Vault
Kubernetes: 1.16 GKE

As mentioned above, it stops with no interesting logs or statuses. Killing the pod generates a new pod that operates fine for several hours or days before freezing again.

@Flydiverny added the help wanted label on Jun 30, 2020
@ckcks12

ckcks12 commented Jul 13, 2020

same here.

Version: 3.1.0
Backend: AWS Secret Manager
Kubernetes: 1.14/EKS

btw when I looked into the source and its logs, there's one thing I don't understand.

{"level":30,"time":1594617499489,"pid":17,"hostname":"kubernetes-external-secrets-7bdfb45fc5-9t2c9","msg":"starting poller for infra-admin/infra-admin"

Every 2 minutes (give or take 10 seconds, I guess) it starts a new poller for every ExternalSecret CR.

Furthermore, there were many of the logs below just before it stopped working.

{"level":50,"time":1594369711022,"pid":17,"hostname":"kubernetes-external-secrets-7bdfb45fc5-bhzsm","code":429,"statusCode":429,"msg":"status check went boom for infra-admin/infra-admin","stack":"Error: Too many requests, please try again later.\n\n at /app/node_modules/kubernetes-client/backends/request/client.js:231:25\n at Request._callback (/app/node_modules/kubernetes-client/backends/request/client.js:168:14)\n at Request.self.callback (/app/node_modules/request/request.js:185:22)\n at Request.emit (events.js:210:5)\n at Request.EventEmitter.emit (domain.js:476:20)\n at Request. (/app/node_modules/request/request.js:1161:10)\n at Request.emit (events.js:210:5)\n at Request.EventEmitter.emit (domain.js:476:20)\n at IncomingMessage. (/app/node_modules/request/request.js:1083:12)\n at Object.onceWrapper (events.js:299:28)","type":"Error","v":1}

AFAIK, according to the source, the "starting poller" logs shouldn't be generated every 2 minutes.

I'm not sure if this will help you debug, though, because other people haven't mentioned this symptom.

@sleerssen

sleerssen commented Aug 28, 2020

Same here, although the last log message we see when it hangs is a connection timeout:

{"level":50,"time":1598563065478,"pid":17,"hostname":"external-secrets-649757555d-c2zcw","errno":"ETIMEDOUT","code":"ETIMEDOUT","syscall":"connect","address":"172.20.0.1","port":443,"msg":"failure while polling the secret foo/bar","stack":"Error: connect ETIMEDOUT 172.20.0.1:443\n    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1128:14)","type":"Error","v":1}

ES version: 3.1.0
Backend: AWS Param Store
AWS EKS 1.15

@chadlwilson
Contributor

This just happened to us on

Version: 4.2.0
Backend: AWS Secret Manager
Kubernetes: 1.15/EKS
Additional Components: kube2iam

This is the first time we have observed it being a problem though, in maybe a year of running in various environments.

No errors or weird things logged at all; it just seemed to stop polling on existing secrets, and also didn't take any action on a new ExternalSecret resource created within the cluster.

Short of resolving the root problem, is there a way that internals could be exposed (perhaps on the metrics port) so that a K8S liveness check could detect this problem and have K8S restart the pod automatically to recover? Currently we plan to just alert if the sync_calls metric stops increasing, but that requires manual intervention to resolve.
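
For example, something along these lines (just a sketch: the sync_calls counter name is the one mentioned above, and the window, duration and severity are placeholders that would need tuning for the configured polling interval):

# Sketch of a Prometheus alert for a stalled kubernetes-external-secrets instance.
# Assumes the sync_calls counter is scraped; window and severity are placeholders.
groups:
  - name: kubernetes-external-secrets
    rules:
      - alert: ExternalSecretsSyncStalled
        expr: sum(increase(sync_calls[15m])) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: kubernetes-external-secrets has stopped recording sync attempts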

@chromiumdioxide

We have run into this very issue.
Version: 5.1.0
Backend: AWS Secret Manager
Kubernetes: 1.17.9/EKS

The last log line was the following (I starred out the secret name):

{"level":30,"time":1598656979570,"pid":17,"hostname":"kubernetes-external-secrets-67b8b85477-gb9z7","msg":"upserting secret ********"}

After that the app hung and nothing more was logged until the pod was restarted. Is there a way to check the container health while it is running (other than inspecting logs) that can be attached to a k8s liveness probe to work around this issue?

@gofman8

gofman8 commented Oct 27, 2020

Same issue on version 6.0.0

@moolen
Member

moolen commented Oct 27, 2020

To sum this up: it happens from version 3.1.0 to 6.0.0, and the issue is independent of any specific backend implementation or Kubernetes version.

Can someone consistently reproduce the issue? Did anyone run kes with loglevel=debug?

I suspect the code linked below may be getting called somehow (or at least that's the only thing that would cause such behavior).

https://github.com/godaddy/kubernetes-external-secrets/blob/c7849f028842c6d3c92a975e93cc41cdcb19c5a7/lib/daemon.js#L40-L45
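
For illustration only (this is not the actual daemon.js code), the sketch below shows the kind of silent hang you get when a for await loop consumes an event stream whose underlying watch dies without surfacing an error: the loop neither throws nor ends, it simply blocks on the next event forever.

// Illustrative sketch, not the real daemon.js: a consumer of an async event stream.
// If the watch behind the generator silently dies, it never yields again, never
// throws and never returns, so the loop below just blocks forever.
async function * watchEvents () {
  yield { type: 'ADDED', name: 'example-secret' }
  // Simulates a dead watch: the connection is gone but no error is surfaced.
  await new Promise(() => {})
}

async function run () {
  for await (const event of watchEvents()) {
    console.log('processing', event.type, event.name) // stops appearing once the watch dies
  }
  console.log('event loop ended') // never reached, and nothing is thrown either
}

run()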

@Flydiverny
Member

I think one of the main issues is that the watcher eventually hangs or gives up. We would probably need to swap the dependency used for Kubernetes API consumption to something like https://github.com/kubernetes-client/javascript and use its included informer API (https://github.com/kubernetes-client/javascript/blob/master/src/informer.ts).
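
Roughly what I have in mind, as a sketch only (assuming @kubernetes/client-node and the kubernetes-client.io/v1 ExternalSecret CRD; exact method signatures differ between client versions, and the poller wiring is just logged here):

// Sketch: watching ExternalSecrets with the official client's informer instead of
// the current watcher. Assumes @kubernetes/client-node and the kubernetes-client.io/v1
// CRD group; adjust paths and signatures for the client version actually used.
const k8s = require('@kubernetes/client-node')

const kc = new k8s.KubeConfig()
kc.loadFromDefault()

const customApi = kc.makeApiClient(k8s.CustomObjectsApi)
const listFn = () => customApi.listClusterCustomObject('kubernetes-client.io', 'v1', 'externalsecrets')

const informer = k8s.makeInformer(kc, '/apis/kubernetes-client.io/v1/externalsecrets', listFn)

informer.on('add', es => console.log('start poller for', es.metadata.namespace, es.metadata.name))
informer.on('update', es => console.log('restart poller for', es.metadata.name))
informer.on('delete', es => console.log('stop poller for', es.metadata.name))
informer.on('error', err => {
  // Unlike the current watcher, restart explicitly instead of hanging silently.
  console.error('informer error, restarting in 5s', err)
  setTimeout(() => informer.start(), 5000)
})

informer.start()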

@kotire

kotire commented Oct 28, 2020

Same issue with v6.0.0 ES, Vault 1.4.0 and Kube v1.17.4. Only a restart of the ES pod fixes this. It syncs secrets from Vault normally for anywhere from 1 to 7 hours, then stops. No informative logs with log_level: debug.

@moolen
Member

moolen commented Oct 31, 2020

I think I managed to (kind of?) reproduce the issue with minikube; here's what I did:

  1. Start minikube: minikube start
  2. Deploy kes: helm install ./charts/kubernetes-external-secrets/ --generate-name
  3. Create an arbitrary external secret: kubectl apply -f ./examples/hello-service-external-secret.yml
  4. Get the pod IP: kubectl get po -l app.kubernetes.io/name=kubernetes-external-secrets -o wide (in my case 172.17.0.3)

In one shell, tail the logs of the kes application; it will continuously spill out logs. Expect sync errors here, since no real backend is configured in this setup.

SSH into minikube and list the TCP connections between kes and the API server; there should be one in the ESTABLISHED state:

$ sudo conntrack -L conntrack -s 172.17.0.3 -d 10.96.0.1 | grep EST
conntrack v1.4.5 (conntrack-tools): 3 flow entries have been shown.
tcp      6 86387 ESTABLISHED src=172.17.0.3 dst=10.96.0.1 sport=39042 dport=443 src=192.168.39.12 dst=172.17.0.3 sport=8443 dport=39042 [ASSURED] mark=0 use=1

If I drop the connection with conntrack -D conntrack -s 172.17.0.3 -d 10.96.0.1, kes stops spilling out logs, no error is shown, and the sync_calls counter stops incrementing. With tcpdump I see TCP resets (which are expected) but no reconnect attempts from the node client.

If the API server is unavailable or restarts abruptly, kes restarts immediately (which is fine, I guess).

If I'm not mistaken, we would have to generate a proper kubernetes client spec before being able to use List/Watch, right? See kubernetes-client/javascript#341.

@Flydiverny
Member

Flydiverny commented Nov 1, 2020

@moolen should work out of the box.
Took a stab at using the informer here #531 :)
Sadly it also seems to hang when doing the same reproduction as above.

Edit:

Dug around some more in the various issues in the javascript client and eventually landed on something by walmartlabs:
https://github.com/walmartlabs/cookie-cutter/blob/cf102b33ecf987749e692d1550448eab8efa7afd/packages/kubernetes/src/KubernetesWatchSource.ts#L106-L111
They reimplement some of the informer logic and have a timeout that restarts the watch if it doesn't receive any events within X time. It's a bit sad, but I suppose it's a solution.
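
The idea, roughly, looks like this (a sketch, not their code; startWatch(onEvent) is a hypothetical helper that begins a watch, feeds it events, and returns a function that aborts the underlying request):

// Watchdog sketch: restart the watch whenever no event has arrived for IDLE_TIMEOUT_MS.
const IDLE_TIMEOUT_MS = 5 * 60 * 1000 // placeholder threshold

function superviseWatch (startWatch, handleEvent) {
  let abortWatch = null
  let idleTimer = null

  const resetIdleTimer = () => {
    clearTimeout(idleTimer)
    idleTimer = setTimeout(restart, IDLE_TIMEOUT_MS)
  }

  const restart = () => {
    if (abortWatch) abortWatch()
    start()
  }

  const start = () => {
    resetIdleTimer()
    abortWatch = startWatch(event => {
      resetIdleTimer() // any activity proves the watch is still alive
      handleEvent(event)
    })
  }

  start()
}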

@moolen
Member

moolen commented Nov 1, 2020

I see, I'm not so familiar with the node client 😅
The walmartlabs solution you found will solve this issue, great!

@kotire

kotire commented Nov 3, 2020

The solution from this PR #532 helped me.

@MPV

MPV commented Nov 15, 2020

Is there anyone here who has put any alerting in place to be notified when KES stops working like this?

I'm trying to figure out a resilient way of detecting when this happens (for example by using the Prometheus metrics for KES).

@moolen
Member

moolen commented Nov 15, 2020 via email

@dudicoco

dudicoco commented Dec 4, 2020

I've resolved this by implementing a liveness probe, described here: #266 (comment)

@MPV

MPV commented Dec 5, 2020

Would we be okay with implementing that kind of liveness probe here (in the helm chart for KES)?

https://github.com/external-secrets/kubernetes-external-secrets/blob/master/charts/kubernetes-external-secrets/templates/deployment.yaml

@moolen
Member

moolen commented Dec 5, 2020

I think it would be helpful to have it configurable through values for now as a workaround. We should work on implementing a proper /healthz endpoint that detects a broken/stale reconcile loop (more on that in #266).
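
A rough sketch of what such an endpoint could look like (not a concrete implementation proposal; the port and staleness threshold are placeholders): record a timestamp whenever a poll or watch event is processed, and fail the probe once that timestamp gets too old.

// Sketch of a /healthz endpoint that detects a stale reconcile loop: the poller /
// watch handler calls markActivity() on every successful iteration, and the probe
// fails once nothing has happened for STALE_AFTER_MS.
const http = require('http')

const STALE_AFTER_MS = 10 * 60 * 1000 // placeholder, should exceed the poller interval
let lastActivity = Date.now()

function markActivity () {
  lastActivity = Date.now()
}

http.createServer((req, res) => {
  if (req.url === '/healthz') {
    const stale = Date.now() - lastActivity > STALE_AFTER_MS
    res.writeHead(stale ? 500 : 200)
    res.end(stale ? 'stale' : 'ok')
    return
  }
  res.writeHead(404)
  res.end()
}).listen(8080) // placeholder port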

@Flydiverny
Member

#532 has been merged to master and released with 6.1.0
