This repository has been archived by the owner on Jul 26, 2022. It is now read-only.

The operator hangs and stops polling/upserting secrets #362

Closed
vasrem opened this issue Apr 28, 2020 · 24 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@vasrem
Contributor

vasrem commented Apr 28, 2020

After the operator has been running for some time, it suddenly stops processing. The sync_calls Prometheus metric stops accumulating, no new logs are produced, there are no errors in the logs, and no new or existing secrets are processed any more. To mitigate the issue, one has to restart the operator. Prometheus metrics for CPU, memory and network traffic are shown below.
CPU:
Screenshot from 2020-04-28 09-19-47
Memory:
Screenshot from 2020-04-28 09-21-36
Traffic:
Screenshot from 2020-04-28 09-21-25

  • Version: 3.1
  • Backend: AWS Secrets Manager
  • Kubernetes: 1.15
@Flydiverny added the bug label on Apr 28, 2020
@Flydiverny
Member

Anything interesting in the last logs produced before it stops? Does there seem to be any pattern, i.e. does it always stop at the same place? Not sure where to start digging for this one.

Seeing how all network traffic essentially dies, it seems reasonable that the watcher somehow disconnects and either dies or just hangs.

The next time you get this state can you try creating a new external secret or editing an existing one and see if that yields any reaction?

@vasrem
Contributor Author

vasrem commented Apr 28, 2020

Hey @Flydiverny, nothing interesting in the logs, as I already mentioned. Also, the last log entry is different every time and it's about upserting a random secret.

The next time you get this state can you try creating a new external secret or editing an existing one and see if that yields any reaction?

Tried adding an ExternalSecret CR and that's how I figured out it's broken. It didn't reconcile it at all.

@anarcher

anarcher commented May 6, 2020

Hi, I have the same problem. When the operator (3.1.0) has been running for a long time, it hangs and stops polling and upserting secrets, but I can't see any logs related to this situation ㅠ.ㅠ

@MasayaAoyama

Hi, we have the same problem with the GCP backend, so the bug is probably located in common code.

version: 3.1.0
backend: gcp

@2ffs2nns

same issue

Version: 3.1
Backend: AWS Secrets Manager
Kubernetes: 1.14/EKS

@jseriff

jseriff commented Jun 30, 2020

Seems widespread as we are also seeing it

Version: 4.0
Backend: Vault
Kubernetes: 1.16 GKE

As mentioned above, it stops with no interesting logs or statuses. Killing the pod generates a new pod that operates fine for several hours or days before freezing again.

@Flydiverny added the help wanted label on Jun 30, 2020
@ckcks12

ckcks12 commented Jul 13, 2020

same here.

Version: 3.1.0
Backend: AWS Secret Manager
Kubernetes: 1.14/EKS

btw when I looked into the source and its logs, there's one thing I don't understand.

{"level":30,"time":1594617499489,"pid":17,"hostname":"kubernetes-external-secrets-7bdfb45fc5-9t2c9","msg":"starting poller for infra-admin/infra-admin"

Every 2 minutes (give or take 10 seconds, I guess) it starts a new poller for every ExternalSecret CR.

Furthermore, there were many of the logs below just before it stopped working.

{"level":50,"time":1594369711022,"pid":17,"hostname":"kubernetes-external-secrets-7bdfb45fc5-bhzsm","code":429,"statusCode":429,"msg":"status check went boom for infra-admin/infra-admin","stack":"Error: Too many requests, please try again later.\n\n at /app/node_modules/kubernetes-client/backends/request/client.js:231:25\n at Request._callback (/app/node_modules/kubernetes-client/backends/request/client.js:168:14)\n at Request.self.callback (/app/node_modules/request/request.js:185:22)\n at Request.emit (events.js:210:5)\n at Request.EventEmitter.emit (domain.js:476:20)\n at Request. (/app/node_modules/request/request.js:1161:10)\n at Request.emit (events.js:210:5)\n at Request.EventEmitter.emit (domain.js:476:20)\n at IncomingMessage. (/app/node_modules/request/request.js:1083:12)\n at Object.onceWrapper (events.js:299:28)","type":"Error","v":1}

AFAIK, according to the source, the "starting poller" logs shouldn't be generated every 2 minutes.

I'm not sure if this will help you debug, though, because other people haven't mentioned this symptom.

@sleerssen

sleerssen commented Aug 28, 2020

Same here, although the last log message we see when it hangs is a connection timeout:

{"level":50,"time":1598563065478,"pid":17,"hostname":"external-secrets-649757555d-c2zcw","errno":"ETIMEDOUT","code":"ETIMEDOUT","syscall":"connect","address":"172.20.0.1","port":443,"msg":"failure while polling the secret foo/bar","stack":"Error: connect ETIMEDOUT 172.20.0.1:443\n    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1128:14)","type":"Error","v":1}

ES version: 3.1.0
Backend: AWS Param Store
AWS EKS 1.15

@chadlwilson
Contributor

This just happened to us on

Version: 4.2.0
Backend: AWS Secret Manager
Kubernetes: 1.15/EKS
Additional Components: kube2iam

This is the first time we have observed it being a problem though, in maybe a year of running in various environments.

No errors or weird things logged at all; it just seemed to stop polling on existing secrets, and also didn't take any action on a new ExternalSecret resource created within the cluster.

Short of resolving the root problem, is there a way that internals could be exposed (perhaps on the metrics port) so that a K8S liveness check could detect this problem and have K8S restart the pod automatically to recover? Currently we plan to just alert if the sync_calls metric stops increasing, but that requires manual intervention to resolve.
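
For example, something along these lines (just a sketch: the sync_calls counter name is the one mentioned above, and the window, duration and severity are placeholders that would need tuning for the configured polling interval):

# Sketch of a Prometheus alert for a stalled kubernetes-external-secrets instance.
# Assumes the sync_calls counter is scraped; window and severity are placeholders.
groups:
  - name: kubernetes-external-secrets
    rules:
      - alert: ExternalSecretsSyncStalled
        expr: sum(increase(sync_calls[15m])) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: kubernetes-external-secrets has stopped recording sync attempts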

@chromiumdioxide

We have run into this very issue.
Version: 5.1.0
Backend: AWS Secret Manager
Kubernetes: 1.17.9/EKS

The last log line was the following (I starred out the secret name):

{"level":30,"time":1598656979570,"pid":17,"hostname":"kubernetes-external-secrets-67b8b85477-gb9z7","msg":"upserting secret ********"}

After that the app hung and nothing more was logged until the pod was restarted. Is there a way to check the container health while it is running (other than inspecting logs) that can be attached to a k8s liveness probe to work around this issue?

@gofman8

gofman8 commented Oct 27, 2020

Same issue on version 6.0.0

@moolen
Member

moolen commented Oct 27, 2020

To sum this up: it happens from version 3.1.0 to 6.0.0, and the issue is independent of any specific backend implementation or Kubernetes version.

Can someone consistently reproduce the issue? Did anyone run kes with loglevel=debug?

I suspect the code linked below may be getting called somehow (or at least that's the only thing that would cause such behavior).

https://github.com/godaddy/kubernetes-external-secrets/blob/c7849f028842c6d3c92a975e93cc41cdcb19c5a7/lib/daemon.js#L40-L45
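
For illustration only (this is not the actual daemon.js code), the sketch below shows the kind of silent hang you get when a for await loop consumes an event stream whose underlying watch dies without surfacing an error: the loop neither throws nor ends, it simply blocks on the next event forever.

// Illustrative sketch, not the real daemon.js: a consumer of an async event stream.
// If the watch behind the generator silently dies, it never yields again, never
// throws and never returns, so the loop below just blocks forever.
async function * watchEvents () {
  yield { type: 'ADDED', name: 'example-secret' }
  // Simulates a dead watch: the connection is gone but no error is surfaced.
  await new Promise(() => {})
}

async function run () {
  for await (const event of watchEvents()) {
    console.log('processing', event.type, event.name) // stops appearing once the watch dies
  }
  console.log('event loop ended') // never reached, and nothing is thrown either
}

run()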

@Flydiverny
Member

I think one of the main issues is that the watcher eventually hangs or gives up. We would probably need to swap the dependency used for Kubernetes API consumption to something like https://github.com/kubernetes-client/javascript and use its included informer API (https://github.com/kubernetes-client/javascript/blob/master/src/informer.ts).
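
Roughly what I have in mind, as a sketch only (assuming @kubernetes/client-node and the kubernetes-client.io/v1 ExternalSecret CRD; exact method signatures differ between client versions, and the poller wiring is just logged here):

// Sketch: watching ExternalSecrets with the official client's informer instead of
// the current watcher. Assumes @kubernetes/client-node and the kubernetes-client.io/v1
// CRD group; adjust paths and signatures for the client version actually used.
const k8s = require('@kubernetes/client-node')

const kc = new k8s.KubeConfig()
kc.loadFromDefault()

const customApi = kc.makeApiClient(k8s.CustomObjectsApi)
const listFn = () => customApi.listClusterCustomObject('kubernetes-client.io', 'v1', 'externalsecrets')

const informer = k8s.makeInformer(kc, '/apis/kubernetes-client.io/v1/externalsecrets', listFn)

informer.on('add', es => console.log('start poller for', es.metadata.namespace, es.metadata.name))
informer.on('update', es => console.log('restart poller for', es.metadata.name))
informer.on('delete', es => console.log('stop poller for', es.metadata.name))
informer.on('error', err => {
  // Unlike the current watcher, restart explicitly instead of hanging silently.
  console.error('informer error, restarting in 5s', err)
  setTimeout(() => informer.start(), 5000)
})

informer.start()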

@kotire

kotire commented Oct 28, 2020

Same issue with v6.0.0 ES, Vault 1.4.0 and Kube v1.17.4. Only a restart of the ES pod fixes this. It syncs secrets from Vault normally for anywhere from 1 to 7 hours, then stops. No informative logs with log_level: debug.

@moolen
Member

moolen commented Oct 31, 2020

I think I managed to (kind of?) reproduce the issue with minikube; here's what I did:

  1. Start minikube: minikube start
  2. Deploy kes: helm install ./charts/kubernetes-external-secrets/ --generate-name
  3. Create an arbitrary external secret: kubectl apply -f ./examples/hello-service-external-secret.yml
  4. Get the pod IP: kubectl get po -l app.kubernetes.io/name=kubernetes-external-secrets -o wide (in my case 172.17.0.3)

In one shell, tail the logs of the kes application; it will continuously spill out logs. Expect sync errors here, since no real backend is configured in this setup.

SSH into minikube and list the TCP connections between kes and the API server; there should be one in the ESTABLISHED state:

$ sudo conntrack -L conntrack -s 172.17.0.3 -d 10.96.0.1 | grep EST
conntrack v1.4.5 (conntrack-tools): 3 flow entries have been shown.
tcp      6 86387 ESTABLISHED src=172.17.0.3 dst=10.96.0.1 sport=39042 dport=443 src=192.168.39.12 dst=172.17.0.3 sport=8443 dport=39042 [ASSURED] mark=0 use=1

If I drop the connection with conntrack -D conntrack -s 172.17.0.3 -d 10.96.0.1, kes stops spilling out logs, no error is shown, and the sync_calls counter stops incrementing. With tcpdump I see TCP resets (which are expected) but no reconnect attempts from the node client.

If the API server is unavailable or restarts abruptly, kes restarts immediately (which is fine, I guess).

If I'm not mistaken, we would have to generate a proper kubernetes client spec before being able to use List/Watch, right? See kubernetes-client/javascript#341.

@Flydiverny
Member

Flydiverny commented Nov 1, 2020

@moolen should work out of the box.
Took a stab at using the informer here #531 :)
Sadly it also seems to hang when doing the same reproduction as above.

Edit:

Dug around some more in the various issues in the javascript client and eventually landed on something by walmartlabs:
https://github.com/walmartlabs/cookie-cutter/blob/cf102b33ecf987749e692d1550448eab8efa7afd/packages/kubernetes/src/KubernetesWatchSource.ts#L106-L111
They reimplement some of the informer logic and have a timeout that restarts the watch if it doesn't receive any events within X time. It's a bit sad, but I suppose it's a solution.
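
The idea, roughly, looks like this (a sketch, not their code; startWatch(onEvent) is a hypothetical helper that begins a watch, feeds it events, and returns a function that aborts the underlying request):

// Watchdog sketch: restart the watch whenever no event has arrived for IDLE_TIMEOUT_MS.
const IDLE_TIMEOUT_MS = 5 * 60 * 1000 // placeholder threshold

function superviseWatch (startWatch, handleEvent) {
  let abortWatch = null
  let idleTimer = null

  const resetIdleTimer = () => {
    clearTimeout(idleTimer)
    idleTimer = setTimeout(restart, IDLE_TIMEOUT_MS)
  }

  const restart = () => {
    if (abortWatch) abortWatch()
    start()
  }

  const start = () => {
    resetIdleTimer()
    abortWatch = startWatch(event => {
      resetIdleTimer() // any activity proves the watch is still alive
      handleEvent(event)
    })
  }

  start()
}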

@moolen
Member

moolen commented Nov 1, 2020

I see, I'm not so familiar with the node client 😅
The walmartlabs solution you found will solve this issue, great!

@kotire

kotire commented Nov 3, 2020

The solution from this PR #532 helped me.

@MPV

MPV commented Nov 15, 2020

Is there anyone here who has put any alerting in place to be notified when KES stops working like this?

I'm trying to figure out a resilient way of detecting when this happens (for example by using the Prometheus metrics for KES).

@moolen
Member

moolen commented Nov 15, 2020 via email

@dudicoco

dudicoco commented Dec 4, 2020

I've resolved this by implementing a liveness probe, described here: #266 (comment)

@MPV

MPV commented Dec 5, 2020

Would we be okay with implementing that kind of liveness probe here (in the helm chart for KES)?

https://github.com/external-secrets/kubernetes-external-secrets/blob/master/charts/kubernetes-external-secrets/templates/deployment.yaml

@moolen
Member

moolen commented Dec 5, 2020

I think it would be helpful to have it configurable through values for now as a workaround. We should work on implementing a proper /healthz endpoint that detects a broken/stale reconcile loop (more on that in #266).
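
A rough sketch of what such an endpoint could look like (not a concrete implementation proposal; the port and staleness threshold are placeholders): record a timestamp whenever a poll or watch event is processed, and fail the probe once that timestamp gets too old.

// Sketch of a /healthz endpoint that detects a stale reconcile loop: the poller /
// watch handler calls markActivity() on every successful iteration, and the probe
// fails once nothing has happened for STALE_AFTER_MS.
const http = require('http')

const STALE_AFTER_MS = 10 * 60 * 1000 // placeholder, should exceed the poller interval
let lastActivity = Date.now()

function markActivity () {
  lastActivity = Date.now()
}

http.createServer((req, res) => {
  if (req.url === '/healthz') {
    const stale = Date.now() - lastActivity > STALE_AFTER_MS
    res.writeHead(stale ? 500 : 200)
    res.end(stale ? 'stale' : 'ok')
    return
  }
  res.writeHead(404)
  res.end()
}).listen(8080) // placeholder port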

@Flydiverny
Member

#532 has been merged to master and released with 6.1.0
