The operator hangs and stops polling/upserting secrets #362
Anything interesting in the last logs produced before it stops? Does there seem to be any connection, i.e. does it always stop at the same place? Not sure where to start digging for this one. Seeing how all network traffic kind of dies, it seems reasonable that the watcher somehow disconnects and dies or just hangs. The next time you get into this state, can you try creating a new external secret or editing an existing one and see if that yields any reaction? |
Hey @Flydiverny, nothing interesting in the logs, as I already mentioned. Also, the log entry is different every time and it's about upserting a random secret.
Tried adding an ExternalSecret CR and that's how I figured out it's broken. It didn't reconcile it at all. |
Hi, I have the same problem as this issue. When the operator (3.1.0) has been running for a long time, it hangs and stops polling and upserting secrets, but I can't see any logs related to this situation ㅠ.ㅠ |
Hi, we have the same problem with the GCP backend.
|
same issue
|
Seems widespread, as we are also seeing it. Version: 4.0. As mentioned above, it stops with no interesting logs or statuses. Killing the pod will generate a new pod that operates fine for several hours or days before freezing again. |
same here.
btw, when I looked into the source and its logs, there's one thing I don't understand.
Every 2 minutes (±10 secs, I guess) it starts a new poller for every ExternalSecret CRD. Furthermore, there were many of the logs below just before it stopped working.
AFAIK, according to the source, the poller-start logs shouldn't be generated every 2 minutes. Well, I'm not sure if it'll help you debug, because other people haven't mentioned this symptom, though. |
Same here, although the last log message we see when it hangs is a connection timeout:
ES version: 3.1.0 |
This just happened to us on
First time we have observed this being a problem though, in maybe a year of running it in various environments. No errors or weird things were logged at all; it just seemed to stop polling existing secrets, and it also didn't take any action on a new ExternalSecret.
Short of resolving the root problem, is there a way that internals could be exposed (perhaps on the metrics port) so that a K8S liveness probe could detect this state? |
We have run into this very issue. The last log line was the following (I starred out the secret name):
After that the app hung and nothing more was logged until the pod was restarted. Is there a way to check the container health while it is running (other than inspecting the logs) that could be attached to a k8s liveness probe to work around this issue? |
Same issue on version 6.0.0 |
To sum this up, it is happening from version Can someone consistently reproduce the issue? Did someone run kes with I suspect this may be called somehow (or at least that's the only thing that would cause such behavior). |
Think one of the main issues is that the watcher eventually hangs or gives up. We would probably need to swap the dependency used for Kubernetes API consumption to something like https://github.com/kubernetes-client/javascript and use the included informer API: https://github.com/kubernetes-client/javascript/blob/master/src/informer.ts |
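For illustration, a minimal, untested sketch of what watching the ExternalSecret resources through that informer could look like. This is not KES code: the CRD group/version/plural (kubernetes-client.io/v1, externalsecrets) are assumed from the KES CRD, the handler bodies are placeholders, and the exact client signatures vary between @kubernetes/client-node releases.

```js
const k8s = require('@kubernetes/client-node');

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const customApi = kc.makeApiClient(k8s.CustomObjectsApi);

// Assumed CRD coordinates for kubernetes-external-secrets.
const GROUP = 'kubernetes-client.io';
const VERSION = 'v1';
const PLURAL = 'externalsecrets';

// The list function must return { response, body }, which listClusterCustomObject does.
const listFn = () => customApi.listClusterCustomObject(GROUP, VERSION, PLURAL);

const informer = k8s.makeInformer(kc, `/apis/${GROUP}/${VERSION}/${PLURAL}`, listFn);

informer.on('add', (es) => console.log('upsert', es.metadata.namespace, es.metadata.name));
informer.on('update', (es) => console.log('upsert', es.metadata.namespace, es.metadata.name));
informer.on('delete', (es) => console.log('remove', es.metadata.namespace, es.metadata.name));

// The key difference from a bare watch: when the watch connection dies,
// restart the informer instead of silently giving up.
informer.on('error', (err) => {
  console.error('watch error, restarting informer in 5s', err);
  setTimeout(() => informer.start(), 5000);
});

informer.start();
```

The relevant property here is that the informer re-lists and re-establishes the watch on errors, rather than depending on a single long-lived watch connection that can die silently.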
Same issue with ES v6.0.0, Vault 1.4.0 and Kube v1.17.4. Only a restart of the ES pod fixes this. It can sync secrets from Vault normally for anywhere from 1 to 7 hours, then it stops. No informative logs with log_level: debug. |
I think I managed to (kind of?) reproduce the issue with minikube. Here's what I did:
In one shell, tail the logs of the kes application. It will continuously spill out logs. Expect sync errors here. Then ssh into minikube and list the TCP connections between kes and the API server; there should be one in ESTABLISHED state:

```
$ sudo conntrack -L conntrack -s 172.17.0.3 -d 10.96.0.1 | grep EST
conntrack v1.4.5 (conntrack-tools): 3 flow entries have been shown.
tcp 6 86387 ESTABLISHED src=172.17.0.3 dst=10.96.0.1 sport=39042 dport=443 src=192.168.39.12 dst=172.17.0.3 sport=8443 dport=39042 [ASSURED] mark=0 use=1
```

If I drop that connection, kes stops syncing and never recovers, which matches the behavior described above. If the API server is unavailable or restarts abruptly, kes restarts immediately (which is fine, I guess). If I'm not mistaken, we would have to generate a proper kubernetes client spec before being able to use List/Watch, right? kubernetes-client/javascript#341. |
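The drop itself would presumably be done with conntrack's delete mode against the same flow; something along these lines (the addresses come from the listing above and are specific to that minikube setup):

```
# Delete the tracked connection between the kes pod and the API server.
# Packets on the old TCP stream are then no longer recognized by conntrack,
# so the client can end up waiting indefinitely on an effectively dead connection.
$ sudo conntrack -D -s 172.17.0.3 -d 10.96.0.1
```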
@moolen it should work out of the box. Edit: Dug around some more in the various issues in the javascript client and eventually landed on something by walmartlabs. |
I see, I'm not so familiar with the node client 😅 |
The solution from this PR #532 helped me. |
Is there anyone here who has put any alerting in place to be notified when KES stops working like this? I'm trying to figure out a resilient way of detecting when this happens (for example by using the Prometheus metrics for KES). |
Using an increase of kubernetes_external_secrets_sync_calls_count should do it. Note: you probably get a false alarm if you delete all external secrets.
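As a starting point, here is a sketch of such an alert as a Prometheus Operator PrometheusRule. Only the metric name is taken from the comment above; the alert name, the 15-minute window, the labels and the severity are assumptions to tune per environment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-external-secrets
spec:
  groups:
    - name: kubernetes-external-secrets
      rules:
        - alert: ExternalSecretsSyncStalled
          # Fires when KES has not recorded a single sync call for 15 minutes.
          expr: sum(increase(kubernetes_external_secrets_sync_calls_count[15m])) == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: kubernetes-external-secrets appears to have stopped syncing secrets
```

As noted above, this also fires if all ExternalSecret resources are deliberately removed (and it stays silent if the metric has never been exported), so it is best treated as a warning rather than a paging alert.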
|
I've resolved this by implementing a liveness probe, described here: #266 (comment) |
Would we be okay with implementing that kind of liveness probe here (in the helm chart for KES)? |
I think it would be helpful to have it configurable through values for now as a workaround. We should work on implementing a proper |
#532 has been merged to master and released with 6.1.0 |
While the operator runs for some time, it suddenly stops processing. The sync_calls Prometheus metric stops accumulating, no new logs are produced, there are no errors in the logs, and no new or old secrets are processed any more. In order to mitigate the issue, one has to restart the operator. Prometheus metrics for CPU, memory and network traffic around the time of the hang:
CPU: [graph]
Memory: [graph]
Traffic: [graph]