envoy pods permanently unready #3192
Are your Contour pod(s) healthy? Are there any logs in there? The readiness probe is looking at a built-in Envoy status endpoint, so it seems like either they've lost connection to Contour somehow or there's some sort of configuration they've encountered that caused them to hit an error condition.
The contour pods are healthy and have been for weeks since I upgraded to 1.10. My logs have this line repeated thousands of times, with no other logs interspersed.
The HTTPProxy referenced in that log line is a simple delegated proxy.
Could you port-forward to one of the unhealthy Envoys when this happens next? Also curious what the status code of the readiness endpoint is.
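For anyone following along, a minimal sketch of that check, assuming the Envoy admin interface is on port 9001 inside the pod (consistent with the localhost:9001 URL below) and that the pods live in the projectcontour namespace:

```sh
# Forward the Envoy admin port from one of the unhealthy pods
# (pod name is a placeholder; adjust the namespace to your install).
kubectl -n projectcontour port-forward pod/<unhealthy-envoy-pod> 9001:9001 &

# Hit the readiness endpoint the probe checks; a healthy Envoy answers 200,
# a draining one answers 503.
curl -i http://localhost:9001/ready
```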
Interesting, http://localhost:9001/ready is returning a 503 with a DRAINING body. I dumped the output as well. I thought I had posted this before, and it was probably inferred, but Contour does mark my HTTPProxy config as valid.
Interesting. Could you share the output of the config dump?
I upgraded to Contour 1.11.0 and Envoy 1.16.2 wondering if that would solve anything, but after a couple of days, I'm back to a draining envoy pod. Yes, 1/2 pods are showing ready. Here's the config_dump data.
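As a side note for anyone reproducing this, the config dump (and Envoy's own view of its state) can be pulled straight off the admin interface once a port-forward like the one above is in place; these are standard Envoy admin endpoints:

```sh
# Save the full configuration dump for sharing (redact secrets before posting).
curl -s http://localhost:9001/config_dump > config_dump.json

# server_info includes a top-level "state" field, which should read LIVE on a
# healthy instance and DRAINING on one that is failing its readiness probe.
curl -s http://localhost:9001/server_info | grep '"state"'
```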
So, a few questions for you:
(I should note that most of the maintainers are on leave at the moment, so we may be a little slow in getting back to you on this over the holiday season.)
preStop hooks are the worst to debug since the logs just disappear into nothing. No worries on the timeline. This isn't an urgent issue, but one that seems worth resolving.
Checking back in on this since I'm back from leave - thanks for putting this one into our "Needs Investigation" queue @skriss. Looks to me like we need to figure out why the Envoy is going into a draining state.
I don't see any logs coming out of the shutdown manager. I added the envoy logs to my gist; they're pretty small. It may or may not be relevant, but we're not running any traffic through contour yet, so the requests that are in the logs are just random internet spam. Maybe the containers aren't responding well to minimal traffic?
I am running into this same issue as well. We're running Contour on 40 nodes and just a couple of them will go into this same state, with the readiness probes failing with a 503 response. Let me know if I can help in any way.
@McShauno Can you post your infrastructure / version details?
Thanks for the logs @derekperkins, it does seem really weird that the Envoys are flipping into Draining, and it looks like in the logs you posted, it's only a few minutes after starting that something changes (when the config streams are cut). I'm wondering if something could be causing the gRPC connection to drop, and that's forcing the drain from the Envoy side? In any case, I just noticed that the Contour shutdown manager had 7 restarts, and had exited with an error. Given that's what triggers Envoy to flip into draining, it's probably worth investigating more there. If nothing else, we can put debug logging into our next release, so you can try and get some shutdown-manager logs.
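For anyone checking their own pods for the same signal, the per-container restart counts can be read straight from the pod status; a minimal sketch with a placeholder pod name:

```sh
# Print each container in the pod alongside its restart count.
kubectl -n projectcontour get pod <envoy-pod> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
```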
It's very possible, yes. I've seen things before where no traffic and no changes meant no xDS traffic, so connections would drop. Some time ago, we added gRPC keepalives for the xDS connection, so it's probably not that, but it may be related to something similar.
We rolled contour into production yesterday after some final debugging and re-rolling envoy a few times, so all pods are currently healthy. We'll see if having some traffic impacts pods getting into this draining state. We also implemented external auth, so that may cause different behavior as well. Happy to roll out shutdown-manager debug or whatever else makes sense to help find the root cause.
Thanks @derekperkins, keep us posted, and I'll see what we can do about debug info for the shutdown-manager.
There are a bunch of log statements in shutdown-manager that should show what's going on, but there could be some reasons why those logs are empty. Could you look at the logs of the last restarted shutdown-manager container? If there are logs there, it would tell us that this is the case. Since there are restart events on the shutdown-manager container, it could be that the shutdown-manager failed its liveness probe and was restarted by the kubelet. If that's the case and the kubelet restarted that container, the shutdown sequence may have been triggered as a side effect. I'd be curious about the logs of the restarted container, and maybe try taking out the liveness probe on the shutdown-manager? I'm curious as to why it might be failing, but it would also help with understanding whether shutdown-manager is actually the root cause.
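A rough sketch of pulling those logs, assuming the container is the shutdown-manager sidecar in the envoy pod from the example daemonset:

```sh
# Logs from the current shutdown-manager container.
kubectl -n projectcontour logs <envoy-pod> -c shutdown-manager

# Logs from the previous instance, i.e. the one the kubelet restarted.
kubectl -n projectcontour logs <envoy-pod> -c shutdown-manager --previous
```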
@derekperkins also, do you have a log line showing that the shutdown-manager was started? It logs a startup line when it first starts.
Yes, I have that log for all the shutdown-manager containers I've checked. Looking at past logs, the only other thing I'm seeing in shutdown-manager is the shutdown sequence being logged.
Ahh ok, so that confirms that the shutdown sequence did happen. I suspect it's because of the kubelet killing the pod because of the liveness probe. It somehow failed that probe and the container got restarted. I'd dump that probe and see if the problem comes back.
I posted in #3286 (comment) about some of the logging issues, but something to keep in mind is that the logs from the preStop hook container exec won't be visible in the container logs. We'll need to see if there's a better way to get at them (other than events), but for now I'd be interested in whether the issues go away if you take out the liveness probe. Or, if the issue happens again, could you pull the events from your cluster and see if they show anything interesting?
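For reference, a sketch of pulling the events for a single stuck pod (pod name and namespace are placeholders):

```sh
# Events capture probe failures and exec/preStop hook errors even when the hook's
# own output never makes it into the container logs.
kubectl -n projectcontour get events \
  --field-selector involvedObject.name=<stuck-envoy-pod> \
  --sort-by=.lastTimestamp
```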
Apologies for the delayed response. We are running PKS/TKGI 1.9.1 on 50 nodes with Contour v1.11.0. Kubernetes version v1.18.8+vmware.1.
Same issue on kubeadm 1.20.4 vanilla:
level=info msg="file /ok does not exist; checking again in 1s" context=shutdownReadyHandler
@McShauno, @fjudith, @derekperkins, there have been a few Contour versions since we last heard anything on this issue, have you seen these problems recently?
@youngnick I will upgrade and let you know within the next couple of days.
I can't remember exactly when it disappeared, but this has been solved for a while for us.
Thanks for the updates @derekperkins and @McShauno. I'll wait for a few days for a check-in from @McShauno before closing this one, but looks like it's fixed somehow.
Actually, it seems we might still be having the same problem. I upgraded to the latest version about 48h ago and here is the current pod list: you'll notice there is one pod towards the bottom that is not showing ready.
This is the same issue we have been seeing in the past and also one of the reasons we haven't started using Contour for all apps that need ingress.
Just to confirm @McShauno, that one Envoy pod is responding to the readiness probe with a 503? Just as a reminder, the normal shutdown flow has a few steps.
So there are a few places that this could go wrong. In order for us to find the place, we're going to need to go pretty deep on the pod that has the problem. Could you forward the admin interface of the pod, and check the configuration dump? (Some help is at https://projectcontour.io/docs/v1.16.0/troubleshooting/envoy-admin-interface/ if you need it). Also, if you could check the logs for the containers, including Envoy and the shutdown manager, that would be helpful.
I'm currently investigating envoy pod container restart alerts from our cluster and discovered this issue. I'm seeing many of the same symptoms.
hmm, that is a good point, and it would make sense that the shutdown-manager is not resilient against something like that. Do you have any idea how the envoy container was restarted without the whole pod being restarted?
@youngnick I can confirm the same behavior. I'm able to reproduce the issue in a multi-node kind cluster. To me it looks like a sync issue whenever the pod restarts, either due to a docker restart or a machine reboot. Let me know if you need any further details.
Thanks for that @calshankar, I will try and repro myself as well.
@youngnick when the liveness probe of a container fails, only that container is restarted; the pod is not restarted for such an event.
Thanks @youngnick. I will get this information together this week.
@calshankar could you share your notes on how to reproduce in kind?
@stevesloka @youngnick this is not an issue with Contour as such: kubernetes-sigs/kind#2045. Not sure whether this is fixed or documented on the landing page for multi-node clusters; unfortunately kind does not provide/ensure a stable IP for the control plane because it is not behind DNS. Ideally it should use DNS at least for the control plane components. This issue can crop up whenever docker or the machine reboots 🙏, so don't restart/reboot or recreate the kind cluster 😄. Not an issue from my side.
I think I might have this behaviour in an Azure Kubernetes Service cluster using k8s 1.19.9. I'll try upgrading k8s to see if that changes anything. Diags follow, a few IPs/names redacted. EDIT: 15 hours after upgrading this AKS cluster to k8s 1.20.7, all of the Envoy pods are healthy with 0 restarts. I'll update again if that changes. Things that might be useful:
I have encountered the same issue on our multi-node kops cluster on AWS. It's a 4-node cluster; in the attached screenshot, 3 of 4 envoy pods are unready. All the envoy containers of these pods are failing with the same error reported by @derekperkins for this issue, with the event message "Readiness probe failed: HTTP probe failed with statuscode: 503". I could actually get the configuration dump from the admin interface, but cannot share it since it has a lot of information that needs security clearance from my organisation. Not sure if there is a fix for this? I could only think of the possible reasons from the discussion above.
Update: 15 hours after upgrading my AKS cluster to k8s 1.20.7, all of the Envoy pods are healthy with 0 restarts. Perhaps that fixed it for me.
I encountered the same problem when I renewed the certificates of a cluster managed by kubeadm, after rerunning the job.
Looking at this issue, it seems like this problem is basically a symptom that something unrecoverable is wrong with your Contour install. From what I can see, the initial problems are all resolved now as well. Are any watchers of this issue still having outstanding problems that haven't been resolved?
@youngnick were my comments (#3192 (comment), #3192 (comment)) addressed? I did not see them referenced anywhere.
@youngnick Since upgrading Kubernetes in my cluster to 1.20.x, I have seen a total of two Envoy restarts, as opposed to thousands before. My Contour/Envoy install is directly from
@Legion2 thanks, I was having trouble tracking everyone's problems since we're kinda all smooshed together here. Give me a sec to review.
@Legion2 I actually think that this issue is not the right place to work on your problem, since it doesn't sound like you have the problem of Envoy pods being stuck at unready? If not, we should move your discussion to a new issue, I think. In that issue, we'll need a (redacted) copy of your Envoy YAML, so we can try and reproduce. Looking at our example YAMLs, I'm not sure how the situation you describe could arise, but there is a possibility we may need to handle the additional case in the shutdown manager. I apologise for not requesting this earlier and having your comments get lost in the shuffle, but we should be able to get onto them with a new issue. I'll give this another day or so for others to respond, but if no one else has any "Envoy is stuck as unready" problems, I'll close this one out.
We added resource requests to the containers, which reduced the health probe failures and container restarts. I will open a new issue if we face increased error rates again.
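For anyone wanting to try the same mitigation, a sketch of adding requests with kubectl; the daemonset and container names are assumed from the example manifests and the values are placeholders, not recommendations:

```sh
# Give the envoy and shutdown-manager containers explicit CPU/memory requests so
# they are less likely to be starved on busy nodes (values are illustrative only).
kubectl -n projectcontour set resources daemonset/envoy \
  --containers=envoy --requests=cpu=250m,memory=256Mi
kubectl -n projectcontour set resources daemonset/envoy \
  --containers=shutdown-manager --requests=cpu=25m,memory=32Mi
```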
Thanks very much @Legion2, I'll close this one out then. If anyone else finds this issue, please log a fresh issue, and link back to here if you think it seems similar.
@youngnick @stevesloka we are facing the same issue whenever we upgrade the CNI plugin. The readiness probe of the envoy pod responds with status code 503 and body "DRAINING". Contour version: v1.20.1. The installation was done using kubectl apply -f https://projectcontour.io/quickstart/contour.yaml. The following logs are repeatedly printed:
Shutdown-manager hook logs
Status of envoy pods:
Thanks for the report and the logs @phaniraj75. Would you please be able to log this as a separate issue though? That way we can avoid updating everyone involved in this train.
What steps did you take and what happened:
Individual pods in the envoy daemonset start failing readiness checks for no apparent reason and never recover. I'm running a pretty vanilla contour 1.10 setup from the example folder. The only extra flag I added was --root-namespaces=projectcontour. Deleting the pod seems to resolve the problem. I'm running envoy as a daemonset across 36 nodes, and it seems like it takes about 48 hours until another pod becomes unresponsive.
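For reference, the workaround described above amounts to something like the following (the label selector is assumed from the example daemonset):

```sh
# Find envoy pods that are Running but not Ready.
kubectl -n projectcontour get pods -l app=envoy

# Deleting a stuck pod lets the daemonset recreate it, which clears the condition
# until it recurs.
kubectl -n projectcontour delete pod <stuck-envoy-pod>
```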
What did you expect to happen:
I know that there isn't a livenessProbe configured by default, but I would expect that somewhere before 31k failed readinessProbes the pod would restart or otherwise heal itself.
Anything else you would like to add:
Here are the only warning logs I see from envoy, but there are no errors and this happens occasionally without halting execution.
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T18:49:28Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.14-gke.400", GitCommit:"b00b302d1576bfe28ae50661585cc516fda2227e", GitTreeState:"clean", BuildDate:"2020-11-19T09:22:49Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
OS (e.g. from /etc/os-release): cos-containerd