Destination controller goes into failfast mode #11279
Just hit this issue again with …
Another data point: Maybe some race condition that locks up the entire destination controller process? Only 5 of the 10 pods had this issue and were rebooted by the script.
Hi @bjoernw, sorry for the delay in responding to this. There are a couple of things that would be super helpful for digging deeper into what's going on here. I see that you've included some of the proxy logs from the destination controller pod. It would also be great to see logs from the destination and policy containers in that pod. Wherever possible, getting full logs rather than snippets will help establish context. It would likewise be helpful to see the controller metrics, which can be retrieved by running …
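For reference, a minimal sketch of how those artifacts might be collected (assuming the control plane is installed in the default `linkerd` namespace with the standard labels; the exact metrics command referenced above was not captured here):

```sh
# Full logs from each container in the destination controller pods
kubectl logs -n linkerd -l linkerd.io/control-plane-component=destination \
  -c destination --tail=-1 > destination.log
kubectl logs -n linkerd -l linkerd.io/control-plane-component=destination \
  -c policy --tail=-1 > policy.log
kubectl logs -n linkerd -l linkerd.io/control-plane-component=destination \
  -c linkerd-proxy --tail=-1 > proxy.log

# Dump metrics from the control plane components
linkerd diagnostics controller-metrics > controller-metrics.txt
```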
I think I have a similar issue.
Thanks for all your patience with this issue, @bjoernw. I've looked at the logs that you sent me, and one thing I noticed is that the failfast errors in the destination controller only seem to start happening after that pod has received a SIGTERM signal. I believe what is likely happening is that when the pod is sent a SIGTERM, the destination controller and policy controller containers shut down before the proxy does. When the proxy gets inbound requests for these controllers, which are no longer available, it returns a failfast error. Pod shutdown ordering is a complex topic and there has been some debate about how to change the way we handle graceful shutdown here: #10379

With all of that said, this should not affect your application traffic. If a data-plane proxy gets a 503 error from the control plane (because that control plane pod is in the process of shutting down), it should retry and get load balanced to a different destination controller pod (which is hopefully not in the process of shutting down).

I want to revisit some assumptions from the original description here. You mention that there is an endpoint staleness condition where traffic gets routed to pods which no longer exist. You also mention failfast errors in the destination controller proxy logs and restarting the destination controller. Are you sure that you have the causality in the right direction? Is it possible that when you encounter endpoint staleness, you restart the destination controller, and during the course of shutting down, the destination controller proxy prints a bunch of failfast error logs?
I had to write this diagnostic script because this issue keeps happening, so it's not the cause. I actually don't see my script actively terminating these pods. It must be the health check timing out that's causing k8s to terminate the pod.
@bjoernw I think your theory about health checks failing is worth looking into in more detail. Or, more generally, determining why the kubelet is choosing to kill these pods... Does …
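One way to check why the kubelet is killing these pods (a sketch, assuming the control plane runs in the default `linkerd` namespace):

```sh
# Kubelet-initiated restarts usually show up as events (probe failures,
# OOM kills, evictions) or in the container's last terminated state
kubectl get events -n linkerd --sort-by=.lastTimestamp | grep -Ei 'unhealthy|killing|oom|evict'
kubectl describe pod -n linkerd -l linkerd.io/control-plane-component=destination | grep -A 6 'Last State'
```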
Wanted to chime in with my team's experience. Not completely sure if this is the same issue or not, but it looks very similar. We occasionally observe application pods going into a loop where we see a large number of 5xx errors. Our cluster runs on EKS, Linkerd 2.14.1. The errors start at 2023-10-07 01:46:32:
The cluster in question has about 2,000 to 3,000 pods in the mesh at load. The applications all autoscale, so pods are coming in and out constantly throughout the day. Normally everything works great, but it occasionally gets stuck and starts reporting the above errors for every request. Restarting the destination linkerd pods does fix the issue. Interestingly, restarting the individual pods having connection issues also temporarily fixes the issue, although for a much shorter period of time. For example, restarting linkerd-destination fixes the issue for a few days or a week; restarting individual failing pods fixes it for a few minutes or hours. Logs from the destination controller (the log timestamps are not really correlated and come after the application pod errors above):
There is one interesting log that does come at around the same time as the errors:
Thank you for the additional data!
Are these logs from multiple pods or a single pod? Specifically, the first quoted log occurred ~26mins after the proxy started and the second quoted log snippet occurred ~5s after the proxy started. In any case, seeing a failfast timeout 5s after pod startup strongly indicates that the control plane did not return any actively available endpoints to the data plane.
This is helpful context.
This is surprising given that the above log indicated a proxy in this state only 5s after startup. That is, the initial lookup the proxy performed would appear not to include any available endpoints. It sounds like you're running multiple destination controller replicas? Perhaps one replica is getting into a bad state while the others remain healthy; so as you restart data plane workloads, resolutions randomly hit controller pods and the problem doesn't manifest.
These logs are benign and occur during startup (we see this happened 4s after the proxy started), as the proxy attempts to connect to the local policy controller, which may take a few seconds to begin serving its gRPC server. We probably ought to find a way to avoid logging warnings during startup...
This typically indicates that a client canceled a resolution (e.g., due to a restart). These are fairly typical and probably shouldn't be logged at error level... The information you've provided lines up closely with similar reports we have received. We have been actively trying to reproduce this problem with a variety of testing strategies and have so far been unsuccessful. We are also going to put together a plan to get higher-fidelity diagnostics about the destination controller's state. We'll provide updates on this issue as that work progresses.
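In the meantime, one existing way to spot-check the destination controller's view of a given service against the cluster's actual endpoints (a sketch using a hypothetical `my-svc` service in namespace `my-ns`):

```sh
# Ask the control plane which endpoints it would hand out for an authority,
# then compare against the EndpointSlice contents
linkerd diagnostics endpoints my-svc.my-ns.svc.cluster.local:8080
kubectl get endpointslices -n my-ns -l kubernetes.io/service-name=my-svc
```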
Glad to help. If there are any additional diagnostics we could run on our cluster, I'd be happy to provide them. To answer your questions:
I should have included more info on the pods. Yes, they are from separate pods. I had pulled the logs from our central logging system, filtering on our application and the linkerd-proxy container, so I think they were interleaved. One thing we brainstormed when talking about the issue internally is that it might be an issue with the K8s/EKS control plane itself. If the k8s control plane had an increased error rate, latency, or connection errors, maybe the destination controller also fails in an unexpected way? We have a ticket open with our AWS contacts to see if we can get any more info on what happened on that side of things during the outage. We have had issues in the past where unexpectedly high load on the EKS/k8s control plane causes API errors at the k8s level, which can affect things like operators, CI/CD, and anything using the k8s APIs.
Yes, we run 3 replicas of the destination controller. One other thing to note is that the linkerd controller pods were all "healthy" and in a ready state at the time of the outage. If it is getting into a bad state, health checks were not flagging it in our cluster.
Good to know! I figured I'd include it since it stood out due to its timing.
Are you scaling nodes in this situation, too? Or is this a fixed node pool? We've kicked off a few scale/burn-in tests to exercise aggressive service churn and I'm curious whether node churn is part of the equation for you as well.
Yes, we are scaling nodes up and down as well.
We have a theory that we're chasing down and we've identified a prometheus query that can help highlight this problem:
If you don't have a prometheus setup, you can approximate this by faking a timeseries by snapshotting the control plane's state a few times:
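The exact query and snapshot commands from this comment were not captured here; as an illustrative sketch only (the grepped metric name is a placeholder, not confirmed by this thread), a periodic snapshot loop could look like:

```sh
# Snapshot destination controller metrics every 30s so update counters can be
# compared across instances; substitute the metric from the query above.
for i in $(seq 1 10); do
  linkerd diagnostics controller-metrics | grep -i 'endpoint' \
    > "controller-metrics-$(date +%s).txt"
  sleep 30
done
```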
We expect that all service updates should flat-line for a single destination controller instance. Something like: We also expect that other destination controller instances should not experience any directly correlated failures of this nature. We're still working to confirm our hypothesis and to ultimately reproduce this issue, but in the meantime, capturing data from incidents of this failure will most effectively help to confirm or disprove our theory.
## edge-23.10.3

This edge release fixes issues in the proxy and destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, it contains other bugfixes and updates dependencies to include patches for the security advisories [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 and GHSA-c827-hfw6-qwvm.

* Fixed an issue where the Destination controller could stop processing changes in the endpoints of a destination if a proxy subscribed to that destination stops reading service discovery updates. This issue results in proxies attempting to send traffic for that destination to stale endpoints ([#11483], fixes [#11480], [#11279], and [#10590])
* Fixed a regression introduced in stable-2.13.0 where proxies would not terminate unused service discovery watches, exerting backpressure on the Destination controller which could cause it to become stuck ([linkerd2-proxy#2484] and [linkerd2-proxy#2486])
* Added `INFO`-level logging to the proxy when endpoints are added or removed from a load balancer. These logs are enabled by default, and can be disabled by [setting the proxy log level][proxy-log-level] to `warn,linkerd=info,linkerd_proxy_balance=warn` or similar ([linkerd2-proxy#2486])
* Fixed a regression where the proxy rendered `grpc_status` metric labels as a string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes [#11449])
* Added missing `imagePullSecrets` to `linkerd-jaeger` ServiceAccount ([#11504])
* Updated the control plane's dependency on the `google.golang.org/grpc` Go package to include patches for [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 ([#11496])
* Updated dependencies on `rustix` to include patches for GHSA-c827-hfw6-qwvm ([linkerd2-proxy#2488] and [#11512])

[#10590]: #10590
[#11279]: #11279
[#11449]: #11449
[#11480]: #11480
[#11483]: #11483
[#11496]: #11496
[#11504]: #11504
[#11512]: #11512
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484
[linkerd2-proxy#2486]: linkerd/linkerd2-proxy#2486
[linkerd2-proxy#2488]: linkerd/linkerd2-proxy#2488
[proxy-log-level]: https://linkerd.io/2.14/tasks/modifying-proxy-log-level/
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
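For reference, the proxy log level mentioned in the notes above can be set per-workload via the `config.linkerd.io/proxy-log-level` annotation (a sketch using a hypothetical `my-app` deployment in namespace `my-ns`):

```sh
# Quiet the new balancer endpoint logs on one workload; pods pick up the new
# level when they are recreated
kubectl -n my-ns annotate deployment my-app \
  'config.linkerd.io/proxy-log-level=warn,linkerd=info,linkerd_proxy_balance=warn' --overwrite
kubectl -n my-ns rollout restart deployment my-app
```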
The latest edge release, …
This stable release fixes issues in the proxy and Destination controller which can result in Linkerd proxies sending traffic to stale endpoints. In addition, it contains a bug fix for profile resolutions for pods bound on host ports and includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3.

* Control Plane
  * Fixed an issue where the Destination controller could stop processing changes in the endpoints of a destination if a proxy subscribed to that destination stops reading service discovery updates. This issue results in proxies attempting to send traffic for that destination to stale endpoints ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod metadata for profile resolutions for a pod accessed via the host network (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several dependencies (including Go's gRPC and net libraries)
* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as a string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not terminate unused service discovery watches, exerting backpressure on the Destination controller which could cause it to become stuck ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11334]: #11334
[#11449]: #11449
[#11480]: #11480
[#11483]: #11483
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

Signed-off-by: Matei David <[email protected]>
We are confident that the destination controller staleness issues caused by a deadlock condition are fixed in stable-2.14.2, which is now available. Please feel free to reopen if you encounter this issue after upgrading.
@DavidMcLaughlin Could you please apply it to …
Thanks, Team! |
What is the issue?
We use canaries with Argo Rollouts via SMI, meaning there is a lot of EndpointSlice churn due to frequent release trains and their 10-15 minute canary rollouts (~300-400 meshed pods).
Occasionally we will see the destination service go into failfast mode, which is problematic during a canary deployment because old IPs of pods that were already torn down by Argo Rollouts are still receiving traffic even though they should not. To remedy this, we need to restart the destination deployment.
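A rolling restart of the destination deployment can be done with, for example (assuming the control plane is installed in the default `linkerd` namespace):

```sh
# Restart the destination controller pods and wait for the rollout to finish
kubectl -n linkerd rollout restart deploy/linkerd-destination
kubectl -n linkerd rollout status deploy/linkerd-destination
```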
How can it be reproduced?
This happens intermittently under load. We run 5 instances of each control plane component.
Logs, error output, etc
The following are all logs from destination pods:
Output of `linkerd check -o short`:
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None