Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Destination controller goes into failfast mode #11279

Closed
bjoernw opened this issue Aug 21, 2023 · 17 comments
Closed

Destination controller goes into failfast mode #11279

bjoernw opened this issue Aug 21, 2023 · 17 comments

Comments

@bjoernw
Copy link

bjoernw commented Aug 21, 2023

What is the issue?

We use canaries with ArgoRollout via SMI meaning there is a lot of endpointslice churn due to frequent release trains and their 10-15min canary rollouts. (~300-400 meshed pods)

Occasionally we will see the destination service go into failfast mode, which is problematic during a canary deployment because old IPs of pods that were already torn down by ArgoRollout are still receiving traffic even though they should not. To remedy this we need to restart the destination deployment.

How can it be reproduced?

This happens intermittently under load. We run 5 instances of each control plane component.

Logs, error output, etc

The following are all logs from destination pods:

[ 18093.017333s]  WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed
[951475.788267s]  WARN ThreadId(01) linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[950833.887787s]  WARN ThreadId(01) inbound:server{port=8086}:controller{addr=localhost:8086}: linkerd_stack::failfast: Service entering failfast after 1s
[950842.896953s]  WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}: linkerd_stack::failfast: Service entering failfast after 10s
[ 18093.017325s]  WARN ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.243.33:34358}: linkerd_app_inbound::http::server: Unexpected error error=client 10.30.243.33:34358: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]
[     3.751136s]  WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 14118.108714s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107) error.sources=[Transport endpoint is not connected (os error 107)]
[ 18093.017344s]  INFO ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.156.57:57528}: linkerd_app_core::errors::respond: gRPC request failed error=client 10.30.156.57:57528: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]

output of linkerd check -o short

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on --------------
    see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.13.4 but the latest stable version is 2.13.6
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane and cli versions match
    control plane running stable-2.13.6 but cli running stable-2.13.4
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies and cli versions match
    linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies and cli versions match
    linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

  • v1.24.16-eks
  • linkerd: 2.13.6

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

@bjoernw
Copy link
Author

bjoernw commented Sep 5, 2023

Just hit this issue again with Connection refused (os error 111) on the destination controller that caused a 2min outage that ultimately resolved itself because i have a diagnostic script that monitors the destination controller in particular. Any update on the triaging of this issue @adleong ?

@bjoernw
Copy link
Author

bjoernw commented Sep 5, 2023

Another data point:
I have a diagnostics script that runs linkerd diagnostics against every destination pod every minute and diffs the IPs it knows about with the IPs that the k8s endpointslices controller knows about for 5 different services. Looks like the diagnostics command timed out in this instance (which causes the script to restart the offending destination pod).

Maybe some race condition that locks up the entire destination controller process? Only 5 of the 10 pods had this issue and were rebooted by the script

@adleong
Copy link
Member

adleong commented Sep 11, 2023

Hi @bjoernw, sorry for the delay in responding to this. There are a couple of things that would be super helpful for digging deeper into what's going on here.

I see that you've included some of the proxy logs from the destination controller pod. It would also be great to see logs from the destination container in that pod and the policy container in that pod as well. Wherever possible, getting full logs rather than snippets will help establish context. It would also be great to see the controller metrics, which can be retrieved by running linkerd diagnostics controller-metrics.

@omidraha
Copy link

I think I have similar issue.
And also I added the output of this linkerd diagnostics controller-metrics command.

@adleong
Copy link
Member

adleong commented Sep 25, 2023

Thanks for all your patience with this issue, @bjoernw. I've looked at the logs that you sent me, and one thing that I noticed, is that the failfast errors in the destination controller only seem to start happening after that pod has received a sigterm signal.

I believe that what is likely happening is that when the pod is sent a sigterm, the destination controller and policy controller containers shut down before the proxy does. When the proxy gets inbound requests for these controllers which are no longer available, it returns a failfast error.

Pod shutdown ordering is a complex topic and there has been some debate about how to change the way we handle graceful shutdown here: #10379

With all of that said, this should not affect your application traffic. If a data-plane proxy gets a 503 error from the control plane (because that control plane pod is in the process of shutting down), it should retry it and get load balanced to a different destination controller pod (which is hopefully not in the process of shutting down).

I want revisit some assumptions from the original description here. You mention that there is a endpoint staleness condition where traffic gets routed to pods which no longer exist. You also mention failfast errors in the destination controller proxy logs and restarting the destination controller. Are you sure that you have the causality in the right direction? Is it possible that when you encounter endpoint staleness, you restart the destination controller, and during the course of shutting down, the destination controller proxy prints a bunch of failfast error logs?

@bjoernw
Copy link
Author

bjoernw commented Sep 26, 2023

I had to write this diagnostic script because this issue keeps happening so it's not the cause. I actually don't see my script actively terminating these pods. It must be the health check timing out that's causing k8s to terminate the pod.

@olix0r
Copy link
Member

olix0r commented Sep 28, 2023

@bjoernw I think your theory about health checks failing is worth looking into in more detail. Or more generally determining why the kubelet is choosing to kill these pods...

Does kubectl describe indicate that health checks have failed? For which ports? Are there other events that might indicate why Kubernetes is terminating these processes?

@thjwhite
Copy link

thjwhite commented Oct 9, 2023

Wanted to chime in with my team's experience. Not completely sure if this is the same issue or not, but it look very similar. We occasionally observe application pods going into a loop where we see a large number of 5xx errors. Our cluster runs on EKS, linkerd 2.14.1 .
Log Lines from application pod's linkerd-proxy:

Starts at 2023-10-07 01:46:32

[ 1577.793693s] WARN ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}:service{ns=yy name=xx port=50051}:endpoint{addr=10.43.52.17:50051}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

[ 5.540692s] INFO ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}:rescue{client.addr=10.43.26.109:57910}: linkerd_app_core::errors::respond: gRPC request failed error=logical service xx.yy.svc.cluster.local:50051: service in fail-fast error.sources=[service in fail-fast]
[ 5.540438s] WARN ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}: linkerd_stack::failfast: Service entering failfast after 3s

The cluster in question has about 2000, to 3000 pods in the mesh at load. The applications all autoscale, so pods are coming in and out constantly throughout the day. Normally, everything is working great, but it occasionally get's stuck and starts reporting the above errors for every request. Restarting the destination linkerd pods does fix the issue. Interesting also, is that restarting the individual pods having connection issues also temporarily fixes the issue, although for much shorter period of time. For example, restarting linkerd-destination fixes the issue for a few days or a week. Restarting individual failing pods, fixes the issue for a few minutes or hours.

Logs from the destination controller. The logs timestamps are not really correlated and come after the pod application errors from above
Starts at 2023-10-07 05:55:47

[ 4.771258s] WARN ThreadId(01) watch{port=4191}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 4.269445s] WARN ThreadId(01) watch{port=4191}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

There is one interesting log that does come at around the same time as the errors:
Timestamp 2023-10-07 01:27:04

time="2023-10-07T08:27:04Z" level=error msg="failed to send protocol update: rpc error: code = Canceled desc = context canceled" addr=":8086" component=server ns=zz pod=cc

@olix0r
Copy link
Member

olix0r commented Oct 10, 2023

@thjwhite

Wanted to chime in with my team's experience. Not completely sure if this is the same issue or not, but it look very similar. We occasionally observe application pods going into a loop where we see a large number of 5xx errors. Our cluster runs on EKS, linkerd 2.14.1 .

Thank you for the additional data!

Log Lines from application pod's linkerd-proxy:

Starts at 2023-10-07 01:46:32

[ 1577.793693s] WARN ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}:service{ns=yy name=xx port=50051}:endpoint{addr=10.43.52.17:50051}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

[ 5.540692s] INFO ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}:rescue{client.addr=10.43.26.109:57910}: linkerd_app_core::errors::respond: gRPC request failed error=logical service xx.yy.svc.cluster.local:50051: service in fail-fast error.sources=[service in fail-fast]
[ 5.540438s] WARN ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}: linkerd_stack::failfast: Service entering failfast after 3s

Are these logs from multiple pods or a single pod? Specifically, the first quoted log occurred ~26mins after the proxy started and the second quoted log snippet occurred ~5s after the proxy started.

In any case, seeing a failfast timeout 5s after pod startup strongly indicates that the control plane did not return any actively available endpoints to the data plane.

The cluster in question has about 2000, to 3000 pods in the mesh at load. The applications all autoscale, so pods are coming in and out constantly throughout the day. Normally, everything is working great, but it occasionally get's stuck and starts reporting the above errors for every request. Restarting the destination linkerd pods does fix the issue. Interesting also, is that restarting the individual pods having connection issues also temporarily fixes the issue, although for much shorter period of time. For example, restarting linkerd-destination fixes the issue for a few days or a week.

This is helpful context.

Restarting individual failing pods, fixes the issue for a few minutes or hours.

This is surprising given that the above log indicated a proxy in this state only 5s after startup. That is, the initial lookup the proxy performed would appear not to include any available endpoints.

It sounds like you're running multiple destination controller replicas? Perhaps one replica is getting into a bad state while the others remain healthy; so as you restart data plane workloads, resolutions randomly hit controller pods and the problem doesn't manifest.

Logs from the destination controller. The logs timestamps are not really correlated and come after the pod application errors from above Starts at 2023-10-07 05:55:47

[ 4.771258s] WARN ThreadId(01) watch{port=4191}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 4.269445s] WARN ThreadId(01) watch{port=4191}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

These logs are benign and occur during startup (we see this happened 4s after the proxy started), as the proxy attempts to connect to the local policy controller, which may take a few seconds to begin serving its gRPC server. We probably ought to find a way to avoid logging warnings during startup...

There is one interesting log that does come at around the same time as the errors: Timestamp 2023-10-07 01:27:04

time="2023-10-07T08:27:04Z" level=error msg="failed to send protocol update: rpc error: code = Canceled desc = context canceled" addr=":8086" component=server ns=zz pod=cc

This typically indicates that a client canceled a resolution (i.e. due to a restart, etc). These are fairly typical and probably shouldn't be logged at error level...


The information you've provided lines up closely with similar reports we have received. We have been actively trying to reproduce this problem with a variety of testing strategies and have so far been unsuccessful.

Also, we are going to put together a plan to get higher fidelity diagnostics about the destination controller's state.

We'll provide updates on this issue as this work progresses.

@thjwhite
Copy link

Glad to help. if there is any additional diagnostics we could run on our cluster, I'd be happy to provide that.

To answer your questions:

Are these logs from multiple pods or a single pod? Specifically, the first quoted log occurred ~26mins after the proxy started and the second quoted log snippet occurred ~5s after the proxy started.
In any case, seeing a failfast timeout 5s after pod startup strongly indicates that the control plane did not return any actively available endpoints to the data plane.

I should have included more info on the pods. Yes, they are from separate pods. I had pulled the logs from our central logging system filtering on our application and the linkerd-proxy container so I think they were interleaved. One thing we had brainstormed when talking about the issue internally, is it might be an issue with the K8s/EKS control plane itself? Like if the k8s control plane had an increased error rate, latency, or connection errors, maybe the destination controller also fails in an unexpected way? We have a ticket open with our AWS contacts to see if we can get any more info on what happened on that side of things during the outage. We have had issues in the past where an unexpected high load on EKS/k8s control plane causes API errors at the k8s level, which can effect things like operators, CI/CD, and anything using the k8s apis.

It sounds like you're running multiple destination controller replicas? Perhaps one replica is getting into a bad state while the others remain healthy; so as you restart data plane workloads, resolutions randomly hit controller pods and the problem doesn't manifest.

Yes we run 3 replicas of destination controller. One other thing to note, is that the linkerd controller pods all were "healthy" and in ready state at the time of the outage. If it is getting into a bad state, health checks were not flagging it in our cluster.

This typically indicates that a client canceled a resolution (i.e. due to a restart, etc). These are fairly typical and probably shouldn't be logged at error level...

Good to know that! I figured I'd include it since it stood out due to the timing of it.

@olix0r
Copy link
Member

olix0r commented Oct 11, 2023

@thjwhite

The applications all autoscale, so pods are coming in and out constantly throughout the day.

Are you scaling nodes in this situation, too? or is this in a fixed node pool?

We've kicked off a few scale/burn-in tests to exercise aggressive service churn and I'm curious if node churn is a part of the equation for you as well.

@thjwhite
Copy link

Yes, we are scaling up/down nodes as well.

@olix0r
Copy link
Member

olix0r commented Oct 12, 2023

We have a theory that we're chasing down and we've identified a prometheus query that can help highlight this problem:

sum (endpoints_updates{component='destination'}) by(instance, service, port)

If you don't have a prometheus setup, you can approximate this by faking a timeseries by snapshotting the control plane's state a few times:

date && linkerd diagnostics controller-metrics && sleep 60 && date && linkerd diagnostics controller-metrics

We expect that all service updates should flat-line for a single destination controller instance. Something like:

image

We also expect that other destination controller instances should not experience any directly correlated failures of this nature.

We're still working to confirm our hypothesis and to ultimately reproduce this issue, but in the meantime capturing data from incidents of this failure will most effectively help to confirm or disprove our theory.

@hawkw hawkw mentioned this issue Oct 19, 2023
hawkw added a commit that referenced this issue Oct 19, 2023
## edge-23.10.3

This edge release fixes issues in the proxy and destination controller which can
result in Linkerd proxies sending traffic to stale endpoints. In addition, it
contains other bugfixes and updates dependencies to include patches for the
security advisories [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 and GHSA-c827-hfw6-qwvm.

* Fixed an issue where the Destination controller could stop processing
  changes in the endpoints of a destination, if a proxy subscribed to that
  destination stops reading service discovery updates. This issue results in
  proxies attempting to send traffic for that destination to stale endpoints
  ([#11483], fixes [#11480], [#11279], and [#10590])
* Fixed a regression introduced in stable-2.13.0 where proxies would not
  terminate unused service discovery watches, exerting backpressure on the
  Destination controller which could cause it to become stuck
  ([linkerd2-proxy#2484] and [linkerd2-proxy#2486])
* Added `INFO`-level logging to the proxy when endpoints are added or removed
  from a load balancer. These logs are enabled by default, and can be disabled
  by [setting the proxy log level][proxy-log-level] to
  `warn,linkerd=info,linkerd_proxy_balance=warn` or similar
  ([linkerd2-proxy#2486])
* Fixed a regression where the proxy rendered `grpc_status` metric labels as a
  string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes
  [#11449])
* Added missing `imagePullSecrets` to `linkerd-jaeger` ServiceAccount ([#11504])
* Updated the control plane's dependency on the `golang.google.org/grpc` Go
  package to include patches for [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 ([#11496])
* Updated dependencies on `rustix` to include patches for GHSA-c827-hfw6-qwvm
  ([linkerd2-proxy#2488] and [#11512]).

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11449]: #11449
[#11480]: #11480
[#11504]: #11504
[#11504]: #11512
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484
[linkerd2-proxy#2486]: linkerd/linkerd2-proxy#2486
[linkerd2-proxy#2488]: linkerd/linkerd2-proxy#2488
[proxy-log-level]: https://linkerd.io/2.14/tasks/modifying-proxy-log-level/
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
@hawkw
Copy link
Contributor

hawkw commented Oct 20, 2023

The latest edge release, edge-23.10.3, contains changes to the proxy and Destination controller which should resolve these issues. We'd love your help validating this fix by trying out edge-23.10.3 in your environments and letting us know if you encounter any more use of stale endpoints. Thanks!

@adleong adleong removed their assignment Oct 26, 2023
mateiidavid added a commit that referenced this issue Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which
can result in Linkerd proxies sending traffic to stale endpoints. In addition,
it contains a bug fix for profile resolutions for pods bound on host ports and
includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3

* Control Plane
  * Fixed an issue where the Destination controller could stop processing
    changes in the endpoints of a destination, if a proxy subscribed to that
    destination stops reading service discovery updates. This issue results in
    proxies attempting to send traffic for that destination to stale endpoints
    ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod
    metadata for profile resolutions for a pod accessed via the host network
    (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several
    dependencies (including Go's gRPC and net libraries)

* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as
    a string rather than as the numeric status code ([linkerd2-proxy#2480];
    fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not
    terminate unusred service discovery watches, exerting backpressure on the
    Destination controller which could cause it to become stuck
    ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

Signed-off-by: Matei David <[email protected]>
mateiidavid added a commit that referenced this issue Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which
can result in Linkerd proxies sending traffic to stale endpoints. In addition,
it contains a bug fix for profile resolutions for pods bound on host ports and
includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3

* Control Plane
  * Fixed an issue where the Destination controller could stop processing
    changes in the endpoints of a destination, if a proxy subscribed to that
    destination stops reading service discovery updates. This issue results in
    proxies attempting to send traffic for that destination to stale endpoints
    ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod
    metadata for profile resolutions for a pod accessed via the host network
    (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several
    dependencies (including Go's gRPC and net libraries)

* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as
    a string rather than as the numeric status code ([linkerd2-proxy#2480];
    fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not
    terminate unusred service discovery watches, exerting backpressure on the
    Destination controller which could cause it to become stuck
    ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

---------

Signed-off-by: Matei David <[email protected]>
Co-authored-by: Alejandro Pedraza <[email protected]>
Co-authored-by: Oliver Gould <[email protected]>
@DavidMcLaughlin
Copy link
Contributor

We are confident that destination controller staleness issues due to a deadlock condition are now available in stable-2.14.2. Please feel free to reopen if you encounter this issue after upgrading.

@hadican
Copy link

hadican commented Oct 31, 2023

@DavidMcLaughlin Could you please apply it to 2.13 if applicable?

@bjoernw
Copy link
Author

bjoernw commented Nov 2, 2023

Thanks, Team!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants