
Linkerd-proxy occasionally consumes 100% CPU for Nginx ingress controllers #4329

Closed
ericsuhong opened this issue May 4, 2020 · 21 comments

Comments

@ericsuhong

ericsuhong commented May 4, 2020

Bug Report

What is the issue?

We have noticed that the linkerd-proxy containers injected into our Nginx ingress controllers occasionally get stuck in a "tight loop" and start consuming their maximum CPU (1000 millicores):

[screenshot: CPU usage graph showing a linkerd-proxy container pinned at its 1000 millicore limit]

This does not appear to be caused by traffic load, as it has been reproduced in test clusters where only a negligible volume of traffic hits the Nginx ingress controllers.

How can it be reproduced?

  1. Install Linkerd 2.7.0
  2. Install nginx-ingress-controller 0.26.2
  3. Inject linkerd-proxy into the nginx-ingress-controllers (see the sketch after this list)
  4. The linkerd-proxy injected into the nginx ingress controllers will sometimes get stuck consuming 100% CPU
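
For reference, one way to do step 3 is to pipe the existing deployment through linkerd inject (a rough sketch; the namespace and deployment names below are placeholders for your own install):

# Re-inject the ingress controller deployment with the linkerd-proxy sidecar
kubectl get deploy nginx-ingress-controller -n ingress-nginx -o yaml \
  | linkerd inject - \
  | kubectl apply -f -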

Logs, error output, etc

I did not see any meaningful errors or requests around the time when linkerd2-proxy started consuming high CPU.

However, afterwards I see the following warning messages:

[ 95123.867830646s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[105714.967824222s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[107631.467845310s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[112749.870852432s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[113425.767789250s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[125833.29797704s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[130909.326794294s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[140372.867794360s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust roots are using supported crypto algorithm
√ trust roots are within their validity period
√ trust roots are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust root

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.7.0 but the latest stable version is 2.7.1
    see https://linkerd.io/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.7.0 but cli running stable-2.7.1
    see https://linkerd.io/checks/#l5d-version-control for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if HA mode is enabled
    see https://linkerd.io/checks/#l5d-injection-disabled for hints

Environment

  • Kubernetes Version: 1.16.9
  • Cluster Environment: Azure (aks-engine)
  • Host OS: Ubuntu 16.04
  • Linkerd version: 2.7.0

Possible solution

Additional context

Same issue as #3785

@cpretzer
Contributor

cpretzer commented May 4, 2020

Thanks for sharing this, @ericsuhong

I'm running a test now to see if I can reproduce similar results. Can you run the commands below to get the metrics from the Linkerd proxy exhibiting this behavior?

kubectl port-forward po/<pod-name> 4191
curl http://localhost:4191/metrics
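
If your version of the linkerd CLI includes it, the linkerd metrics subcommand should be a roughly equivalent shortcut (namespace and pod name are placeholders):

linkerd metrics -n <namespace> po/<pod-name>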

@ericsuhong
Author

Just grabbed the result from the affected linkerd-proxy container:

linkerd-proxy-metrics.txt

@cpretzer
Contributor

cpretzer commented May 4, 2020

Thanks @ericsuhong, there's a good amount of info here, so I'll sift through it and see what I can find.

Is it possible to get the access log from the nginx ingress controller at debug log level? That will help show where the requests are coming from and where they are going. If there's sensitive info in there, you can email it or DM it to me on Slack.
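
In case it helps, a rough sketch of one way to raise the controller's log verbosity via the ingress-nginx ConfigMap's error-log-level key (the namespace, ConfigMap, and deployment names below are placeholders for your install):

# Raise the nginx error log level to debug
kubectl -n ingress-nginx patch configmap nginx-configuration \
  --type merge -p '{"data":{"error-log-level":"debug"}}'
# Tail the controller logs (access logs are enabled by default)
kubectl -n ingress-nginx logs -f deploy/nginx-ingress-controller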

@ericsuhong
Author

@cpretzer Just to confirm, you want debug-level logs from linkerd-proxy containers right? (not ingress controller logs)

@cpretzer
Contributor

cpretzer commented May 4, 2020

I'm glad you clarified. I'm definitely looking for the logs from the nginx ingress controller, not the linkerd proxy.

@ericsuhong
Author

I looked at the (non-debug) nginx-ingress logs and did not find any requests hitting the ingress controller during the period when linkerd-proxy started to spike. We will enable debug logs for both linkerd-proxy and the ingress controllers, and will provide results when this issue reproduces again.

To make your debugging easier, here are metrics from both a normal and an affected linkerd-proxy for ingress controllers on the same node:

linkerd-proxy-normal.txt
linkerd-proxy-affected.txt

@cpretzer
Contributor

cpretzer commented May 4, 2020

@ericsuhong I ran a quick test using the nginx-ingress-controller version 0.26.2 and saw some similar behavior.

Can you try upgrading the ingress controller to the latest version? I used 0.30.0, which is the latest in the Helm stable repo, and did not see any CPU issues.

@cpretzer
Contributor

cpretzer commented May 5, 2020

@ericsuhong In my continued testing, I found that I can trigger 503 errors and increased CPU usage by configuring a load generator to use the service name of the ingress controller.

Do you have any services in your cluster that call the nginx ingress by its internal Service name?

Specifically, I'm using the emojivoto application in my tests and the load generating service, vote-bot, has an environment variable that is used to configure the host and port to which the requests are sent.

When I configure the WEB_HOST environment variable to point to nginx-ingress-controller.emojivoto:80, the nginx-ingress-controller logs show a ton of 500 errors.

However, if I configure WEB_HOST to point to <public-ip>:80 with an Ingress resource configured to route traffic to the emojivoto application, then there are no errors and the CPU usage remains normal.

I think that this has to do with headers not being passed properly, which is causing the "tight loop" behavior that you describe.
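
For anyone trying to reproduce this, a rough sketch of the two configurations described above, assuming the emojivoto demo and its vote-bot deployment (which reads WEB_HOST); the public IP is a placeholder:

# routing to the ingress by its internal Service name: 500s and high CPU
kubectl -n emojivoto set env deploy/vote-bot WEB_HOST=nginx-ingress-controller.emojivoto:80

# routing via the public IP and the Ingress resource: no errors, normal CPU
kubectl -n emojivoto set env deploy/vote-bot WEB_HOST=<public-ip>:80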

@ericsuhong
Author

ericsuhong commented May 5, 2020

@cpretzer No, we do not have any services calling the nginx ingress by its internal service name (all traffic comes from external clients using the DNS name). We also do not see any 500/503 errors in the logs (in fact, we do not see any traffic hitting the controller at all during this time period).

We are also upgrading the nginx ingress controller to 0.30.0 as you suggested. We will be running in PROD with all debug logs enabled so that the next time this reproduces, we can capture debug logs for investigation.

@cpretzer
Contributor

cpretzer commented May 5, 2020

thanks for the update @ericsuhong

Can you share your Ingress resource definition? I meant to ask you for that sooner

@ericsuhong
Author

Here it is:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: main
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
      grpc_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
    nginx.ingress.kubernetes.io/cors-allow-headers: authorization,cache-control,x-ms-client-request-id,x-ms-client-session-id,x-ms-command-name,x-ms-effective-locale,request-id,correlation-context,traceparent,tracestate,content-type
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-hide-headers: server
  name: myfoo-ingress
  namespace: mynamespace
spec:
  rules:
  - host: myfooservice.com
    http:
      paths:
      - backend:
          serviceName: myfoo-service
          servicePort: 80
        path: /
  tls:
  - hosts:
    - myfooservice.com

@cpretzer
Contributor

@ericsuhong just following up on this. Were you able to reproduce the behavior with an updated version of the nginx ingress controller?

After running my test environment for about five days, I wasn't able to reproduce the behavior you described. I'm going to give it one more try, though.

The one thing that we didn't get from you is the nginx access logs. Were you able to collect those?

@ericsuhong
Author

ericsuhong commented May 12, 2020

We had to revert the nginx ingress controller back to 0.26.2 because we found a regression related to TLS cipher suites.

We are currently running with edge-20.5.1, and have not faced this issue yet.

Regarding nginx access logs, there were literally no access logs during this time period when linkerd-proxy started to consume 100% CPU.

@cpretzer
Contributor

@ericsuhong that's good news about the edge release.

I looked through the metrics files that you sent and focused on the DNS requests, because the warning message in the original post mentioned trust_dns_proto. Nothing in the logs stood out, though.

Did you ever get a trace log from the Linkerd proxy when this happened?

@aledbf

aledbf commented May 13, 2020

We had to revert the nginx ingress controller back to 0.26.2 because we found a regression related to TLS cipher suites.

What regression? Can you post more details?

@ericsuhong
Author

The default TLS cipher suites used by the ingress controller changed, and our internal tool was flagging us for using "weak" TLS cipher suites.

Later, we found out that the cipher suites can be overridden using the ssl-ciphers configuration value, so this is actually not blocking us from upgrading the ingress controllers anymore. But we just decided to stay with 0.26.2 for now.
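
For reference, a rough sketch of overriding the cipher suites through the ingress-nginx ConfigMap's ssl-ciphers key (the namespace, ConfigMap name, and cipher list below are placeholders):

# Replace the controller's default TLS cipher suites
kubectl -n ingress-nginx patch configmap nginx-configuration \
  --type merge -p '{"data":{"ssl-ciphers":"ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384"}}'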

@ericsuhong
Author

@cpretzer unfortunately, all the logs are gone. Also, trace logs are super expensive, and we are not going to enable them in our PROD environment.

However, we are running linkerd in PROD at the debug log level, so the next time we face this issue, we will post the debug log messages here.
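
For reference, one way to set the proxy to debug is the config.linkerd.io/proxy-log-level annotation on the pod template (a rough sketch; namespace and deployment names are placeholders, and the patch triggers a rollout so the re-injected proxies come up at the new level):

# Set the Linkerd proxy log level via the injector annotation
kubectl -n ingress-nginx patch deploy nginx-ingress-controller --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"debug"}}}}}'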

@aledbf

aledbf commented May 14, 2020

The default TLS cipher suites used by the ingress controller changed, and our internal tool was flagging us for using "weak" TLS cipher suites.

This is fixed in 0.32.0 kubernetes/ingress-nginx#5490

But we just decided to stay with 0.26.2 for now.

Actually, you should update to 0.32.0. Between these two releases there are several fixes related to high CPU utilization, as well as NGINX fixes related to HTTP/2 and SSL.

@ericsuhong
Author

ericsuhong commented May 14, 2020

We will certainly do so!

We just had so many variables in play during our last PROD deployments that we decided to lock down the ingress controller version to eliminate one of them.

@cpretzer
Contributor

@ericsuhong totally understand about the logging. If this happens again with the latest edge, please do collect whatever logs you can so that I can have a look.

@aledbf Thanks for sharing that!

@stale

stale bot commented Aug 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 16, 2020
@stale stale bot closed this as completed Aug 30, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021