
Linkerd-proxy occasionally consumes 100% CPU for Nginx ingress controllers #4329

Closed
ericsuhong opened this issue May 4, 2020 · 21 comments

Comments

@ericsuhong

ericsuhong commented May 4, 2020

Bug Report

What is the issue?

We have noticed that the linkerd-proxy containers injected into our Nginx ingress controllers occasionally get stuck in a "tight loop" and start consuming their maximum CPU (1000 millicores):

[screenshot: CPU usage graph showing a linkerd-proxy container pinned at its 1000 millicore limit]

This does not appear to be caused by traffic load, as it has been reproduced in test clusters where only a negligible volume of traffic hits the Nginx ingress controllers.

How can it be reproduced?

  1. Install Linkerd 2.7.0
  2. Install nginx-ingress-controller 0.26.2
  3. Inject linkerd-proxy into the nginx-ingress-controllers (see the sketch after this list)
  4. The linkerd-proxy injected into the nginx ingress controllers will sometimes get stuck consuming 100% CPU
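
For reference, one way to do step 3 is to pipe the existing deployment through linkerd inject (a rough sketch; the namespace and deployment names below are placeholders for your own install):

# Re-inject the ingress controller deployment with the linkerd-proxy sidecar
kubectl get deploy nginx-ingress-controller -n ingress-nginx -o yaml \
  | linkerd inject - \
  | kubectl apply -f -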

Logs, error output, etc

I did not see any meaningful errors or requests around the time when linkerd2-proxy started consuming high CPU.

However, afterwards I see the following warning messages:

[ 95123.867830646s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[105714.967824222s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[107631.467845310s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[112749.870852432s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[113425.767789250s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[125833.29797704s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[130909.326794294s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    
[140372.867794360s]  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender    

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust roots are using supported crypto algorithm
√ trust roots are within their validity period
√ trust roots are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust root

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.7.0 but the latest stable version is 2.7.1
    see https://linkerd.io/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.7.0 but cli running stable-2.7.1
    see https://linkerd.io/checks/#l5d-version-control for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if HA mode is enabled
    see https://linkerd.io/checks/#l5d-injection-disabled for hints

Environment

  • Kubernetes Version: 1.16.9
  • Cluster Environment: Azure (aks-engine)
  • Host OS: Ubuntu 16.04
  • Linkerd version: 2.7.0

Possible solution

Additional context

Same issue as #3785

@cpretzer
Contributor

cpretzer commented May 4, 2020

Thanks for sharing this, @ericsuhong

I'm running a test now to see if I can reproduce similar results. Can you run the commands below to get the metrics from the Linkerd proxy exhibiting this behavior?

kubectl port-forward po/<pod-name> 4191
curl http://localhost:4191/metrics
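
If your version of the linkerd CLI includes it, the linkerd metrics subcommand should be a roughly equivalent shortcut (namespace and pod name are placeholders):

linkerd metrics -n <namespace> po/<pod-name>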

@ericsuhong
Author

Just grabbed the result from the affected linkerd-proxy container:

linkerd-proxy-metrics.txt

@cpretzer
Contributor

cpretzer commented May 4, 2020

Thanks @ericsuhong, there's a good amount of info here, so I'll sift through it and see what I can find.

Is it possible to get the access log from the nginx ingress controller at debug log level? That will help show where the requests are coming from and where they are going. If there's sensitive info in there, you can email it or DM it to me on Slack.
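
In case it helps, a rough sketch of one way to raise the controller's log verbosity via the ingress-nginx ConfigMap's error-log-level key (the namespace, ConfigMap, and deployment names below are placeholders for your install):

# Raise the nginx error log level to debug
kubectl -n ingress-nginx patch configmap nginx-configuration \
  --type merge -p '{"data":{"error-log-level":"debug"}}'
# Tail the controller logs (access logs are enabled by default)
kubectl -n ingress-nginx logs -f deploy/nginx-ingress-controller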

@ericsuhong
Author

@cpretzer Just to confirm, you want debug-level logs from linkerd-proxy containers right? (not ingress controller logs)

@cpretzer
Contributor

cpretzer commented May 4, 2020

I'm glad you clarified. I'm definitely looking for the logs from the nginx ingress controller, not the linkerd proxy.

@ericsuhong
Author

I looked at the (non-debug) nginx-ingress logs and did not find any requests hitting the ingress controller during the period when linkerd-proxy started to spike. We will enable debug logs for both linkerd-proxy and the ingress controllers, and will provide results when this issue reproduces again.

To make your debugging easier, here are metrics from both a normal and an affected linkerd-proxy for ingress controllers on the same node:

linkerd-proxy-normal.txt
linkerd-proxy-affected.txt

@cpretzer
Contributor

cpretzer commented May 4, 2020

@ericsuhong I ran a quick test using the nginx-ingress-controller version 0.26.2 and saw some similar behavior.

Can you try upgrading the ingress controller to the latest version? I used 0.30.0, which is the latest in the Helm stable repo, and did not see any CPU issues.

@cpretzer
Contributor

cpretzer commented May 5, 2020

@ericsuhong In my continued testing, I found that I can trigger 503 errors and increased CPU usage by configuring a load generator to use the service name of the ingress controller.

Do you have any services in your cluster that call the nginx ingress by its internal Service name?

Specifically, I'm using the emojivoto application in my tests and the load generating service, vote-bot, has an environment variable that is used to configure the host and port to which the requests are sent.

When I configure the WEB_HOST environment variable to point to nginx-ingress-controller.emojivoto:80, the nginx-ingress-controller logs show a ton of 500 errors.

However, if I configure WEB_HOST to point to <public-ip>:80 with an Ingress resource configured to route traffic to the emojivoto application, then there are no errors and the CPU usage remains normal.

I think that this has to do with headers not being passed properly, which is causing the "tight loop" behavior that you describe.
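
For anyone trying to reproduce this, a rough sketch of the two configurations described above, assuming the emojivoto demo and its vote-bot deployment (which reads WEB_HOST); the public IP is a placeholder:

# routing to the ingress by its internal Service name: 500s and high CPU
kubectl -n emojivoto set env deploy/vote-bot WEB_HOST=nginx-ingress-controller.emojivoto:80

# routing via the public IP and the Ingress resource: no errors, normal CPU
kubectl -n emojivoto set env deploy/vote-bot WEB_HOST=<public-ip>:80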

@ericsuhong
Author

ericsuhong commented May 5, 2020

@cpretzer No, we do not have any services calling the nginx ingress by its internal service name (all traffic comes from external clients using the DNS name). We also do not see any 500/503 errors in the logs (in fact, we do not see any traffic hitting the controller at all during this time period).

We are also upgrading the nginx ingress controller to 0.30.0 as you suggested. We will be running in PROD with all debug logs enabled so that the next time this reproduces, we can capture debug logs for investigation.

@cpretzer
Contributor

cpretzer commented May 5, 2020

thanks for the update @ericsuhong

Can you share your Ingress resource definition? I meant to ask you for that sooner

@ericsuhong
Author

Here it is:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: main
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
      grpc_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
    nginx.ingress.kubernetes.io/cors-allow-headers: authorization,cache-control,x-ms-client-request-id,x-ms-client-session-id,x-ms-command-name,x-ms-effective-locale,request-id,correlation-context,traceparent,tracestate,content-type
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-hide-headers: server
  name: myfoo-ingress
  namespace: mynamespace
spec:
  rules:
  - host: myfooservice.com
    http:
      paths:
      - backend:
          serviceName: myfoo-service
          servicePort: 80
        path: /
  tls:
  - hosts:
    - myfooservice.com

@cpretzer
Contributor

@ericsuhong just following up on this. Were you able to reproduce the behavior with an updated version of the nginx ingress controller?

After running my test environment for about five days, I wasn't able to reproduce the behavior you described. I'm going to give it one more try, though.

The one thing that we didn't get from you is the nginx access logs. Were you able to collect those?

@ericsuhong
Author

ericsuhong commented May 12, 2020

We had to revert the nginx ingress controller back to 0.26.2 because we found a regression related to TLS cipher suites.

We are currently running with edge-20.5.1, and have not faced this issue yet.

Regarding nginx access logs, there were literally no access logs during this time period when linkerd-proxy started to consume 100% CPU.

@cpretzer
Contributor

@ericsuhong that's good news about the edge release.

I looked through the metrics files that you sent and focused on the DNS requests, because the warning message in the original post mentioned trust_dns_proto. Nothing in the logs stood out, though.

Did you ever get a trace log from the Linkerd proxy when this happened?

@aledbf

aledbf commented May 13, 2020

We had to revert the nginx ingress controller back to 0.26.2 because we found a regression related to TLS cipher suites.

What regression? Can you post more details?

@ericsuhong
Author

The default TLS cipher suites used by the ingress controller changed, and our internal tool was flagging us for using "weak" TLS cipher suites.

Later, we found out that the cipher suites can be overridden using the ssl-ciphers configuration value, so this is actually not blocking us from upgrading the ingress controllers anymore. But we just decided to stay with 0.26.2 for now.
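
For reference, a rough sketch of overriding the cipher suites through the ingress-nginx ConfigMap's ssl-ciphers key (the namespace, ConfigMap name, and cipher list below are placeholders):

# Replace the controller's default TLS cipher suites
kubectl -n ingress-nginx patch configmap nginx-configuration \
  --type merge -p '{"data":{"ssl-ciphers":"ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384"}}'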

@ericsuhong
Author

@cpretzer unfortunately, all the logs are gone. Also, trace logs are super expensive, and we are not going to enable them in our PROD environment.

However, we are running linkerd in PROD at the debug log level, so the next time we face this issue, we will post the debug log messages here.
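
For reference, one way to set the proxy to debug is the config.linkerd.io/proxy-log-level annotation on the pod template (a rough sketch; namespace and deployment names are placeholders, and the patch triggers a rollout so the re-injected proxies come up at the new level):

# Set the Linkerd proxy log level via the injector annotation
kubectl -n ingress-nginx patch deploy nginx-ingress-controller --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"debug"}}}}}'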

@aledbf

aledbf commented May 14, 2020

The default TLS cipher suites used by the ingress controller changed, and our internal tool was flagging us for using "weak" TLS cipher suites.

This is fixed in 0.32.0 kubernetes/ingress-nginx#5490

But we just decided to stay with 0.26.2 for now.

Actually, you should update to 0.32.0. Between these two releases there are several fixes related to high CPU utilization, as well as NGINX fixes related to HTTP/2 and SSL.

@ericsuhong
Author

ericsuhong commented May 14, 2020

We will certainly do so!

We just had so many variables in play during our last PROD deployments that we decided to lock down the ingress controller version to eliminate one of them.

@cpretzer
Contributor

@ericsuhong totally understand about the logging. If this happens again with the latest edge, please do collect whatever logs you can so that I can have a look.

@aledbf Thanks for sharing that!

@stale

stale bot commented Aug 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 16, 2020
@stale stale bot closed this as completed Aug 30, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021