Linkerd-proxy occasionally consumes 100% CPU for Nginx ingress controllers #4329
Comments
Thanks for sharing this, @ericsuhong. I'm running a test now to see if I can reproduce similar results. Can you run the commands below to get the metrics from the Linkerd proxy exhibiting this behavior? `kubectl port-forward po <pod-name> linkerd-proxy 4191`, then `curl http://localhost:4191/metrics`
Just grabbed the result from the affected linkerd-proxy container:
Thanks @ericsuhong, there's a good amount of info here, so I'll sift through this and see what I can find. Is it possible to get the access log from the nginx ingress controller at debug log level? That will help show where the requests are coming from and going. If there's sensitive info in there, you can email it or DM it to me on Slack.
@cpretzer Just to confirm, you want debug-level logs from the linkerd-proxy containers, right? (Not the ingress controller logs.)
I'm glad you clarified: I'm definitely looking for the logs from the nginx ingress controller, not the Linkerd proxy.
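For anyone following along: the nginx ingress controller's error-log verbosity can be raised through its ConfigMap. A minimal sketch, assuming a ConfigMap named `nginx-configuration` in the `ingress-nginx` namespace (the name and namespace vary by install, so adjust to match yours):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration   # assumed name; match your installation
  namespace: ingress-nginx    # assumed namespace
data:
  # ingress-nginx ConfigMap key controlling the nginx error_log level
  error-log-level: "debug"
```

The controller picks up ConfigMap changes without a restart, but debug logging is verbose, so it's best enabled only while reproducing.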
I looked at the (non-debug) nginx-ingress logs and did not find any requests hitting the ingress controller during the period when linkerd-proxy started to spike. We will enable debug logs for both linkerd-proxy and the ingress controllers, and will provide results when this issue reproduces again. To make your debugging easier, here are metrics from both a normal and a bad linkerd-proxy on the same node, taken from the ingress controllers:
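One quick way to compare two such metrics dumps is to parse the Prometheus text format and diff the values between the healthy and the misbehaving proxy. A minimal sketch; the sample inputs below are illustrative, not taken from the actual dumps:

```python
# Sketch: diff metric values between two Prometheus-text-format dumps,
# e.g. the output of `curl http://localhost:4191/metrics` from each proxy.

def parse_metrics(text):
    """Map each metric line (name + labels) to its numeric value."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        key, _, value = line.rpartition(" ")  # value is the last field
        try:
            metrics[key] = float(value)
        except ValueError:
            pass  # ignore lines that don't end in a number
    return metrics

def diff_metrics(good_text, bad_text, min_delta=1.0):
    """Return metrics present in both dumps whose values differ noticeably."""
    good, bad = parse_metrics(good_text), parse_metrics(bad_text)
    return {
        key: bad[key] - good[key]
        for key in good.keys() & bad.keys()
        if abs(bad[key] - good[key]) >= min_delta
    }

# Illustrative sample data (metric names are placeholders):
good = 'request_total{dst="a"} 10\nprocess_cpu_seconds_total 2\n'
bad = 'request_total{dst="a"} 12\nprocess_cpu_seconds_total 950\n'
print(diff_metrics(good, bad))
```

Sorting the result by magnitude makes it easy to spot which counters (CPU seconds, DNS lookups, retries) diverge on the bad proxy.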
@ericsuhong I ran a quick test using nginx-ingress-controller version 0.26.2 and saw some similar behavior. Can you try upgrading the ingress controller to the latest version? I used 0.30.0.
@ericsuhong In my continued testing, I found that I can trigger 503 errors and increased CPU usage by configuring a load generator to use the service name of the ingress controller. Do you have any services in your cluster that call the nginx ingress by its internal service name?

Specifically, I'm using the emojivoto application in my tests, and the load-generating service, vote-bot, has an environment variable that configures the host and port to which requests are sent. When I configure that variable with the ingress controller's internal service name, I see the 503 errors and increased CPU usage; when I configure it with an external hostname, I don't.

I think this has to do with headers not being passed properly, which causes the "tight loop" behavior that you describe.
@cpretzer No, we do not have any services calling the nginx ingress by its internal service name (all traffic comes from outside via DNS). We also do not see any 500/503 errors in the logs (in fact, we see no traffic hitting at all during this time period).

We are also upgrading the nginx ingress controller to 0.30.0 as you suggested, and will be running in PROD with all debug logs enabled so that the next time it reproduces, we can capture debug logs for investigation.
Thanks for the update @ericsuhong. Can you share your
Here it is:
@ericsuhong just following up on this. Were you able to reproduce the behavior with an updated version of the nginx ingress controller? After running my test environment for about five days, I wasn't able to reproduce the behavior you described, but I'm going to give it one more try. The one thing we didn't get from you is the nginx access log. Were you able to collect those?
We had to revert the nginx ingress controller back to 0.26.2 because we found a regression related to TLS cipher suites. We are currently running with edge-20.5.1 and have not hit this issue yet. Regarding the nginx access logs: there were literally no access logs during the period when linkerd-proxy started to consume 100% CPU.
@ericsuhong that's good news about the edge release. I looked through the metrics files that you sent and focused on the DNS requests, because the warning message in the original post mentioned DNS. Did you ever get a trace log from the Linkerd proxy when this happened?
What regression? Can you post more details? |
The default TLS cipher suites used by the ingress controller changed, and our internal tool flagged us for using "weak" cipher suites. We later found that the suites can be overridden via the ssl-ciphers configuration value, so this is no longer actually blocking us from upgrading the ingress controllers; we just decided to stay on 0.26.2 for now.
@cpretzer Unfortunately, all the logs are gone. Also, trace logs are very expensive, and we are not going to enable them in our PROD environment. However, we are running Linkerd in PROD at debug log level, so the next time we hit this issue, we will post the debug log messages here.
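For reference, the proxy log level mentioned above can be set per workload with a pod-template annotation, so only the ingress controller pods pay the logging cost. A sketch; the surrounding Deployment structure is a placeholder:

```yaml
spec:
  template:
    metadata:
      annotations:
        # Sets the log level of the injected linkerd-proxy sidecar.
        # "debug" matches what is described above; "trace" is far more verbose.
        config.linkerd.io/proxy-log-level: "debug"
```

The annotation takes effect when the pods are re-created (e.g. on the next rollout), since the proxy reads it at injection time.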
This is fixed in 0.32.0: kubernetes/ingress-nginx#5490
Actually, you should update to 0.32.0. Between those two releases there are several fixes related to high CPU utilization, as well as NGINX fixes related to HTTP/2 and SSL.
We will certainly do so! We just had so many variables in play during our last PROD deployment that we decided to pin the ingress controller version to eliminate one of them.
@ericsuhong Totally understand about the logging. If this happens again with the latest edge, please do collect whatever logs you can so that I can have a look. @aledbf Thanks for sharing that!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Bug Report
What is the issue?
We have noticed that linkerd-proxy containers occasionally get stuck in a "tight loop", consuming their maximum CPU allocation (1000 millicores) in front of the Nginx ingress controllers.
This does not appear to be load-related, as it has been reproduced in test clusters with only an insignificant volume of traffic hitting the Nginx ingress controllers.
How can it be reproduced?
Logs, error output, etc
I did not see any meaningful errors or requests around the time when linkerd2-proxy started to consume high CPU.
However, afterwards I see the following error messages:
linkerd check output
Environment
Possible solution
Additional context
Same issue as #3785