Nginx OOM #3314
@yvespp first, please update to the latest version, 0.20.0. Also, using the setting
@aledbf we're currently using 0.20.0; the graphs and logs are from that version.
@yvespp can you increase the number of Nginx workers (i.e. set it to 12) and try to reproduce this again? I wonder if you're hitting LuaJIT's memory limit. Can you consistently reproduce this in your big cluster? One potentially related change since 0.17.0 is https://github.com/kubernetes/ingress-nginx/pull/2804/files#diff-cde3fffe2425ad7efaa8add1d05ae2c0R744, where the payload size was increased.
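As a point of reference, the worker count suggested above ends up in the rendered nginx.conf as the standard directive below; with ingress-nginx it is normally driven by the controller ConfigMap rather than edited by hand (the ConfigMap key name, worker-processes, is stated here from memory, not from this thread).

```nginx
# What the suggestion above amounts to in the rendered nginx.conf;
# in ingress-nginx this value is normally controlled via the controller
# ConfigMap (key assumed to be worker-processes) rather than set directly.
worker_processes 12;
```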
You can use a server-snippet to configure something like
to see memory usage.
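A plausible reconstruction of that kind of server-snippet, assuming the /gc path referenced later in this thread and Lua's standard collectgarbage("count") (which reports the Lua VM's memory in kilobytes), would look roughly like this; it is a sketch, not necessarily the exact snippet posted here.

```nginx
# Sketch of a server-snippet body exposing Lua memory usage.
# The /gc path follows the later comments in this thread.
location /gc {
    content_by_lua_block {
        -- collectgarbage("count") returns the memory currently used by
        -- this worker's Lua VM, in kilobytes
        ngx.say("Lua GC count: ", collectgarbage("count"), " KB")
    }
}
```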
@yvespp how much of this are you requesting for the ingress-nginx pods? In the provided logs I don't see an OOMKill; are you seeing this in the kernel logs?
Here is the resources config for the nginx-ingress-controller container:
I see the OOMs in the Pod status and the event log of Kubernetes. Status (edited):
Memory started to increase from 12:05 on. Here is a more complete log of a crash:
@yvespp I suggest you remove the limits and add the /gc location mentioned by @ElvinEfendi
Also, every time you see
We had it running with Nginx workers not set and it crashed as well. Will add the server snippet and increase the memory limit.
If the change is related to pods being created or removed, a dynamic reload is enough. If there's a change in the configuration, a full reload is required. To see the reason for the reload you can add the flag
No, as I mentioned, when a reload is triggered nginx creates new worker processes, keeping the old ones alive until
Most reloads in our env seem to be because the ingress doesn't have active pods anymore:
This is a dev cluster and many developers are deploying stuff that is not configured correctly or not stable yet, so this is expected.
We now have one pod running with dynamic on and all others with it off. After running fine for 4 days, almost all non-dynamic controllers started to increase their memory usage at the same time:
In a container of a non-dynamic controller I can see that the memory is used by the nginx workers:
In the logs I see nothing special... The reloads only take a few seconds and can't explain the increased memory usage. One non-dynamic controller (purple in the graph) temporarily uses more memory than the other containers. In summary: the memory explosion also happens without the dynamic config reload, but in a different way and later. We also had no external health check failures with the non-dynamic controllers. @aledbf can you please reopen the issue? Thanks!
@yvespp please use
I'm now running with tag
I also see diffs. I'm leaving it running for now to see what the memory does.
Here is the memory graph of
The cause was probably that it was killed by Kubernetes because the liveness probe failed but the container did not handle that correctly: https://serverfault.com/questions/695849/services-remain-in-failed-state-after-stopped-with-systemctl
Also, the external health check failed twice (red dotted line in the graph), once before the crash from above. I had to roll back to version 0.20.0 with dynamic reload off, because some of our ingresses still use the annotation
@yvespp that annotation was deprecated in 0.18.0 and the code was removed after the release of 0.20.0 in #3203. The replacement is called
This annotation is "generic" and allows us to indicate more than just a secure backend.
@yvespp you should update your annotations
@yvespp I had similar issues, so I took the time and rewrote the deployment for myself.
and at startup I added:
I did not test which of those extra options solved my issues; I have a suspicion about
P.S. I had memory leaks even when running 0.16.2 (some long-running pods took over 30 GB of memory :) )
After about 5 minutes of constant load (~9000 req/sec), 0.20.0 starts leaking memory at a rate of about 1-1.5 GB/hour. After a day of such load the nginx-ingress-controller process consumes all remaining RAM and gets killed and restarted by Kubernetes. My Nginx ingress, upstream server and load runner are all 4 CPU, 32 GB machines.
If you are happy to sacrifice the metrics, don't forget to raise
@AlexPereverzyev what warnings are you seeing from Lua?
@ElvinEfendi, I'm seeing lots of
@AlexPereverzyev can you post your Nginx configuration? I'm specifically interested in whether you have
in the config. Also, are you seeing some metrics, or no metrics at all?
@ElvinEfendi, please find the config files attached: configs.zip

The Lua block is not in the config, but it seems that's intended: {{ if $all.DynamicConfigurationEnabled }}

I can't see ingress metrics coming from

Adding the following to the nginx template works and I can see ingress metrics again (it also brings back the memory issue): init_worker_by_lua_block {

Interestingly, the ingress controller is able to release used memory if the load is turned off (at least to some degree):
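As a rough illustration (not the snippet from the attached configs), an init_worker_by_lua_block in the ingress-nginx template of that era looked roughly like the block below, where monitor.init_worker() is what starts the per-worker metrics flush timer; exact contents vary between controller versions.

```nginx
# Rough illustration only; not the block from configs.zip. In the template
# this sits inside the guard mentioned above:
# {{ if $all.DynamicConfigurationEnabled }} ... {{ end }}
init_worker_by_lua_block {
    balancer.init_worker()
    -- monitor.init_worker() starts the timer that periodically flushes
    -- batched request metrics from each Nginx worker to the controller
    monitor.init_worker()
}
```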
Do you still see
Does the memory keep growing unboundedly? It's expected for the memory to grow under load (not unboundedly though) because we batch the metrics in Lua per Nginx worker before flushing.
This aligns with what I said above.
Are you using a custom configuration template? If not, then it seems like a nasty bug introduced in the version you're using - sorry for that.
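To make the batching behaviour described above concrete, here is a minimal Lua sketch of the general pattern (accumulate per worker, flush on a timer); the function names and the one-second interval are illustrative, not taken from the actual ingress-nginx monitor module.

```lua
-- Minimal sketch of per-worker metric batching; illustrative names only,
-- not the actual ingress-nginx monitor module.
local cjson = require("cjson")

local batch = {}

local function observe(metric)
  -- called per request: memory use grows with traffic until the next flush
  batch[#batch + 1] = metric
end

local function flush(premature)
  if premature or #batch == 0 then
    return
  end
  local payload = cjson.encode(batch)
  batch = {}  -- drop the reference so the Lua GC can reclaim the old batch
  -- send `payload` to the controller process, e.g. over a unix socket
end

local function init_worker()
  -- flush the accumulated batch from every Nginx worker once per second
  ngx.timer.every(1, flush)
end

return { observe = observe, init_worker = init_worker }
```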
@ElvinEfendi, it turns out the ingress controller is sensitive to the volume of data sent from the Nginx workers to

Probably better hardware or less load can mitigate the issue, but here are the

local function metrics()

I've also switched from 8 to 4 workers, but it doesn't really make any difference except a flat reduction in memory consumption.
@AlexPereverzyev thanks for debugging this further!
With this, are you seeing constant (unbounded) memory growth, or does it grow until some point and stay still? I wonder if allocating more CPUs would help with this - the theory is that the controller probably cannot process the data sent fast enough and therefore it gets queued at the network level. It would be interesting to look at queuing on
@ElvinEfendi, the ingress-nginx-controller constantly consumes more and more memory until it gets killed and restarted. Though the explanation and fix are already available - #3436.
What's the behavior with #3436?
@AlexPereverzyev can you share information about the traffic? (number of ingresses, hostnames, number of replicas of the exposed services and RPS being handled by the ingress controller)
@aledbf, I can test the fix if you can provide an image. The environment description is below:

Hardware (same for all nodes): Intel Core i7-3770 4-core CPU

Ingress: Single node; the ingress controller YAML can be found here: configs.zip, but you should set
Has up to 10 ingresses, but only one is used for testing.

Upstream: Single node, running 8 containers of a REST service (web-api) with latency ~25ms or less.

Load Runner: Single node, JMeter 5 test script: 100 threads calling the REST service endpoint with a JSON payload, in a loop. The resulting ingress throughput is ~9000 RPS, with ~2.5MB incoming and ~6MB outgoing traffic.

There is also a separate node to run Prometheus and Grafana.
@yvespp please update to 0.21.0
Hey folks,

Sorry to bother on a closed issue, but I'm having this kind of trouble right now with 0.21.0 (didn't try 0.22.0 because of #3788). I'll post some graphics that display 'RAM memory available' on hosts.

Every once in a while we have a 'reload peak'. After every peak, the nginx process increases its size and never goes back. It obviously consumes massive memory because of unfinished connections, but that would be fine - if the nginx process didn't enlarge! This is a sample of one of our servers:

I have a VM dedicated to testing the Nginx Ingress Controller that receives no connections at all. It only stays up, syncing the nginx configuration all day long, without receiving a single client connection. Both machines have 40GB of RAM. (Ignore the sudden peak at the end; it was a manual restart I applied.) Since the machine is not used by anyone, you'll see that there are no massive peaks of memory use; but the nginx processes keep getting larger and larger as well, which shows it has nothing to do with other possible kinds of load. I'd guess it's some leak related to Lua? A freshly started nginx machine has its worker processes at around 600MB of RAM each:
After a week, they are massively stuffed (8+GB):
At these levels, we run dangerously close to this happening:
When this happens, since we run outside Kubernetes, the ingress-controller units are kept alive, all processes are alive, the port is up, but it turns into a black hole (sucking everything in, but never answering). This situation is irrecoverable on its own; something has to restart the service (like the Kubernetes liveness check would do).
@mrrandrade how many nginx workers are configured (or running)?
@mrrandrade Now that #3788 is closed, please check again, and if the issue persists open a new issue
Marking here because we also hit the same issue. When nginx reloads, the memory usage peaks at about 56 GB.
Please update to 0.32.0. You are using a version released on Feb 27, 2019. There are a lot of fixes related to reloads and multiple NGINX updates.
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): yes
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): memory, oom
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
NGINX Ingress controller version: 0.20.0 and 0.19.0
Kubernetes version (use kubectl version): v1.11.4
Environment:
Kernel (uname -a): 3.10.0
What happened: The memory of Nginx increases slowly and then suddenly increases rapidly until the Pod gets OOM killed by Kubernetes. In the Pod I can see that the memory is used by nginx itself.
Here is the memory graph of one Pod over the last 24h:
All pods last 24h:
In the log of one of the crashed Pods I see this:
With version 0.17.1 in the same cluster and dynamic config we had no such problems. 0.19.0 showed the same OOM behavior. I didn't try 0.18.
What you expected to happen: No OOMs
How to reproduce it (as minimally and precisely as possible): I don't know what causes it; it doesn't happen on other, smaller clusters in our environment. On the affected cluster we have 1121 Ingresses, 1384 Services and 1808 Pods.
Anything else we need to know:
Nginx Config Map:
Controller flags:
How can I debug this? Can I somehow see how much memory the lua module uses?