Massive lock waits and memory increase when metrics are enabled #8285
Comments
@Lyt99: This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
From all the data you have provided, it seems like you did a perf test by loading traffic and you got a memory spike. Just to state the obvious, it seems logical and expected for memory usage to increase when you test with heavy load.

/remove-kind bug
Yes, it's expected that memory usage increases under heavy load. But it does not seem expected that enabling metrics causes such huge memory consumption, which may cause OOM. I think we can consider this a performance issue that can be solved. IIUC, the root cause of this issue is that prometheus's `Summary` takes a mutex on every `Observe` (see the analysis in the issue description). I can do another test to verify this.

BTW, I searched other issues, and #8228 seems relevant to this problem.
I have not looked at the details yet, but if what you say is true, then this is a problem out of scope for the ingress-nginx project, I think. When there is heavy load and all CPU/memory is used up in scraping metrics and logging, it makes sense that recalculation or aggregation work will starve for resources and very likely OOM. I don't know if the metrics-exposing code is out of the box from prometheus or custom-built for the ingress controller by its developers. Are you thinking of contributing and submitting a PR? If not, then please provide any other details you have here, and let's hope someone has time to work on it.
I have some ideas and will try them in the next few days. If they work, I will submit a PR.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
I'm a bit concerned that PR #8726 would make the problem more serious, as it's a `Summary` metric.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
Please check #8728
It's great to see that the metric has been deprecated. But the deprecated metric is still present. Is there any plan to remove this metric in a future version?
Sure, but it's a breaking change for monitoring tools. If we are ready to do so, I can update the PR to remove these metrics in a single step.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
NGINX Ingress controller version (exec into the pod and run `nginx-ingress-controller --version`):
Kubernetes version (use `kubectl version`): v1.22.3-aliyun.1
Environment:
- Cloud provider: Alibaba Cloud
- OS: Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
- Kernel (`uname -a`): Linux iZbp1c41y5oz5mltc5ed5sZ 4.19.91-24.1.al7.x86_64 #1 SMP Wed Jul 21 17:40:23 CST 2021 x86_64 x86_64 x86_64 GNU/Linux
- Instance type: ecs.c7.8xlarge (32c128g)
What happened:
We performed a benchmark using Alibaba Cloud's PTS (Performance Testing Service); the test volume was around 500K RPS (120K on each pod). We observed a remarkable memory increase, up to 32GiB, which caused OOMs. When we finished the benchmark, memory returned to a normal level after a while.
We executed `top` in the pod, and it showed that `nginx-ingress-controller` had the largest memory consumption, while `nginx` seemed to work properly.

We got the `pprof/heap` result (svg here); it shows that `k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage` holds the memory when unmarshalling JSON bytes into structs (https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/metric/collectors/socket.go#L240).
Then we checked the `pprof/goroutines` result (svg here). Most goroutines are in `github.com/prometheus/client_golang/prometheus.(*summary).Observe`, waiting for a `Mutex` to be released. We checked the code for the prometheus `Summary`, and it does take a `Mutex` when doing an `Observe`, because of quantile recalculation. The `Summary` is only used by `upstreamLatency`.
https://github.com/prometheus/client_golang/blob/main/prometheus/summary.go#L284
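For illustration, here is a minimal, hypothetical Go sketch of the two metric kinds in `client_golang`; the metric names, labels, and objectives below are made up for the example and are not the controller's actual definitions. With objectives configured, `Summary.Observe` serializes callers on a mutex while feeding the quantile streams, whereas `Histogram.Observe` only does an atomic bucket increment, so concurrent observations do not queue on a lock.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Summary with objectives: every Observe() takes a mutex to feed the
	// quantile streams, which serializes concurrent observations.
	upstreamLatencySummary = prometheus.NewSummaryVec(
		prometheus.SummaryOpts{
			Name:       "demo_upstream_latency_seconds",
			Help:       "Upstream latency recorded as a Summary (illustrative).",
			Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		},
		[]string{"ingress", "service"},
	)

	// Histogram: Observe() only atomically increments a bucket counter,
	// so concurrent observations do not contend on a lock.
	upstreamLatencyHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "demo_upstream_latency_seconds_hist",
			Help:    "Upstream latency recorded as a Histogram (illustrative).",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"ingress", "service"},
	)
)

func main() {
	prometheus.MustRegister(upstreamLatencySummary, upstreamLatencyHistogram)

	// Under load, many goroutines call Observe concurrently;
	// the Summary serializes them, the Histogram does not.
	upstreamLatencySummary.WithLabelValues("demo-ingress", "demo-svc").Observe(0.042)
	upstreamLatencyHistogram.WithLabelValues("demo-ingress", "demo-svc").Observe(0.042)
}
```

The trade-off is that a `Histogram` exposes fixed buckets instead of precomputed quantiles, so dashboards would switch to `histogram_quantile()` over the bucket series.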
This quantile recalculation can be a time-consuming procedure, and it blocks the following batches of data from being observed. In addition, `handleMessage` unmarshals all `socketData` into memory, which causes more memory consumption and makes the lock-wait problem heavier.

How to reproduce it:
Run a heavy-load benchmark against the ingress controller.
Anything else we need to know:
Possible solutions:
- Use a `Histogram` instead of a `Summary` for `upstreamLatency`
- Reduce memory usage in `handleMessage` (maybe use an iterator to avoid unmarshalling all data into a struct slice? see the sketch below)
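As a rough sketch of the second bullet, assuming the socket batch arrives as a JSON array of per-request records (the `socketData` fields below are illustrative, not the controller's actual struct), a streaming `json.Decoder` could observe one record at a time instead of unmarshalling the whole batch into a slice:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
)

// socketData is a trimmed-down, illustrative stand-in for the per-request
// stats record; the real struct has more fields.
type socketData struct {
	Host    string  `json:"host"`
	Status  string  `json:"status"`
	Latency float64 `json:"upstreamLatency"`
}

// handleMessageStreaming decodes a JSON array of stats one element at a
// time, so only a single record is held in memory while it is observed.
func handleMessageStreaming(msg []byte, observe func(socketData)) error {
	dec := json.NewDecoder(bytes.NewReader(msg))

	// Consume the opening '[' of the array.
	if _, err := dec.Token(); err != nil {
		return fmt.Errorf("reading array start: %w", err)
	}

	for dec.More() {
		var stats socketData
		if err := dec.Decode(&stats); err != nil {
			return fmt.Errorf("decoding stats record: %w", err)
		}
		observe(stats) // e.g. feed the prometheus collectors
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		return fmt.Errorf("reading array end: %w", err)
	}
	return nil
}

func main() {
	msg := []byte(`[{"host":"a","status":"200","upstreamLatency":0.01},
	               {"host":"b","status":"500","upstreamLatency":0.20}]`)
	err := handleMessageStreaming(msg, func(s socketData) {
		log.Printf("observed %s %s %.3fs", s.Host, s.Status, s.Latency)
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

This keeps peak memory proportional to one record rather than the whole batch, at the cost of slightly more decoding code.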