
Massive lock waits and increasing memory when metrics are enabled #8285

Closed
Lyt99 opened this issue Mar 2, 2022 · 13 comments
Labels

  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-kind Indicates a PR lacks a `kind/foo` label and requires one.
  • needs-priority
  • needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@Lyt99
Contributor

Lyt99 commented Mar 2, 2022

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

Kubernetes version (use kubectl version):

v1.22.3-aliyun.1

Environment:

  • Cloud provider or hardware configuration: Alibaba Cloud
  • OS (e.g. from /etc/os-release): Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
  • Kernel (e.g. uname -a): Linux iZbp1c41y5oz5mltc5ed5sZ 4.19.91-24.1.al7.x86_64 #1 SMP Wed Jul 21 17:40:23 CST 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Node specs: ecs.c7.8xlarge (32c128g)
  • Controller replicas: 4
  • Controller limits: 16core 32g

What happened:

We performed a benchmark using Alibaba Cloud's PTS (Performance Testing Service); the test volume was around 500K RPS (about 120K on each pod). We observed a remarkable memory increase, up to 32 GiB, which caused OOMs. When the benchmark finished, memory returned to a normal level after a while.


We executed top in the pod, and it showed that nginx-ingress-controller had the largest memory consumption, while nginx itself seemed to work properly.

We got the pprof/heap result svg here:

      flat  flat%   sum%        cum   cum%
    8.33GB 65.86% 65.86%     8.33GB 65.86%  reflect.unsafe_NewArray
    3.37GB 26.64% 92.50%     4.07GB 32.16%  github.com/json-iterator/go.(*Iterator).ReadString
    0.70GB  5.52% 98.02%     0.70GB  5.52%  github.com/json-iterator/go.(*Iterator).readEscapedChar
    0.19GB  1.53% 99.55%     0.19GB  1.53%  io.ReadAll
    0.02GB  0.18% 99.73%    12.43GB 98.20%  k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage
         0     0% 99.73%    12.40GB 98.02%  github.com/json-iterator/go.(*Iterator).ReadVal
         0     0% 99.73%     0.70GB  5.52%  github.com/json-iterator/go.(*Iterator).readStringSlowPath
         0     0% 99.73%    12.40GB 98.02%  github.com/json-iterator/go.(*frozenConfig).Unmarshal
         0     0% 99.73%     4.07GB 32.16%  github.com/json-iterator/go.(*generalStructDecoder).Decode
         0     0% 99.73%     4.07GB 32.16%  github.com/json-iterator/go.(*generalStructDecoder).decodeOneField

It shows that k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage holds most of the memory while unmarshalling JSON bytes into structs. (https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/metric/collectors/socket.go#L240)
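
For context, the code path roughly follows this pattern (a trimmed-down sketch with illustrative type, field, and label names, not the actual socket.go source): the whole payload from the NGINX Lua socket is unmarshalled into a slice of structs in one call, and the batch is then iterated to update the metrics. That single Unmarshal call is where the reflect.unsafe_NewArray and (*Iterator).ReadString allocations above come from.

```go
package collectors

import (
	jsoniter "github.com/json-iterator/go"
	"github.com/prometheus/client_golang/prometheus"
)

// Trimmed-down stand-ins for the real types; field and label names
// here are illustrative, not copied from socket.go.
type socketData struct {
	Host            string  `json:"host"`
	UpstreamLatency float64 `json:"upstreamLatency"`
}

type SocketCollector struct {
	upstreamLatency *prometheus.SummaryVec
}

func (sc *SocketCollector) handleMessage(msg []byte) {
	// A single Unmarshal call materializes the entire batch: the backing
	// array for statsBatch plus every decoded string stays live until the
	// loop below finishes.
	var statsBatch []socketData
	if err := jsoniter.ConfigCompatibleWithStandardLibrary.Unmarshal(msg, &statsBatch); err != nil {
		return
	}
	for _, stats := range statsBatch {
		// Summary.Observe takes the metric's mutex, so contention here
		// slows down the loop and keeps the decoded batch in memory longer.
		sc.upstreamLatency.WithLabelValues(stats.Host).Observe(stats.UpstreamLatency)
	}
}
```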

Then we checked the pprof/goroutines result svg here:

Showing nodes accounting for 5018, 99.80% of 5028 total
Dropped 126 nodes (cum <= 25)
Showing top 10 nodes out of 11
      flat  flat%   sum%        cum   cum%
      5018 99.80% 99.80%       5018 99.80%  runtime.gopark
         0     0% 99.80%       4933 98.11%  github.com/prometheus/client_golang/prometheus.(*summary).Observe
         0     0% 99.80%       4933 98.11%  k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage
         0     0% 99.80%       4933 98.11%  k8s.io/ingress-nginx/internal/ingress/metric/collectors.handleMessages
         0     0% 99.80%         29  0.58%  runtime.chanrecv
         0     0% 99.80%       4946 98.37%  runtime.goparkunlock (inline)
         0     0% 99.80%         35   0.7%  runtime.selectgo
         0     0% 99.80%       4933 98.11%  runtime.semacquire1
         0     0% 99.80%       4933 98.11%  sync.(*Mutex).Lock (inline)
         0     0% 99.80%       4933 98.11%  sync.(*Mutex).lockSlow

Most goroutines are in github.com/prometheus/client_golang/prometheus.(*summary).Observe, waiting for a Mutex to be released. We checked the code for the Prometheus Summary, and it does hold a Mutex while doing an Observe, because of quantile recalculation. The Summary is only used by upstreamLatency.

https://github.com/prometheus/client_golang/blob/main/prometheus/summary.go#L284

This can be a time-consuming procedure and blocks the following batches of data from being observed. Meanwhile, handleMessage unmarshals all socketData into memory, which increases memory consumption and makes the lock-wait problem worse.
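
For illustration, here is a minimal sketch of the difference using the client_golang API (the metric names, objectives, and buckets below are placeholders, not the controller's actual configuration): a Summary with objectives funnels every Observe through a mutex-protected quantile stream, while a Histogram's Observe only does an atomic add on a bucket counter, so concurrent observations do not serialize.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Placeholder metric definitions for illustration only; the controller's
// real metric names, labels, and buckets differ.
var (
	// Summary with quantile objectives: each Observe acquires the
	// summary's internal mutex to feed the quantile streams, so
	// highly concurrent callers queue up behind one another.
	latencySummary = prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "upstream_latency_seconds_summary",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})

	// Histogram: Observe just picks a bucket and does atomic adds,
	// so it stays cheap under heavy concurrent load.
	latencyHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "upstream_latency_seconds",
		Buckets: prometheus.DefBuckets,
	})
)

func main() {
	prometheus.MustRegister(latencySummary, latencyHistogram)

	latencySummary.Observe(0.042)   // serialized behind a mutex
	latencyHistogram.Observe(0.042) // atomic bucket increment
}
```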

How to reproduce it:

Run a heavy-load benchmark against the ingress controller with metrics enabled.

Anything else we need to know:

Possible solutions:

  • Use a lock-free Histogram instead of a Summary for upstreamLatency (see the Summary vs. Histogram sketch above)
  • In the meantime, try to reduce memory consumption in handleMessage (perhaps use an iterator to avoid unmarshalling all data into a struct slice at once; see the sketch after this list)
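
For the second item, a rough sketch of what an iterator-based decode could look like with json-iterator (reusing the placeholder socketData type from the sketch above; this is not the actual socket.go code): the payload is walked element by element, so only one decoded entry is live at a time instead of the whole slice.

```go
// Sketch of decoding the socket payload element by element with
// json-iterator instead of unmarshalling the whole []socketData at once.
// Type and field names are the illustrative placeholders from above.
func handleMessageStreaming(msg []byte) {
	cfg := jsoniter.ConfigCompatibleWithStandardLibrary
	iter := cfg.BorrowIterator(msg)
	defer cfg.ReturnIterator(iter)

	for iter.ReadArray() {
		var stats socketData
		iter.ReadVal(&stats)
		if iter.Error != nil {
			break
		}
		// Update metrics for this single entry, then let it be collected
		// before the next element is decoded.
		_ = stats
	}
}
```

Whether this meaningfully lowers peak memory would of course need to be confirmed with the same benchmark.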
@Lyt99 Lyt99 added the kind/bug Categorizes issue or PR as related to a bug. label Mar 2, 2022
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Mar 2, 2022
@k8s-ci-robot
Contributor

@Lyt99: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

From all the data you have provided, it seems like you did a perf test by loading traffic and got a memory spike. Just to state the obvious, it seems logical and expected for memory usage to increase when you test with heavy load.

/remove-kind bug

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 9, 2022
@Lyt99
Contributor Author

Lyt99 commented Mar 9, 2022

From all the data you have provided, it seems like you did a perf test by loading traffic and got a memory spike. Just to state the obvious, it seems logical and expected for memory usage to increase when you test with heavy load.

/remove-kind bug

Yes, it's expected that memory usage increases under heavy load. But it does not seem expected that enabling metrics causes such huge memory consumption, which may cause OOMs.

I think we can consider this a performance issue that can be solved. IIUC, the root cause is that Prometheus's Summary metric needs to recalculate quantiles, which is expensive and not lock-free. This blocks other incoming metrics requests, and the memory they allocated cannot be released in time.

I can do another test with --enable-metrics=false later for comparison. It should show a large decrease in memory usage.

BTW, I searched for other issues, and #8228 seems relevant to this problem.

@longwuyuan
Contributor

I have not looked at the details yet, but if what you say is true, then this is a problem out of scope for the ingress-nginx project, I think. When there is heavy load and all CPU/memory is used up in scraping metrics and logging, it makes sense that recalculation or aggregation work will starve for resources and very likely OOM. I don't know whether the metrics-exposing code is out of the box from Prometheus or custom built for the ingress controller by its developers.

Are you thinking of contributing and submitting a PR? If not, please provide any other details you have here, and let's hope someone has time to work on it.

@Lyt99
Contributor Author

Lyt99 commented Mar 9, 2022

I have not looked at the details yet, but if what you say is true, then this is a problem out of scope for the ingress-nginx project, I think. When there is heavy load and all CPU/memory is used up in scraping metrics and logging, it makes sense that recalculation or aggregation work will starve for resources and very likely OOM. I don't know whether the metrics-exposing code is out of the box from Prometheus or custom built for the ingress controller by its developers.

Are you thinking of contributing and submitting a PR? If not, please provide any other details you have here, and let's hope someone has time to work on it.

I have some ideas and will try them over the next few days. If they work, I will submit a PR.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 7, 2022
@Lyt99
Contributor Author

Lyt99 commented Jun 23, 2022

I'm a bit concerned that PR #8726 would make the problem more serious, as it is a Summary metric.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 23, 2022
@nailgun
Contributor

nailgun commented Jul 26, 2022

Please check #8728.
It deprecates the summary metrics and adds histogram metrics.

@Lyt99
Contributor Author

Lyt99 commented Jul 26, 2022

Please check #8728. It deprecates the summary metrics and adds histogram metrics.

It's great to see that the metric has been deprecated.

But the deprecated upstream_latency metric will be preserved and still calculated here. It seems that the PR cannot fix this problem.

Is there any plan to remove this metric in a future version?

@nailgun
Contributor

nailgun commented Jul 26, 2022

Sure, but it's a breaking change for monitoring tools. If we are ready to do so, I can update the PR to remove these metrics in a single step.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
