
Massive lock waits and increasing memory when metrics are enabled #8285

Closed
Lyt99 opened this issue Mar 2, 2022 · 13 comments
Labels

  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-kind Indicates a PR lacks a `kind/foo` label and requires one.
  • needs-priority
  • needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@Lyt99
Contributor

Lyt99 commented Mar 2, 2022

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

Kubernetes version (use kubectl version):

v1.22.3-aliyun.1

Environment:

  • Cloud provider or hardware configuration: Alibaba Cloud
  • OS (e.g. from /etc/os-release): Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
  • Kernel (e.g. uname -a): Linux iZbp1c41y5oz5mltc5ed5sZ 4.19.91-24.1.al7.x86_64 #1 SMP Wed Jul 21 17:40:23 CST 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Node specs: ecs.c7.8xlarge (32c128g)
  • Controller replicas: 4
  • Controller limits: 16core 32g

What happened:

We performed a benchmark using Alibaba Cloud's PTS (Performance Testing Service); the test volume was around 500K RPS (about 120K on each pod). We observed a remarkable memory increase, up to 32 GiB, which caused OOMs. When the benchmark finished, memory returned to a normal level after a while.


We executed top in the pod, and it showed that nginx-ingress-controller had the largest memory consumption, while nginx itself seemed to work properly.

We got the pprof/heap result svg here:

      flat  flat%   sum%        cum   cum%
    8.33GB 65.86% 65.86%     8.33GB 65.86%  reflect.unsafe_NewArray
    3.37GB 26.64% 92.50%     4.07GB 32.16%  github.com/json-iterator/go.(*Iterator).ReadString
    0.70GB  5.52% 98.02%     0.70GB  5.52%  github.com/json-iterator/go.(*Iterator).readEscapedChar
    0.19GB  1.53% 99.55%     0.19GB  1.53%  io.ReadAll
    0.02GB  0.18% 99.73%    12.43GB 98.20%  k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage
         0     0% 99.73%    12.40GB 98.02%  github.com/json-iterator/go.(*Iterator).ReadVal
         0     0% 99.73%     0.70GB  5.52%  github.com/json-iterator/go.(*Iterator).readStringSlowPath
         0     0% 99.73%    12.40GB 98.02%  github.com/json-iterator/go.(*frozenConfig).Unmarshal
         0     0% 99.73%     4.07GB 32.16%  github.com/json-iterator/go.(*generalStructDecoder).Decode
         0     0% 99.73%     4.07GB 32.16%  github.com/json-iterator/go.(*generalStructDecoder).decodeOneField

It shows that k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage holds most of the memory while unmarshalling JSON bytes into structs. (https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/metric/collectors/socket.go#L240)
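
For context, the code path roughly follows this pattern (a trimmed-down sketch with illustrative type, field, and label names, not the actual socket.go source): the whole payload from the NGINX Lua socket is unmarshalled into a slice of structs in one call, and the batch is then iterated to update the metrics. That single Unmarshal call is where the reflect.unsafe_NewArray and (*Iterator).ReadString allocations above come from.

```go
package collectors

import (
	jsoniter "github.com/json-iterator/go"
	"github.com/prometheus/client_golang/prometheus"
)

// Trimmed-down stand-ins for the real types; field and label names
// here are illustrative, not copied from socket.go.
type socketData struct {
	Host            string  `json:"host"`
	UpstreamLatency float64 `json:"upstreamLatency"`
}

type SocketCollector struct {
	upstreamLatency *prometheus.SummaryVec
}

func (sc *SocketCollector) handleMessage(msg []byte) {
	// A single Unmarshal call materializes the entire batch: the backing
	// array for statsBatch plus every decoded string stays live until the
	// loop below finishes.
	var statsBatch []socketData
	if err := jsoniter.ConfigCompatibleWithStandardLibrary.Unmarshal(msg, &statsBatch); err != nil {
		return
	}
	for _, stats := range statsBatch {
		// Summary.Observe takes the metric's mutex, so contention here
		// slows down the loop and keeps the decoded batch in memory longer.
		sc.upstreamLatency.WithLabelValues(stats.Host).Observe(stats.UpstreamLatency)
	}
}
```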

Then we checked the pprof/goroutines result svg here:

Showing nodes accounting for 5018, 99.80% of 5028 total
Dropped 126 nodes (cum <= 25)
Showing top 10 nodes out of 11
      flat  flat%   sum%        cum   cum%
      5018 99.80% 99.80%       5018 99.80%  runtime.gopark
         0     0% 99.80%       4933 98.11%  github.com/prometheus/client_golang/prometheus.(*summary).Observe
         0     0% 99.80%       4933 98.11%  k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage
         0     0% 99.80%       4933 98.11%  k8s.io/ingress-nginx/internal/ingress/metric/collectors.handleMessages
         0     0% 99.80%         29  0.58%  runtime.chanrecv
         0     0% 99.80%       4946 98.37%  runtime.goparkunlock (inline)
         0     0% 99.80%         35   0.7%  runtime.selectgo
         0     0% 99.80%       4933 98.11%  runtime.semacquire1
         0     0% 99.80%       4933 98.11%  sync.(*Mutex).Lock (inline)
         0     0% 99.80%       4933 98.11%  sync.(*Mutex).lockSlow

Most goroutines are in github.com/prometheus/client_golang/prometheus.(*summary).Observe, waiting for a Mutex to be released. We checked the code for the Prometheus Summary, and it does hold a Mutex while doing an Observe, because of quantile recalculation. The Summary is only used by upstreamLatency.

https://github.com/prometheus/client_golang/blob/main/prometheus/summary.go#L284

This can be a time-consuming procedure and blocks the following batches of data from being observed. Meanwhile, handleMessage unmarshals all socketData into memory, which increases memory consumption and makes the lock-wait problem worse.
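
For illustration, here is a minimal sketch of the difference using the client_golang API (the metric names, objectives, and buckets below are placeholders, not the controller's actual configuration): a Summary with objectives funnels every Observe through a mutex-protected quantile stream, while a Histogram's Observe only does an atomic add on a bucket counter, so concurrent observations do not serialize.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Placeholder metric definitions for illustration only; the controller's
// real metric names, labels, and buckets differ.
var (
	// Summary with quantile objectives: each Observe acquires the
	// summary's internal mutex to feed the quantile streams, so
	// highly concurrent callers queue up behind one another.
	latencySummary = prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "upstream_latency_seconds_summary",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})

	// Histogram: Observe just picks a bucket and does atomic adds,
	// so it stays cheap under heavy concurrent load.
	latencyHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "upstream_latency_seconds",
		Buckets: prometheus.DefBuckets,
	})
)

func main() {
	prometheus.MustRegister(latencySummary, latencyHistogram)

	latencySummary.Observe(0.042)   // serialized behind a mutex
	latencyHistogram.Observe(0.042) // atomic bucket increment
}
```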

How to reproduce it:

Run a heavy-load benchmark against the ingress controller with metrics enabled.

Anything else we need to know:

Possible solutions:

  • Use a lock-free Histogram instead of a Summary for upstreamLatency (see the Summary vs. Histogram sketch above)
  • In the meantime, try to reduce memory consumption in handleMessage (perhaps use an iterator to avoid unmarshalling all data into a struct slice at once; see the sketch after this list)
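
For the second item, a rough sketch of what an iterator-based decode could look like with json-iterator (reusing the placeholder socketData type from the sketch above; this is not the actual socket.go code): the payload is walked element by element, so only one decoded entry is live at a time instead of the whole slice.

```go
// Sketch of decoding the socket payload element by element with
// json-iterator instead of unmarshalling the whole []socketData at once.
// Type and field names are the illustrative placeholders from above.
func handleMessageStreaming(msg []byte) {
	cfg := jsoniter.ConfigCompatibleWithStandardLibrary
	iter := cfg.BorrowIterator(msg)
	defer cfg.ReturnIterator(iter)

	for iter.ReadArray() {
		var stats socketData
		iter.ReadVal(&stats)
		if iter.Error != nil {
			break
		}
		// Update metrics for this single entry, then let it be collected
		// before the next element is decoded.
		_ = stats
	}
}
```

Whether this meaningfully lowers peak memory would of course need to be confirmed with the same benchmark.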
@Lyt99 Lyt99 added the kind/bug Categorizes issue or PR as related to a bug. label Mar 2, 2022
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Mar 2, 2022
@k8s-ci-robot
Contributor

@Lyt99: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

From all the data you have provided, it seems like you did a perf test by loading traffic and got a memory spike. Just to state the obvious, it seems logical and expected for memory usage to increase when you test with heavy load.

/remove-kind bug

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 9, 2022
@Lyt99
Contributor Author

Lyt99 commented Mar 9, 2022

From all the data you have provided, it seems like you did a perf test by loading traffic and got a memory spike. Just to state the obvious, it seems logical and expected for memory usage to increase when you test with heavy load.

/remove-kind bug

Yes, it's expected that memory usage increases under heavy load. But it does not seem expected that enabling metrics causes such huge memory consumption, which may cause OOMs.

I think we can consider this a performance issue that can be solved. IIUC, the root cause is that Prometheus's Summary metric needs to recalculate quantiles, which is expensive and not lock-free. This blocks other incoming metrics requests, and the memory they allocated cannot be released in time.

I can do another test with --enable-metrics=false later for comparison. It should show a large decrease in memory usage.

BTW, I searched for other issues, and #8228 seems relevant to this problem.

@longwuyuan
Contributor

I have not looked at the details yet, but if what you say is true, then this is a problem out of scope for the ingress-nginx project, I think. When there is heavy load and all CPU/memory is used up in scraping metrics and logging, it makes sense that recalculation or aggregation work will starve for resources and very likely OOM. I don't know whether the metrics-exposing code is out of the box from Prometheus or custom built for the ingress controller by its developers.

Are you thinking of contributing and submitting a PR? If not, please provide any other details you have here, and let's hope someone has time to work on it.

@Lyt99
Contributor Author

Lyt99 commented Mar 9, 2022

I have not looked at the details yet, but if what you say is true, then this is a problem out of scope for the ingress-nginx project, I think. When there is heavy load and all CPU/memory is used up in scraping metrics and logging, it makes sense that recalculation or aggregation work will starve for resources and very likely OOM. I don't know whether the metrics-exposing code is out of the box from Prometheus or custom built for the ingress controller by its developers.

Are you thinking of contributing and submitting a PR? If not, please provide any other details you have here, and let's hope someone has time to work on it.

I have some ideas and will try them over the next few days. If they work, I will submit a PR.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 7, 2022
@Lyt99
Contributor Author

Lyt99 commented Jun 23, 2022

I'm a bit concerned that PR #8726 would make the problem more serious, as it is a Summary metric.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 23, 2022
@nailgun
Contributor

nailgun commented Jul 26, 2022

Please check #8728.
It deprecates the summary metrics and adds histogram metrics.

@Lyt99
Contributor Author

Lyt99 commented Jul 26, 2022

Please check #8728. It deprecates the summary metrics and adds histogram metrics.

It's great to see that the metric has been deprecated.

But the deprecated upstream_latency metric will be preserved and still calculated here. It seems that the PR cannot fix this problem.

Is there any plan to remove this metric in a future version?

@nailgun
Contributor

nailgun commented Jul 26, 2022

Sure, but it's a breaking change for monitoring tools. If we are ready to do so, I can update the PR to remove these metrics in a single step.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
