Batch metrics and flush periodically #2957

ElvinEfendi · 2018-08-18T00:05:28Z

What this PR does / why we need it:
Currently the Lua code sends metrics to the controller for every request, right after Nginx finishes responding. This means the load on the controller increases linearly with the number of requests, which results in high CPU usage and dropped metrics (the "Resource unavailable" error, raised when the controller cannot process the metrics).
This PR changes the logic on the Lua side to batch the metrics and send the batches to the controller periodically. I've configured the period to 1s for now. Because the metrics are batched in the Nginx worker's memory, I've also added a limit on the number of metrics that can be batched during that 1s, to avoid potentially unbounded memory growth. That limit is 10000 right now, which is a really high number. With this change Nginx would drop metrics only if a single Nginx worker is handling more than 10000 RPS.
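
To illustrate the batching described above, here is a minimal sketch written in Go rather than the Lua this PR changes. It is not the PR's code: `Metric`, `Batcher`, `NewBatcher`, and the `send` callback are hypothetical names, and only the 1s period and the 10000 cap come from the description. The mutex is needed only because this Go version may be called from several goroutines.

```go
package main

import (
	"sync"
	"time"
)

// Metric is a hypothetical stand-in for the data recorded per request.
type Metric struct {
	Host    string
	Status  int
	Latency float64
}

const (
	maxBatchSize  = 10000       // cap quoted in the PR description
	flushInterval = time.Second // flush period quoted in the PR description
)

// Batcher buffers metrics in memory and hands them to send in batches.
type Batcher struct {
	mu      sync.Mutex
	buf     []Metric
	dropped int
	limit   int
	send    func([]Metric) error // assumed transport, e.g. a socket write
}

// NewBatcher returns a Batcher that holds at most limit metrics per batch.
func NewBatcher(limit int, send func([]Metric) error) *Batcher {
	return &Batcher{limit: limit, send: send}
}

// Add appends m to the current batch. When the batch is already full the
// metric is dropped and counted, which keeps memory use bounded.
func (b *Batcher) Add(m Metric) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.buf) >= b.limit {
		b.dropped++
		return false
	}
	b.buf = append(b.buf, m)
	return true
}

// Dropped reports how many metrics were rejected because a batch was full.
func (b *Batcher) Dropped() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.dropped
}

// Flush sends the buffered metrics as one batch and resets the buffer.
// In this sketch a failed send loses the batch; there is no retry.
func (b *Batcher) Flush() error {
	b.mu.Lock()
	batch := b.buf
	b.buf = nil
	b.mu.Unlock()
	if len(batch) == 0 {
		return nil
	}
	return b.send(batch)
}

// Run flushes once per interval until stop is closed, then flushes a
// final time so buffered metrics are not left behind.
func (b *Batcher) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_ = b.Flush()
		case <-stop:
			_ = b.Flush()
			return
		}
	}
}
```

A caller would construct it with `NewBatcher(maxBatchSize, send)`, start `go b.Run(flushInterval, stop)`, and call `b.Add(m)` once per request. The cost per request is then an in-memory append, and the receiver sees at most one message per interval from each batcher.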

I've tested this change locally under heavy load and did not see any metrics being dropped or any "Resource unavailable" errors. Note, however, that this change means metrics will be delayed by up to a second.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

ElvinEfendi · 2018-08-18T17:27:25Z

/assign @aledbf
/assign @antoineco

codecov-io · 2018-08-18T17:40:26Z

Codecov Report

Merging #2957 into master will increase coverage by 0.01%.
The diff coverage is 68.33%.

@@            Coverage Diff             @@
##           master    #2957      +/-   ##
==========================================
+ Coverage   47.55%   47.57%   +0.01%     
==========================================
  Files          77       77              
  Lines        5633     5633              
==========================================
+ Hits         2679     2680       +1     
+ Misses       2602     2600       -2     
- Partials      352      353       +1

Impacted Files	Coverage Δ
internal/ingress/metric/collectors/socket.go	`79.27% <68.33%> (ø)`	⬆️
internal/watch/file_watcher.go	`84.61% <0%> (+3.84%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b78bb25...2207d76. Read the comment docs.

aledbf · 2018-08-18T21:13:59Z

/lgtm

k8s-ci-robot · 2018-08-18T21:14:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, ElvinEfendi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ElvinEfendi,aledbf]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aledbf · 2018-08-18T21:14:23Z

@ElvinEfendi this looks great. Thank you for working on this!

batch metrics and flush periodically

2207d76

ElvinEfendi changed the title ~~[WIP] Batch metrics~~ Batch metrics and flush periodically Aug 18, 2018

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 18, 2018

k8s-ci-robot assigned aledbf and antoineco Aug 18, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 18, 2018

k8s-ci-robot merged commit a982713 into kubernetes:master Aug 18, 2018

ElvinEfendi deleted the batch-metrics branch August 18, 2018 22:29

aledbf mentioned this pull request Aug 23, 2018

High CPU usage of nginx-ingress-controller #2937

Closed

aledbf mentioned this pull request Sep 12, 2018

Issue with connection to /tmp/prometheus-nginx.socket #3084

Closed
