Reset Prometheus request tags on conf reload #2728
Closed
What this PR does:
On every backend configuration reload, the Prometheus collectors for request metrics (bearing the `upstream_ip` tag) are recreated, to avoid keeping metrics for dead upstreams exposed on the `/metrics` endpoint.
We also changed the structure of the collector.go file into a singleton (to make it possible for the ingress controller to ask for a Prometheus collector reset on these metrics on config reloads).
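For illustration, here is a minimal sketch of what such a singleton with a reset could look like using client_golang. All names (`requestCollector`, `ResetRequestMetrics`, `nginx_upstream_requests_total`) are hypothetical, not the actual identifiers from collector.go:

```go
package collector

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// Package-level singleton, created on first use.
var (
	instance *requestCollector
	once     sync.Once
)

type requestCollector struct {
	mu       sync.Mutex
	requests *prometheus.CounterVec
}

// Instance returns the process-wide collector, registering its metrics
// on first use.
func Instance() *requestCollector {
	once.Do(func() {
		instance = &requestCollector{requests: newRequestsVec()}
		prometheus.MustRegister(instance.requests)
	})
	return instance
}

func newRequestsVec() *prometheus.CounterVec {
	return prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "nginx_upstream_requests_total",
		Help: "Requests proxied to each upstream.",
	}, []string{"upstream_ip", "status"})
}

// ResetRequestMetrics drops every timeseries carrying the upstream_ip
// label by recreating the collector, so dead upstreams stop being
// exposed on /metrics after a configuration reload.
func (c *requestCollector) ResetRequestMetrics() {
	c.mu.Lock()
	defer c.mu.Unlock()
	prometheus.Unregister(c.requests)
	c.requests = newRequestsVec()
	prometheus.MustRegister(c.requests)
}
```

The controller would then call `collector.Instance().ResetRequestMetrics()` from its reload path.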
Why we need it:
When an upstream pod dies (because of a scale-down operation, a rolling upgrade, ...), the metrics associated with that pod via the `upstream_ip` tag remain exposed on the `/metrics` endpoint. This causes three undesirable effects:
- queries easily tend to time out, or frontends tend to freeze, because too many timeseries are returned
- the number of exposed timeseries keeps growing over time on production clusters, which also makes them more expensive to ingest and store
- a memory leak in the controller, since the collectors for dead upstreams are never released
Which issue this PR fixes:
I didn't open an issue for that and filed a PR directly, but that can be done if required.
Special notes for your reviewer:
If the "singleton" design pattern bothers you (because you were intending to instantiate multiple collectors in the future for some reason), or anything else. I'll be happy to promptly refactor the PR.
What could we break:
On infrastructures where the config is reloaded at a high pace (higher than the Prometheus `scrape_interval`), we introduce discontinuities in our timeseries. This is something that the Prometheus `rate()` function (and similar ones) handles very well. However, if the backend reconfiguration frequency is higher than the Prometheus `scrape_interval`, it's possible that we get incoherent results. If so, we may just rate-limit the metric flushing operation to one every "user-configurable amount of time" (e.g. 5 minutes). As long as this time is greater than a couple of Prometheus `scrape_interval`s, we should be totally fine :)
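If we do end up needing that rate limit, a minimal sketch could look like the following. This helper is not part of the PR and the names are illustrative:

```go
package collector

import (
	"sync"
	"time"
)

// throttledReset limits how often the Prometheus collectors may be flushed.
type throttledReset struct {
	mu          sync.Mutex
	lastReset   time.Time
	minInterval time.Duration // e.g. 5 * time.Minute, user-configurable
}

// MaybeReset runs reset() only if minInterval has elapsed since the last
// flush, so very frequent config reloads cannot punch holes in the
// timeseries faster than Prometheus scrapes them.
func (t *throttledReset) MaybeReset(reset func()) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if time.Since(t.lastReset) < t.minInterval {
		return
	}
	t.lastReset = time.Now()
	reset()
}
```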
PR Status:
The PR compiles and is being tested on our staging infrastructure. So far so good :)
I'll try it on our production clusters today, compare the size of the `/metrics` endpoint payloads before and after the upgrade, and also check how the timeseries discontinuities behave during intense scale-in operations.