Remove prometheus labels with high cardinality #2701
Conversation
ping @discordianfish |
@aledbf Nice! LGTM, assuming that ngx.var.location_path is the path configured in the config (vs something coming from the client). |
@discordianfish please check #2702 . That PR just replaces the URI label with the fixed Path defined in the Ingress |
Yes |
Just to show an alternative to remove all the labels. |
Ah got it. So yeah I'd go with this here since there is no way to ensure that the other vars are bounded. |
@discordianfish thank you! |
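As a rough illustration of the bounded-label idea discussed above, here is a minimal sketch (the metric name, label names, and the handleSocketMessage helper are invented for this example and are not the controller's actual code). The point is that the path label carries the location path defined in the Ingress, so its cardinality is bounded by the configuration rather than by client-supplied URIs.

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// requests uses only labels whose values come from the Ingress configuration
// or from a small, fixed set (method, status), so the number of series is
// bounded by what is configured, not by what clients send.
var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nginx_http_requests_total",
		Help: "Requests per Ingress-defined path (illustrative metric).",
	},
	[]string{"host", "path", "method", "status"},
)

// handleSocketMessage is a hypothetical handler for the per-request data the
// controller receives; path is the configured location path, never the raw
// request URI.
func handleSocketMessage(host, path, method, status string) {
	requests.WithLabelValues(host, path, method, status).Inc()
}
```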
bytesSent = 150.0,
protocol = "HTTP",
method = "GET",
uri = "/admin",
path = "/admin",
Why not location_path? Or the other way around, why location_path in other places and path here?
location_path -> from nginx (variable)
path -> go side
Ah right, nginx_environment vs. expected_json_stats
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aledbf, antoineco. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment |
Codecov Report
@@ Coverage Diff @@
## master #2701 +/- ##
==========================================
+ Coverage 40.92% 41.08% +0.16%
==========================================
Files 72 72
Lines 5087 5084 -3
==========================================
+ Hits 2082 2089 +7
+ Misses 2716 2707 -9
+ Partials 289 288 -1
Continue to review full report at Codecov.
|
👌 Thanks a lot for the quickfix :) |
Thanks for the quickfix, but for me it's not totally fixed; we still have some labels with high cardinality. In my case, with only 2 Ingresses configured on the controller:
|
But that's a good thing, right? This is telling you that you have some problems with the application or with the configured timeouts. In fact, you should create some alerts for those codes (504 and 502).
How many endpoints do you have? Also, how many requests for that number of events? |
IMO there are 2 problems here:
Moreover, each of these labels involves a combinatorial explosion: it increases the number of metrics by adding combinations of upstream_ip and upstream_status. If you don't want to drop these labels entirely, adding the ability to exclude labels, like in #2699, would be a good alternative.
@discordianfish can you provide some feedback for ^^ ? |
This means the first response from the upstream server returned 504, then 502 from the next one, and finally - (which usually means the connection was closed). Can you post the log for this example (ingress controller pod log)? |
This is fine IMO, it's bounded by the number of backend servers; having hundreds seems reasonable. But the always-changing upstream IPs might become a problem for a long-living ingress controller; I'm not sure the local registry ever gets cleaned up. Usually you would use ConstMetrics that just return a value from another system (e.g. vts before), so you don't need to keep state. |
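For context, a minimal sketch of the ConstMetrics approach mentioned above (the collector, metric name, and fetchStats helper are invented for illustration): values are read from another system at scrape time, so no per-series state accumulates in the registry and stale label combinations disappear as soon as the source stops reporting them.

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// statsCollector exposes values fetched at scrape time instead of keeping
// stateful counters.
type statsCollector struct {
	bytesSent *prometheus.Desc
}

func newStatsCollector() *statsCollector {
	return &statsCollector{
		bytesSent: prometheus.NewDesc(
			"nginx_bytes_sent_total",
			"Bytes sent per ingress path (illustrative metric name).",
			[]string{"path"}, nil,
		),
	}
}

func (c *statsCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.bytesSent
}

func (c *statsCollector) Collect(ch chan<- prometheus.Metric) {
	// At scrape time, emit one const metric per path reported by the source.
	for path, bytes := range fetchStats() {
		ch <- prometheus.MustNewConstMetric(c.bytesSent, prometheus.CounterValue, bytes, path)
	}
}

// fetchStats is hypothetical; it stands in for whatever per-path data the
// upstream system (nginx, vts, etc.) exposes at scrape time.
func fetchStats() map[string]float64 {
	return map[string]float64{"/admin": 150}
}
```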
Sorry, I misspoke; I know what it means, but I think a Prometheus label like this has no meaning. What does a metric like upstream_response_time{upstream_status="504, 502, -"} mean? Is it the response time of the first or the last request? I can't imagine a use case where I would want to use this. I can't provide the log right now (no longer at work), but I don't think it will help: I know why the upstream didn't respond, I was running a load test and the upstream took too long to respond. What I could provide tomorrow, if you want, is an example of the metrics created by my load test that illustrates my point.
As I understand it, it's not only bounded by the number of backend servers but at least by:
And this for each protocol, HTTP status code, host and method. I don't know the exact maths, but things can really add up. I'm not sure I'm explaining my point correctly. I will try to expand my testing tomorrow and provide some samples of the /metrics endpoint. |
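To make the combinatorial concern concrete with purely made-up numbers (none of them come from this thread): with, say, 20 upstream IPs, 10 distinct upstream_status combinations, 6 status codes, 2 hosts, 4 methods and 2 protocols, a single metric could in the worst case grow to 20 × 10 × 6 × 2 × 4 × 2 = 19,200 series.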
You are right about this. The variables
I understand what you are saying, but any alteration to the order could introduce a misunderstanding about what nginx is doing. |
Yes, this scenario makes sense; that's why I am asking for the log, so we can see how to improve this |
If I understand the overall architecture correctly, the Lua scripts push metrics for every request to the nginx-controller process via the unix socket, upon which the nginx-controller process increments/sets all metrics for that backend. This is going to get leaky if not handled carefully; I've seen numerous memory leaks due to leaking metrics (ironic, I know). Basically, what you need to make sure is that you only ever expose metrics for those ingresses/backends/etc. that actually exist at the current moment in time. The strategy I would use here, albeit not particularly great, is to keep the increment/set logic as you have it and add another goroutine that regularly cleans up any metrics that no longer have a respective Ingress object associated with them. If I understand correctly this should not be a problem, since the nginx-controller fully configures nginx and so will always know whenever any of these things change.
|
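A minimal sketch of the cleanup strategy described above, assuming a hypothetical activePaths callback that reports which Ingress paths currently exist and a tracked map recording the label combinations observed so far (neither is part of the actual controller; the requests vector is the illustrative one from the earlier sketch):

```go
package collector

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// cleanupLoop periodically removes metric children whose Ingress path no
// longer exists, so the registry does not grow without bound.
func cleanupLoop(requests *prometheus.CounterVec, tracked map[string]prometheus.Labels, activePaths func() map[string]bool) {
	for range time.Tick(time.Minute) {
		current := activePaths()
		for key, labels := range tracked {
			if !current[labels["path"]] {
				requests.Delete(labels) // drop the stale series from /metrics
				delete(tracked, key)
			}
		}
	}
}
```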
@brancz thank you for the feedback
This should be easy to implement. We have a point in the code where we swap the "old" configuration with the new one, where we can do the cleanup. |
That sounds good. You will want to have a look at the Delete method on the metric vectors [0]. [0] https://godoc.org/github.com/prometheus/client_golang/prometheus#CounterVec.Delete |
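For reference, a small example of the Delete call from [0] (the vector and label values are the illustrative ones from the earlier sketches, not the controller's real metrics): Delete removes one exact label combination from the vector so it no longer appears in the /metrics output.

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// deleteStale removes a single child (one exact label combination) from the
// vector; Delete returns false if no child with these labels existed.
func deleteStale(requests *prometheus.CounterVec) bool {
	return requests.Delete(prometheus.Labels{
		"host":   "example.com",
		"path":   "/admin",
		"method": "GET",
		"status": "200",
	})
}
```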
Replaces #2699