Remove prometheus labels with high cardinality #2701
Conversation
ping @discordianfish |
@aledbf Nice! LGTM, assuming that ngx.var.location_path is the path configured in the config (vs something coming from the client). |
@discordianfish please check #2702 . That PR just replaces the URI label with the fixed Path defined in the Ingress |
Yes |
Just to show an alternative to remove all the labels. |
Ah got it. So yeah I'd go with this here since there is no way to ensure that the other vars are bounded. |
@discordianfish thank you! |
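As a rough illustration of the bounded-label idea discussed above, here is a minimal sketch (the metric name, label names, and the handleSocketMessage helper are invented for this example and are not the controller's actual code). The point is that the path label carries the location path defined in the Ingress, so its cardinality is bounded by the configuration rather than by client-supplied URIs.

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// requests uses only labels whose values come from the Ingress configuration
// or from a small, fixed set (method, status), so the number of series is
// bounded by what is configured, not by what clients send.
var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nginx_http_requests_total",
		Help: "Requests per Ingress-defined path (illustrative metric).",
	},
	[]string{"host", "path", "method", "status"},
)

// handleSocketMessage is a hypothetical handler for the per-request data the
// controller receives; path is the configured location path, never the raw
// request URI.
func handleSocketMessage(host, path, method, status string) {
	requests.WithLabelValues(host, path, method, status).Inc()
}
```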
bytesSent = 150.0,
protocol = "HTTP",
method = "GET",
uri = "/admin",
path = "/admin",
Why not location_path? Or the other way around, why location_path in other places and path here?
location_path -> from nginx (variable)
path -> go side
Ah right, nginx_environment vs. expected_json_stats
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aledbf, antoineco. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment |
Codecov Report
@@ Coverage Diff @@
## master #2701 +/- ##
==========================================
+ Coverage 40.92% 41.08% +0.16%
==========================================
Files 72 72
Lines 5087 5084 -3
==========================================
+ Hits 2082 2089 +7
+ Misses 2716 2707 -9
+ Partials 289 288 -1
Continue to review full report at Codecov.
|
👌 Thanks a lot for the quickfix :) |
Thanks for the quickfix, but for me it's not totally fixed; we still have some labels with high cardinality. In my case, with only 2 Ingresses configured on the controller:
|
But that's a good thing, right? This is telling you that you have some problems with the application or with the configured timeouts. In fact, you should create some alerts for those codes (504 and 502).
How many endpoints do you have? Also, how many requests for that number of events? |
IMO there are 2 problems here:
Moreover, each of these labels involves a combinatorial explosion: it increases the number of metrics by adding combinations of upstream_ip and upstream_status. If you don't want to drop these labels entirely, adding the ability to exclude labels, like in #2699, would be a good alternative.
@discordianfish can you provide some feedback for ^^ ? |
This means the first response from the upstream server returned 504, then 502 from the next one, and finally - (which usually means the connection was closed). Can you post the log for this example (ingress controller pod log)? |
This is fine IMO, it's bounded by the number of backend servers; having hundreds seems reasonable. But the always-changing upstream IPs might become a problem for a long-living ingress controller; I'm not sure the local registry ever gets cleaned up. Usually you would use ConstMetrics that just return a value from another system (e.g. vts before), so you don't need to keep state. |
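For context, a minimal sketch of the ConstMetrics approach mentioned above (the collector, metric name, and fetchStats helper are invented for illustration): values are read from another system at scrape time, so no per-series state accumulates in the registry and stale label combinations disappear as soon as the source stops reporting them.

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// statsCollector exposes values fetched at scrape time instead of keeping
// stateful counters.
type statsCollector struct {
	bytesSent *prometheus.Desc
}

func newStatsCollector() *statsCollector {
	return &statsCollector{
		bytesSent: prometheus.NewDesc(
			"nginx_bytes_sent_total",
			"Bytes sent per ingress path (illustrative metric name).",
			[]string{"path"}, nil,
		),
	}
}

func (c *statsCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.bytesSent
}

func (c *statsCollector) Collect(ch chan<- prometheus.Metric) {
	// At scrape time, emit one const metric per path reported by the source.
	for path, bytes := range fetchStats() {
		ch <- prometheus.MustNewConstMetric(c.bytesSent, prometheus.CounterValue, bytes, path)
	}
}

// fetchStats is hypothetical; it stands in for whatever per-path data the
// upstream system (nginx, vts, etc.) exposes at scrape time.
func fetchStats() map[string]float64 {
	return map[string]float64{"/admin": 150}
}
```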
Sorry, I misspoke; I know what it means, but I think a Prometheus label like this has no meaning. What does a metric like upstream_response_time{upstream_status="504, 502, -"} mean? Is it the response time of the first or the last request? I can't imagine a use case where I would want to use this. I can't provide the log right now (no longer at work), but I don't think it will help: I know why the upstream didn't respond, I was running a load test and the upstream took too long to respond. What I could provide tomorrow, if you want, is an example of the metrics created by my load test that illustrates my point.
As I understand it, it's not only bounded by the number of backend servers but at least by:
And this for each protocol, HTTP status code, host and method. I don't know the exact maths, but things can really add up. I'm not sure I'm explaining my point correctly. I will try to expand my testing tomorrow and provide some samples of the /metrics endpoint. |
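To make the combinatorial concern concrete with purely made-up numbers (none of them come from this thread): with, say, 20 upstream IPs, 10 distinct upstream_status combinations, 6 status codes, 2 hosts, 4 methods and 2 protocols, a single metric could in the worst case grow to 20 × 10 × 6 × 2 × 4 × 2 = 19,200 series.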
You are right about this. The variables
I understand what you are saying, but any alteration to the order could introduce a misunderstanding about what nginx is doing. |
Yes, this scenario makes sense; that's why I am asking for the log, so we can see how to improve this |
If I understand the overall architecture correctly, the Lua scripts push metrics for every request to the nginx-controller process via the unix socket, upon which the nginx-controller process increments/sets all metrics for that backend. This is going to get leaky if not handled carefully; I've seen numerous memory leaks due to leaking metrics (ironic, I know). Basically, what you need to make sure is that you only ever expose metrics for those ingresses/backends/etc. that actually exist at the current moment in time. The strategy I would use here, albeit not particularly great, is to keep the increment/set logic as you have it and add another goroutine that regularly cleans up any metrics that no longer have a respective Ingress object associated with them. If I understand correctly this should not be a problem, since the nginx-controller fully configures nginx and so will always know whenever any of these things change.
|
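A minimal sketch of the cleanup strategy described above, assuming a hypothetical activePaths callback that reports which Ingress paths currently exist and a tracked map recording the label combinations observed so far (neither is part of the actual controller; the requests vector is the illustrative one from the earlier sketch):

```go
package collector

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// cleanupLoop periodically removes metric children whose Ingress path no
// longer exists, so the registry does not grow without bound.
func cleanupLoop(requests *prometheus.CounterVec, tracked map[string]prometheus.Labels, activePaths func() map[string]bool) {
	for range time.Tick(time.Minute) {
		current := activePaths()
		for key, labels := range tracked {
			if !current[labels["path"]] {
				requests.Delete(labels) // drop the stale series from /metrics
				delete(tracked, key)
			}
		}
	}
}
```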
@brancz thank you for the feedback
This should be easy to implement. We have a point in the code where we swap the "old" configuration with the new one, where we can do the cleanup. |
That sounds good. You will want to have a look at the Delete method on the metric vectors [0]. [0] https://godoc.org/github.com/prometheus/client_golang/prometheus#CounterVec.Delete |
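For reference, a small example of the Delete call from [0] (the vector and label values are the illustrative ones from the earlier sketches, not the controller's real metrics): Delete removes one exact label combination from the vector so it no longer appears in the /metrics output.

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// deleteStale removes a single child (one exact label combination) from the
// vector; Delete returns false if no child with these labels existed.
func deleteStale(requests *prometheus.CounterVec) bool {
	return requests.Delete(prometheus.Labels{
		"host":   "example.com",
		"path":   "/admin",
		"method": "GET",
		"status": "200",
	})
}
```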
Replaces #2699