Upgrade client_golang to reduce memory usage with many metrics #76
Are you running one central statsd_exporter? The general recommendation is to run one exporter per node and configure all apps to send to the local one. It also helps if you are using UDP to transport statsd metrics, as packet loss is less of an issue.
That comes out as ~10KiB per metric, which is indeed a lot. As @SuperQ says, the first thing I would do is colocate the exporter with every data source instead of running it on the central statsd server. I am still interested in getting to the bottom of the memory usage. What are all these bytes? Do we really need them? Any input is welcome!
I got a notification for a comment that made good points but appears to be lost? This might be related to summaries with quantiles – if you have a lot of timer metrics, these are translated into summaries, which can be this expensive. Generally, histograms are probably a better idea here, but they open the question of how to configure buckets. On the other hand, one way to mitigate the memory cost of quantiles would be to make MaxAge, BufCap and AgeBuckets configurable as well, so the configuration complexity would be the same. How should this look in the configuration?
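For context, a minimal client_golang sketch of the summary options in question. The metric name, labels and objectives are made up; the point is that every label combination gets its own summary child carrying its own objectives, age buckets and sample buffer.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical timer metric as an exporter might create it from a statsd
// timer. Each distinct label combination becomes one summary child, and each
// child keeps its own quantile and buffered-sample state.
var requestDuration = prometheus.NewSummaryVec(
	prometheus.SummaryOpts{
		Name:       "request_duration_seconds", // illustrative name
		Help:       "Request duration observed from statsd timers.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001}, // example objectives
		MaxAge:     10 * time.Minute, // client_golang default (DefMaxAge)
		AgeBuckets: 5,                // client_golang default (DefAgeBuckets)
		BufCap:     500,              // client_golang default (DefBufCap)
	},
	[]string{"job", "path"}, // illustrative labels
)

func main() {
	prometheus.MustRegister(requestDuration)
	requestDuration.WithLabelValues("example_job", "/example").Observe(0.123)
}
```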
@matthiasr I deleted the comment because I found my exporter still OOMs after switching to histograms (default buckets), and the calculated average memory per metric is still quite high. You can try it using this configuration:
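A hypothetical mapping configuration of this kind, assuming the YAML mapping format with timers mapped to histograms, and with made-up metric names, labels and buckets, might look roughly like this:

```yaml
# Hypothetical example only; names, labels and buckets are illustrative.
defaults:
  timer_type: histogram
  buckets: [0.005, 0.05, 0.1, 0.5, 1, 2.5, 5]
mappings:
  - match: "myapp.*.*.latency"
    timer_type: histogram
    name: "myapp_latency_seconds"
    labels:
      service: "$1"
      endpoint: "$2"
```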
MaxAge, BufCap and AgeBuckets are not configurable at the moment. I'm trying to make them configurable globally in my branch:
If it works, they should also be configurable per metric.
Some testing results: it seems related to the summary timer type.
Some findings: when enough stats are sent, BufCap and MaxAge don't really impact the memory usage.
That's a really good find, thanks a lot @shuz. At the very least, this is something to mention in the README. How do people feel about changing the default?
In my opinion, if it is to be changed, the place to change it is in client_golang. I'm not surprised that summary quantiles are expensive; it's one of the reasons I recommend against them.
Whether statsd timers are observed into histograms or summaries is already configurable, both per match and globally. I don't see what would need to change in the library?
Sorry, I wasn't explicit enough. I meant changing the default for the timer type.
Ah, I don't see a problem with changing it.
Some more findings. In our service, we find the actual requests don't have that many samples. We are in a pattern where many dynamic label combinations are measured, although avg or max of p99 doesn't make too much sense.
What drawbacks does this approach have? How does this behave with concurrent scrapes? Should we do this without too many goroutines in general? For your use case (aggregation across dimensions), histograms are really a better choice though – you can sum these up, grouped as you like, and get reasonable quantile estimations across dimensions.
It's a client_golang thing, but I think using goroutines only for custom collectors might make sense. There's no real need with the standard metric types.
Do you mean it's a problem in the library, or in how we use it? How can we improve this?
It's client_golang internals, though having a metric object per child doesn't exactly help. @beorn7
In our use case, we set up statsd_exporter as a bridge for existing apps. They used to send quite a lot of metrics with labels in the dogstatsd format, and that agent kept running well with around 150M of memory, since it flushed every 10 seconds. With statsd_exporter, we find that the memory usage always grows and that the average memory used by each metric doesn't feel right. We also found that the number of goroutines grows along with the memory usage. So we hacked the Prometheus Go client library to use only one goroutine in Gather():
It's a special case, since we are handling quite a lot of metrics in the container, but it reduced the memory usage by half.
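This is not the actual patch or the library's internals, just a rough sketch of the idea under discussion: drain all collectors through one channel sequentially instead of starting a goroutine per collector.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// collectSerially calls every Collector one at a time instead of giving each
// its own goroutine. Sketch only, not the real Gather() implementation.
func collectSerially(collectors []prometheus.Collector) []prometheus.Metric {
	var out []prometheus.Metric
	ch := make(chan prometheus.Metric)
	done := make(chan struct{})
	go func() {
		for m := range ch {
			out = append(out, m)
		}
		close(done)
	}()
	for _, c := range collectors {
		c.Collect(ch) // one collector at a time
	}
	close(ch)
	<-done
	return out
}

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: "example_total", Help: "example counter"})
	c.Inc()
	fmt.Println(len(collectSerially([]prometheus.Collector{c})))
}
```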
Making
@brian-brazil just to be sure I understand correctly – part of the problem is how we initialize a new metric object for each label combination here (and in the other Get implementations)? And we could optimize this by indexing on the metric name only, and always passing the labels to that metric object? If we ever wanted to expire metrics, that would mean we could only do so on a per-metric, not a per-timeseries, level, but that's not necessarily bad.
Yes.
Yes, though that's presuming client_golang internals don't change.
It seems to me like that's also semantically closer to how metric objects are meant to be used.
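A sketch of the indexing change being discussed, under the assumption that the exporter keeps one vector per metric name and resolves the labels on every event. Function and map names are made up, and locking and error handling are omitted.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// counterVecs indexes on the metric name only; the label values are passed to
// the vector on every event instead of caching one child metric object per
// name+labels combination. Hypothetical sketch, not the exporter's code.
var counterVecs = map[string]*prometheus.CounterVec{}

func observeCounter(name string, labels prometheus.Labels, value float64) {
	vec, ok := counterVecs[name]
	if !ok {
		labelNames := make([]string, 0, len(labels))
		for k := range labels {
			labelNames = append(labelNames, k)
		}
		vec = prometheus.NewCounterVec(
			prometheus.CounterOpts{Name: name, Help: "statsd counter"},
			labelNames,
		)
		prometheus.MustRegister(vec)
		counterVecs[name] = vec
	}
	// The vector resolves (or creates) the child for this label combination.
	vec.With(labels).Add(value)
}

func main() {
	observeCounter("example_requests_total", prometheus.Labels{"path": "/example"}, 1)
}
```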
With the help of some auto-generated events, benchmark the collection part of the exporter. In #76 it was reported that high label cardinality can cause issues.
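A hypothetical benchmark along these lines (the metric name, label and cardinality are made up; this is not the exporter's actual benchmark) could feed many distinct label values into a single vector and report allocations:

```go
package exporter

import (
	"strconv"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

// BenchmarkHighCardinality observes auto-generated events with many distinct
// label values, simulating the high-cardinality situation reported in #76.
// Illustrative sketch only; run with `go test -bench .` in a _test.go file.
func BenchmarkHighCardinality(b *testing.B) {
	vec := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "bench_events_total", Help: "generated benchmark events"},
		[]string{"instance"},
	)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		// Cycle through 100k label values to create many time series.
		vec.WithLabelValues(strconv.Itoa(i % 100000)).Inc()
	}
}
```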
@matthiasr I still see high memory usage when using histograms. Just curious whether the client_golang change from prometheus/client_golang#370 has been merged into statsd_exporter and whether it will fix the issue.
@sathsb no, I haven't done that yet – I'm sorry for the long silence, I was away for a few months. While I'm still catching up, would you like to submit a PR that updates the vendored dependency/dependencies?
There was an upgrade to a newer, but not newest, client version in #119 – if you want to pick this up, you can probably salvage some of the necessary changes from that.
Summary: Timers are very memory-heavy for Prometheus, so I am disabling Prometheus reporting for the resource manager. It is not being used right now, and this will keep master stable by not blocking the performance tests. There will be a subsequent effort to see whether we can use histograms instead of timers.
GitHub issue: prometheus/statsd_exporter#76
Test Plan: ran benchmark test
Reviewers: rcharles, #peloton, mabansal
Reviewed By: #peloton, mabansal
Subscribers: jenkins
Maniphest Tasks: T1976007
Differential Revision: https://code.uberinternal.com/D1983523
I believe this is done now.
Hi,
I'm seeing a ratio of 10k metrics per 100M of RAM for production metrics. Interested to hear first whether that's normal, and then what a good strategy is for handling 1M metrics (this also affects the metrics endpoint polled by Prometheus, obviously). In my tests with 1M metrics, I've gotten 28GB of RAM and growing, and all cores max out (I had to kill the process).