Upgrade client_golang to reduce memory usage with many metrics #76

Closed
jondot opened this issue Aug 1, 2017 · 25 comments

@jondot

jondot commented Aug 1, 2017

Hi,
I'm seeing a ratio of roughly 10k metrics per 100 MB of RAM with production metrics. I'm interested to hear, first, whether that's normal, and then what a good strategy is for handling 1M metrics (this also affects the metrics endpoint polled by Prometheus, obviously).

In my tests with 1M metrics, memory reached 28 GB and kept growing, and all cores maxed out (I had to kill the process).

@SuperQ
Member

SuperQ commented Aug 12, 2017

Are you running one central statsd_exporter?

The general recommendation is to run one exporter per node, and configure all apps to send to localhost. This allows for easy scaling and eliminates the exporter as a single point of failure.

It also helps if you are using UDP to transport statsd metrics, as packet loss is less of an issue.
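
For illustration, a minimal sketch (not from this repo) of an app emitting a statsd counter over UDP to a colocated exporter; localhost:9125 is statsd_exporter's default statsd port, and the metric name is made up:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Send to the exporter on the same host; on loopback, UDP packet loss
	// is effectively a non-issue compared to sending across the network.
	conn, err := net.Dial("udp", "localhost:9125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// statsd line protocol: <name>:<value>|<type>
	fmt.Fprintf(conn, "myapp.requests_total:1|c\n")
}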

SuperQ self-assigned this on Aug 12, 2017
@matthiasr
Contributor

That comes out to ~10 KiB per metric, which is indeed a lot.

As @SuperQ says, the first thing I would do is colocate the exporter with every data source instead of running one central statsd exporter.

I am still interested in getting to the bottom of the memory usage. What are all these bytes? Do we really need them? Any input is welcome!

@matthiasr
Contributor

I got a notification for a comment that made good points but appears to be lost?

This might be related to summaries with quantiles – if you have a lot of timer metrics, these are translated into summaries and can be this expensive.

Now, histograms are generally probably a better idea here, but they open the question of how to configure buckets. On the other hand, one way to mitigate the memory cost of quantiles would be to make MaxAge, BufCap and AgeBuckets configurable as well, so the configuration complexity would be the same either way.

How should this look in configuration?
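
One possible shape, purely as a sketch: expose the three knobs in a summary_options block that can be set globally under defaults and overridden per match. The field names and YAML keys below are hypothetical, not the mapping schema at the time; in Go, the structs the YAML would unmarshal into might look like this:

package mapperconfig

import "time"

// SummaryOptions bundles the tunables discussed above. Hypothetical sketch,
// not the actual statsd_exporter config structs.
type SummaryOptions struct {
	MaxAge     time.Duration `yaml:"max_age"`
	AgeBuckets uint32        `yaml:"age_buckets"`
	BufCap     uint32        `yaml:"buf_cap"`
}

// MetricMapping shows where the block could hang off an individual mapping;
// the same block under `defaults:` would set the global values.
type MetricMapping struct {
	Match          string          `yaml:"match"`
	TimerType      string          `yaml:"timer_type"`
	SummaryOptions *SummaryOptions `yaml:"summary_options,omitempty"`
}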

@shuz

shuz commented Jan 11, 2018

@matthiasr I deleted the comment because I found that we still OOM after switching to histograms (default buckets), and the calculated average memory per metric is still quite high. You can try it with this configuration:

defaults:
  timer_type: histogram

@shuz

shuz commented Jan 11, 2018

MaxAge, BufCap and AgeBuckets are not configurable now.

I'm trying to make them configurable globally in my branch:

func (c *SummaryContainer) Get(metricName string, labels prometheus.Labels, help string) (prometheus.Summary, error) {
	hash := hashNameAndLabels(metricName, labels)
	summary, ok := c.Elements[hash]
	if !ok {
		summary = prometheus.NewSummary(
			prometheus.SummaryOpts{
				Name:        metricName,
				Help:        help,
				ConstLabels: labels,
				// Hard-coded for the experiment; the goal is to read these
				// from a global (and later per-metric) configuration.
				BufCap:     500,
				AgeBuckets: 2,
				MaxAge:     120 * time.Second,
			})
		c.Elements[hash] = summary
	}
	return summary, nil
}

If this works, they should also be made configurable per metric.

@shuz

shuz commented Jan 11, 2018

Some test results. The memory usage seems related to the summary timer type.

Test 1: 80000 metrics, no timers
80000 time series
Memory: 100-200 MB

Test 2: 20000 timers as histograms with 10 buckets
280000 time series
Memory: 50-60 MB

Test 3: 20000 timers as summaries (0.5:0.05, 0.9:0.01, 0.99:0.001), constant value (always 200ms)
100000 time series
Memory: 1.7-5.4 GB

Test 4: 20000 timers as summaries (0.5:0.05, 0.95:0.005), constant value (always 200ms)
Memory: similar to Test 3

Test 5: 20000 timers as summaries (0.5:0.05, 0.95:0.005), constant value (always 200ms), AgeBuckets=2
Memory: about 1.3 GB

Some findings:

When enough stats are sent, BufCap and MaxAge don't really impact the memory usage.
AgeBuckets=1 is optimal for memory but may result in unstable quantile values. If I use real data instead of the constant 200ms value, the memory usage is higher: in my test, AgeBuckets=2 reached about 3 GB of memory with random values of r.NormFloat64()*50+200 (a rough reproduction sketch follows below).
Changing the objectives from (0.5:0.05, 0.9:0.01, 0.99:0.001) to (0.5:0.05, 0.95:0.005) doesn't change the memory usage much.
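
For reference, a rough sketch of the kind of load generator behind numbers like these (not the exact harness used above; metric and label names, and the counts, are illustrative):

package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))

	// Register 20000 summaries with the objectives from Test 3.
	summaries := make([]prometheus.Summary, 0, 20000)
	for i := 0; i < 20000; i++ {
		s := prometheus.NewSummary(prometheus.SummaryOpts{
			Name:        "timer_test",
			Help:        "Synthetic timer.",
			ConstLabels: prometheus.Labels{"id": fmt.Sprint(i)},
			Objectives:  map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		})
		prometheus.MustRegister(s)
		summaries = append(summaries, s)
	}

	// Feed every summary with normally distributed "latencies" around 200ms.
	for {
		for _, s := range summaries {
			s.Observe(r.NormFloat64()*50 + 200)
		}
	}
}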

@matthiasr
Contributor

That's a really good find, thanks a lot @shuz. At the very least, this is something to mention in the README.

How do people feel about changing the default?

matthiasr changed the title from "Memory usage vs metric amount" to "Document high memory usage with summaries" on Jan 11, 2018
@brian-brazil
Contributor

In my opinion if it is to be changed, the place to change it is in client_golang.

I'm not surprised that summary quantiles are expensive; it's one of the reasons I recommend against them.

@matthiasr
Contributor

Whether statsd timers are observed into histograms or summaries is already configurable, both per match and globally. What would need to change in the library?

@matthiasr
Contributor

Sorry, I wasn't explicit enough. I meant changing the default for the timer_type setting.

@brian-brazil
Contributor

Ah, I don't see a problem with changing it.

@shuz

shuz commented Jan 23, 2018

Some more findings. In our service, the actual requests don't produce that many samples; our pattern is that many dynamic label combinations are measured, even though avg or max of p99 doesn't make much sense for them.
We were able to make the memory usage much smoother by making the coldBuf/hotBuf/quantile stream buffers all start with 0 capacity.
Another thing we found is that the Gather() call forks a goroutine for each metric. This is also a big cost, so we added a TTL for metrics, relying on the fact that our metrics rate is quite low. A good workaround would be a version of Gather() that collects metrics without forking goroutines, given the simple collectors used by statsd_exporter.
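
As an illustration of the TTL idea (a hypothetical helper, not statsd_exporter's or the commenter's actual code), metrics could be unregistered once they have been idle longer than a configured TTL:

package ttl

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// ttlRegistry remembers when each metric was last updated.
type ttlRegistry struct {
	mu       sync.Mutex
	ttl      time.Duration
	lastSeen map[string]time.Time
	metrics  map[string]prometheus.Collector
}

// touch records that a metric was just updated.
func (r *ttlRegistry) touch(name string, c prometheus.Collector) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastSeen[name] = time.Now()
	r.metrics[name] = c
}

// sweep unregisters metrics that have been idle longer than the TTL;
// it would be called periodically from a background goroutine.
func (r *ttlRegistry) sweep() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for name, t := range r.lastSeen {
		if time.Since(t) > r.ttl {
			prometheus.Unregister(r.metrics[name])
			delete(r.metrics, name)
			delete(r.lastSeen, name)
		}
	}
}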

@matthiasr
Contributor

What drawbacks does this approach have? How does it behave with concurrent scrapes? Should we generally gather with fewer goroutines?

For your use case (aggregation across dimensions), histograms are really a better choice though – you can sum them up, grouped as you like, and get reasonable quantile estimates across dimensions.
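
To illustrate why histograms are cheaper here, a generic client_golang sketch (metric and label names are made up):

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// 10 fixed buckets: memory per time series stays constant no matter how
	// many observations arrive, unlike a summary's quantile sample streams.
	timings := prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "request_duration_seconds",
		Help:    "Request duration in seconds.",
		Buckets: prometheus.ExponentialBuckets(0.005, 2, 10),
	}, []string{"service"})
	prometheus.MustRegister(timings)

	// Each observation only increments one bucket counter plus sum and count.
	timings.WithLabelValues("checkout").Observe(0.2)
}

On the Prometheus side, quantiles can then be estimated across any set of dimensions with histogram_quantile over the summed bucket rates, which is not possible with precomputed summary quantiles.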

@brian-brazil
Contributor

It's a client_golang thing, but I think using goroutines only for custom collectors might make sense. There's no real need for them with the standard metric types.

@matthiasr
Contributor

Do you mean it's a problem in the library, or in how we use it? How can we improve this?

@brian-brazil
Contributor

It's client_golang internals, though having a metric object per child doesn't exactly help. @beorn7

@shuz

shuz commented Jan 23, 2018

In our use case, we set up statsd_exporter as a bridge for existing apps. They send quite a lot of metrics with labels, using the DogStatsD format, and the previous agent kept running fine with around 150 MB of memory, since it flushed every 10 seconds.

With statsd_exporter we find that the memory usage keeps growing, and the average memory used by each metric doesn't feel right.

We then found that the number of goroutines grows along with the memory usage:

[screenshot: graph of goroutine count increasing over time alongside memory usage]

Then we hacked the Prometheus Go client library to use only one goroutine in Gather():

	// Excerpt from our patched Gather(): wg and metricChan come from the
	// surrounding function in client_golang's registry.

	// make a copy of collectors to prevent concurrent access
	collectors := make([]Collector, 0, len(r.collectorsByID))
	for _, collector := range r.collectorsByID {
		collectors = append(collectors, collector)
	}

	// collect sequentially in a single goroutine instead of spawning one
	// goroutine per collector
	go func(collectors []Collector) {
		for _, collector := range collectors {
			func(collector Collector) {
				defer wg.Done()
				collector.Collect(metricChan)
			}(collector)
		}
	}(collectors)

It's a special case, since we are handling quite a lot of metrics in one container, but it reduced the memory usage by half.

@beorn7
Member

beorn7 commented Jan 25, 2018

Making Gather behave differently depending on the nature of the Collector sounds like a horrible leak in the abstraction. I would much prefer to avoid that. Perhaps it would help to limit the number of goroutines, or do something smarter that adapts. I filed prometheus/client_golang#369.
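
For the record, a minimal sketch of what limiting the number of goroutines could look like – a fixed worker pool draining the collectors. The types are simplified stand-ins, not client_golang's actual Gather implementation:

package collectpool

import "sync"

// Metric and Collector are simplified stand-ins for the real types.
type Metric string

type Collector interface {
	Collect(ch chan<- Metric)
}

// gatherBounded drains all collectors using at most `workers` goroutines.
func gatherBounded(collectors []Collector, workers int) []Metric {
	jobs := make(chan Collector)
	out := make(chan Metric)
	var wg sync.WaitGroup

	// Fixed pool of workers, independent of the number of collectors.
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for c := range jobs {
				c.Collect(out)
			}
		}()
	}

	// Feed the collectors, then close the output once all workers are done.
	go func() {
		for _, c := range collectors {
			jobs <- c
		}
		close(jobs)
		wg.Wait()
		close(out)
	}()

	var metrics []Metric
	for m := range out {
		metrics = append(metrics, m)
	}
	return metrics
}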

@matthiasr
Contributor

@brian-brazil just to be sure I understand correctly – part of the problem is how we initialize a new metric object for each label combination here (and in the other Get implementations)? And we could optimize this by indexing on the metricName only, and always passing the labels to that metric object?

If we ever wanted to expire metrics, that would mean we could only do so on a per-metric not a per-timeseries level, but that's not necessarily bad.

@brian-brazil
Contributor

part of the problem is how we initialize a new metric object for each label combination?

Yes.

And we could optimize this by indexing on the metricName only, and always passing the labels to that metric object?

Yes, though that's presuming client_golang internals don't change.

@matthiasr
Contributor

It seems to me that's also semantically closer to how metric objects are meant to be used.
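
A rough sketch of that approach (a hypothetical helper, not the statsd_exporter code): keep one vector per metric name and pass the label values on each observation, instead of one metric object with ConstLabels per name+labels combination. It assumes a given metric name always arrives with the same set of label names, and a real version would also need locking around the map:

package registry

import "github.com/prometheus/client_golang/prometheus"

// summaries holds one SummaryVec per metric name.
var summaries = map[string]*prometheus.SummaryVec{}

// observe looks up (or creates) the vector for a metric name and records the
// value with the given label values.
func observe(name string, labels prometheus.Labels, value float64) {
	vec, ok := summaries[name]
	if !ok {
		labelNames := make([]string, 0, len(labels))
		for k := range labels {
			labelNames = append(labelNames, k)
		}
		vec = prometheus.NewSummaryVec(prometheus.SummaryOpts{
			Name: name,
			Help: "Metric autogenerated by statsd_exporter.",
		}, labelNames)
		prometheus.MustRegister(vec)
		summaries[name] = vec
	}
	vec.With(labels).Observe(value)
}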

matthiasr pushed a commit that referenced this issue Jan 27, 2018
with the help of some auto-generated events, benchmark the collection
part of the exporter. In #76 it was reported that high label cardinality
can cause issues.
@sathsb

sathsb commented May 25, 2018

@matthiasr I still see high memory usage when using histograms. Just curious whether the client_golang with prometheus/client_golang#370 has been merged into statsd_exporter and whether it will fix the issue.

@matthiasr
Contributor

@sathsb no, I haven't done that yet – I'm sorry for the long silence, I was away for a few months. While I'm still catching up, would you like to submit a PR that updates the vendored dependency/dependencies?

@matthiasr
Contributor

There was an upgrade to a newer, but not newest, client version in #119 – if you want to pick this up, you can probably salvage some of the necessary changes from that.

matthiasr changed the title from "Document high memory usage with summaries" to "Upgrade client_golang to reduce memory usage with many time series" on Aug 4, 2018
matthiasr changed the title from "Upgrade client_golang to reduce memory usage with many time series" to "Upgrade client_golang to reduce memory usage with many metrics" on Aug 4, 2018
mincai pushed a commit to uber/peloton that referenced this issue Jan 7, 2019
Summary:
Timers are very memory-heavy for Prometheus, so I am disabling Prometheus
reporting for the resource manager. It is not being used right now, and
disabling it will keep master stable by not blocking the performance tests.

There will be a subsequent effort to see whether we can use histograms instead of timers.

github issue: prometheus/statsd_exporter#76

Test Plan: ran benchmark test

@matthiasr
Contributor

I believe this is done now.
