Upgrade client_golang to reduce memory usage with many metrics #76

Closed
jondot opened this issue Aug 1, 2017 · 25 comments

@jondot

jondot commented Aug 1, 2017

Hi,
I'm seeing a ratio of roughly 10k metrics per 100 MB of RAM with production metrics. I'm interested to hear, first, whether that's normal, and then what a good strategy is for handling 1M metrics (this also affects the metrics endpoint polled by Prometheus, obviously).

In my tests with 1M metrics, memory reached 28 GB and kept growing, and all cores maxed out (I had to kill the process).

@SuperQ
Member

SuperQ commented Aug 12, 2017

Are you running one central statsd_exporter?

The general recommendation is to run one exporter per node, and configure all apps to send to localhost. This allows for easy scaling and eliminates the exporter as a single point of failure.

It also helps if you are using UDP to transport statsd metrics, as packet loss is less of an issue.
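
For illustration, a minimal sketch (not from this repo) of an app emitting a statsd counter over UDP to a colocated exporter; localhost:9125 is statsd_exporter's default statsd port, and the metric name is made up:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Send to the exporter on the same host; on loopback, UDP packet loss
	// is effectively a non-issue compared to sending across the network.
	conn, err := net.Dial("udp", "localhost:9125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// statsd line protocol: <name>:<value>|<type>
	fmt.Fprintf(conn, "myapp.requests_total:1|c\n")
}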

SuperQ self-assigned this on Aug 12, 2017
@matthiasr
Contributor

That comes out to ~10 KiB per metric, which is indeed a lot.

As @SuperQ says, the first thing I would do is colocate the exporter with every data source instead of running one central statsd exporter.

I am still interested in getting to the bottom of the memory usage. What are all these bytes? Do we really need them? Any input is welcome!

@matthiasr
Contributor

I got a notification for a comment that made good points but appears to be lost?

This might be related to summaries with quantiles – if you have a lot of timer metrics, these are translated into summaries and can be this expensive.

Now, histograms are generally probably a better idea here, but they open the question of how to configure buckets. On the other hand, one way to mitigate the memory cost of quantiles would be to make MaxAge, BufCap and AgeBuckets configurable as well, so the configuration complexity would be the same either way.

How should this look in configuration?
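
One possible shape, purely as a sketch: expose the three knobs in a summary_options block that can be set globally under defaults and overridden per match. The field names and YAML keys below are hypothetical, not the mapping schema at the time; in Go, the structs the YAML would unmarshal into might look like this:

package mapperconfig

import "time"

// SummaryOptions bundles the tunables discussed above. Hypothetical sketch,
// not the actual statsd_exporter config structs.
type SummaryOptions struct {
	MaxAge     time.Duration `yaml:"max_age"`
	AgeBuckets uint32        `yaml:"age_buckets"`
	BufCap     uint32        `yaml:"buf_cap"`
}

// MetricMapping shows where the block could hang off an individual mapping;
// the same block under `defaults:` would set the global values.
type MetricMapping struct {
	Match          string          `yaml:"match"`
	TimerType      string          `yaml:"timer_type"`
	SummaryOptions *SummaryOptions `yaml:"summary_options,omitempty"`
}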

@shuz

shuz commented Jan 11, 2018

@matthiasr I deleted the comment because I found that we still OOM after switching to histograms (default buckets), and the calculated average memory per metric is still quite high. You can try it with this configuration:

defaults:
  timer_type: histogram

@shuz

shuz commented Jan 11, 2018

MaxAge, BufCap and AgeBuckets are not configurable now.

I'm trying to make them configurable globally in my branch:

func (c *SummaryContainer) Get(metricName string, labels prometheus.Labels, help string) (prometheus.Summary, error) {
	hash := hashNameAndLabels(metricName, labels)
	summary, ok := c.Elements[hash]
	if !ok {
		summary = prometheus.NewSummary(
			prometheus.SummaryOpts{
				Name:        metricName,
				Help:        help,
				ConstLabels: labels,
				// Hard-coded for the experiment; the goal is to read these
				// from a global (and later per-metric) configuration.
				BufCap:     500,
				AgeBuckets: 2,
				MaxAge:     120 * time.Second,
			})
		c.Elements[hash] = summary
	}
	return summary, nil
}

If this works, they should also be made configurable per metric.

@shuz

shuz commented Jan 11, 2018

Some test results. The memory usage seems related to the summary timer type.

Test 1: 80000 metrics, no timers
80000 time series
Memory: 100-200 MB

Test 2: 20000 timers as histograms with 10 buckets
280000 time series
Memory: 50-60 MB

Test 3: 20000 timers as summaries (0.5:0.05, 0.9:0.01, 0.99:0.001), constant value (always 200ms)
100000 time series
Memory: 1.7-5.4 GB

Test 4: 20000 timers as summaries (0.5:0.05, 0.95:0.005), constant value (always 200ms)
Memory: similar to Test 3

Test 5: 20000 timers as summaries (0.5:0.05, 0.95:0.005), constant value (always 200ms), AgeBuckets=2
Memory: about 1.3 GB

Some findings:

When enough stats are sent, BufCap and MaxAge don't really impact the memory usage.
AgeBuckets=1 is optimal for memory but may result in unstable quantile values. If I use real data instead of the constant 200ms value, the memory usage is higher: in my test, AgeBuckets=2 reached about 3 GB of memory with random values of r.NormFloat64()*50+200 (a rough reproduction sketch follows below).
Changing the objectives from (0.5:0.05, 0.9:0.01, 0.99:0.001) to (0.5:0.05, 0.95:0.005) doesn't change the memory usage much.
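
For reference, a rough sketch of the kind of load generator behind numbers like these (not the exact harness used above; metric and label names, and the counts, are illustrative):

package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))

	// Register 20000 summaries with the objectives from Test 3.
	summaries := make([]prometheus.Summary, 0, 20000)
	for i := 0; i < 20000; i++ {
		s := prometheus.NewSummary(prometheus.SummaryOpts{
			Name:        "timer_test",
			Help:        "Synthetic timer.",
			ConstLabels: prometheus.Labels{"id": fmt.Sprint(i)},
			Objectives:  map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		})
		prometheus.MustRegister(s)
		summaries = append(summaries, s)
	}

	// Feed every summary with normally distributed "latencies" around 200ms.
	for {
		for _, s := range summaries {
			s.Observe(r.NormFloat64()*50 + 200)
		}
	}
}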

@matthiasr
Contributor

That's a really good find, thanks a lot @shuz. At the very least, this is something to mention in the README.

How do people feel about changing the default?

matthiasr changed the title from "Memory usage vs metric amount" to "Document high memory usage with summaries" on Jan 11, 2018
@brian-brazil
Contributor

In my opinion if it is to be changed, the place to change it is in client_golang.

I'm not surprised that summary quantiles are expensive; it's one of the reasons I recommend against them.

@matthiasr
Contributor

Whether statsd timers are observed into histograms or summaries is already configurable, both per match and globally. What would need to change in the library?

@matthiasr
Contributor

Sorry, I wasn't explicit enough. I meant changing the default for the timer_type setting.

@brian-brazil
Contributor

Ah, I don't see a problem with changing it.

@shuz

shuz commented Jan 23, 2018

Some more findings. In our service, the actual requests don't produce that many samples; our pattern is that many dynamic label combinations are measured, even though avg or max of p99 doesn't make much sense for them.
We were able to make the memory usage much smoother by making the coldBuf/hotBuf/quantile stream buffers all start with 0 capacity.
Another thing we found is that the Gather() call forks a goroutine for each metric. This is also a big cost, so we added a TTL for metrics, relying on the fact that our metrics rate is quite low. A good workaround would be a version of Gather() that collects metrics without forking goroutines, given the simple collectors used by statsd_exporter.
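
As an illustration of the TTL idea (a hypothetical helper, not statsd_exporter's or the commenter's actual code), metrics could be unregistered once they have been idle longer than a configured TTL:

package ttl

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// ttlRegistry remembers when each metric was last updated.
type ttlRegistry struct {
	mu       sync.Mutex
	ttl      time.Duration
	lastSeen map[string]time.Time
	metrics  map[string]prometheus.Collector
}

// touch records that a metric was just updated.
func (r *ttlRegistry) touch(name string, c prometheus.Collector) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastSeen[name] = time.Now()
	r.metrics[name] = c
}

// sweep unregisters metrics that have been idle longer than the TTL;
// it would be called periodically from a background goroutine.
func (r *ttlRegistry) sweep() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for name, t := range r.lastSeen {
		if time.Since(t) > r.ttl {
			prometheus.Unregister(r.metrics[name])
			delete(r.metrics, name)
			delete(r.lastSeen, name)
		}
	}
}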

@matthiasr
Contributor

What drawbacks does this approach have? How does it behave with concurrent scrapes? Should we generally gather with fewer goroutines?

For your use case (aggregation across dimensions), histograms are really a better choice though – you can sum them up, grouped as you like, and get reasonable quantile estimates across dimensions.
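
To illustrate why histograms are cheaper here, a generic client_golang sketch (metric and label names are made up):

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// 10 fixed buckets: memory per time series stays constant no matter how
	// many observations arrive, unlike a summary's quantile sample streams.
	timings := prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "request_duration_seconds",
		Help:    "Request duration in seconds.",
		Buckets: prometheus.ExponentialBuckets(0.005, 2, 10),
	}, []string{"service"})
	prometheus.MustRegister(timings)

	// Each observation only increments one bucket counter plus sum and count.
	timings.WithLabelValues("checkout").Observe(0.2)
}

On the Prometheus side, quantiles can then be estimated across any set of dimensions with histogram_quantile over the summed bucket rates, which is not possible with precomputed summary quantiles.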

@brian-brazil
Contributor

It's a client_golang thing, but I think using goroutines only for custom collectors might make sense. There's no real need for them with the standard metric types.

@matthiasr
Contributor

Do you mean it's a problem in the library, or in how we use it? How can we improve this?

@brian-brazil
Contributor

It's client_golang internals, though having a metric object per child doesn't exactly help. @beorn7

@shuz

shuz commented Jan 23, 2018

In our use case, we set up statsd_exporter as a bridge for existing apps. They send quite a lot of metrics with labels, using the DogStatsD format, and the previous agent kept running fine with around 150 MB of memory, since it flushed every 10 seconds.

With statsd_exporter we find that the memory usage keeps growing, and the average memory used by each metric doesn't feel right.

We then found that the number of goroutines grows along with the memory usage:

[screenshot: graph of goroutine count increasing over time alongside memory usage]

Then we hacked the Prometheus Go client library to use only one goroutine in Gather():

	// Excerpt from our patched Gather(): wg and metricChan come from the
	// surrounding function in client_golang's registry.

	// make a copy of collectors to prevent concurrent access
	collectors := make([]Collector, 0, len(r.collectorsByID))
	for _, collector := range r.collectorsByID {
		collectors = append(collectors, collector)
	}

	// collect sequentially in a single goroutine instead of spawning one
	// goroutine per collector
	go func(collectors []Collector) {
		for _, collector := range collectors {
			func(collector Collector) {
				defer wg.Done()
				collector.Collect(metricChan)
			}(collector)
		}
	}(collectors)

It's a special case, since we are handling quite a lot of metrics in one container, but it reduced the memory usage by half.

@beorn7
Member

beorn7 commented Jan 25, 2018

Making Gather behave differently depending on the nature of the Collector sounds like a horrible leak in the abstraction. I would much prefer to avoid that. Perhaps it would help to limit the number of goroutines, or do something smarter that adapts. I filed prometheus/client_golang#369.
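
For the record, a minimal sketch of what limiting the number of goroutines could look like – a fixed worker pool draining the collectors. The types are simplified stand-ins, not client_golang's actual Gather implementation:

package collectpool

import "sync"

// Metric and Collector are simplified stand-ins for the real types.
type Metric string

type Collector interface {
	Collect(ch chan<- Metric)
}

// gatherBounded drains all collectors using at most `workers` goroutines.
func gatherBounded(collectors []Collector, workers int) []Metric {
	jobs := make(chan Collector)
	out := make(chan Metric)
	var wg sync.WaitGroup

	// Fixed pool of workers, independent of the number of collectors.
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for c := range jobs {
				c.Collect(out)
			}
		}()
	}

	// Feed the collectors, then close the output once all workers are done.
	go func() {
		for _, c := range collectors {
			jobs <- c
		}
		close(jobs)
		wg.Wait()
		close(out)
	}()

	var metrics []Metric
	for m := range out {
		metrics = append(metrics, m)
	}
	return metrics
}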

@matthiasr
Contributor

@brian-brazil just to be sure I understand correctly – part of the problem is how we initialize a new metric object for each label combination here (and in the other Get implementations)? And we could optimize this by indexing on the metricName only, and always passing the labels to that metric object?

If we ever wanted to expire metrics, that would mean we could only do so on a per-metric not a per-timeseries level, but that's not necessarily bad.

@brian-brazil
Contributor

part of the problem is how we initialize a new metric object for each label combination?

Yes.

And we could optimize this by indexing on the metricName only, and always passing the labels to that metric object?

Yes, though that's presuming client_golang internals don't change.

@matthiasr
Contributor

It seems to me that's also semantically closer to how metric objects are meant to be used.
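
A rough sketch of that approach (a hypothetical helper, not the statsd_exporter code): keep one vector per metric name and pass the label values on each observation, instead of one metric object with ConstLabels per name+labels combination. It assumes a given metric name always arrives with the same set of label names, and a real version would also need locking around the map:

package registry

import "github.com/prometheus/client_golang/prometheus"

// summaries holds one SummaryVec per metric name.
var summaries = map[string]*prometheus.SummaryVec{}

// observe looks up (or creates) the vector for a metric name and records the
// value with the given label values.
func observe(name string, labels prometheus.Labels, value float64) {
	vec, ok := summaries[name]
	if !ok {
		labelNames := make([]string, 0, len(labels))
		for k := range labels {
			labelNames = append(labelNames, k)
		}
		vec = prometheus.NewSummaryVec(prometheus.SummaryOpts{
			Name: name,
			Help: "Metric autogenerated by statsd_exporter.",
		}, labelNames)
		prometheus.MustRegister(vec)
		summaries[name] = vec
	}
	vec.With(labels).Observe(value)
}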

matthiasr pushed a commit that referenced this issue Jan 27, 2018
with the help of some auto-generated events, benchmark the collection
part of the exporter. In #76 it was reported that high label cardinality
can cause issues.
@sathsb

sathsb commented May 25, 2018

@matthiasr I still see high memory usage when using histograms. Just curious whether the client_golang with prometheus/client_golang#370 has been merged into statsd_exporter and whether it will fix the issue.

@matthiasr
Contributor

@sathsb no, I haven't done that yet – I'm sorry for the long silence, I was away for a few months. While I'm still catching up, would you like to submit a PR that updates the vendored dependency/dependencies?

@matthiasr
Contributor

There was an upgrade to a newer, but not newest, client version in #119 – if you want to pick this up, you can probably salvage some of the necessary changes from that.

matthiasr changed the title from "Document high memory usage with summaries" to "Upgrade client_golang to reduce memory usage with many time series" on Aug 4, 2018
matthiasr changed the title from "Upgrade client_golang to reduce memory usage with many time series" to "Upgrade client_golang to reduce memory usage with many metrics" on Aug 4, 2018
mincai pushed a commit to uber/peloton that referenced this issue Jan 7, 2019
Summary:
Timers are very memory-heavy for Prometheus, so I am disabling Prometheus
reporting for the resource manager. It is not being used right now, and
disabling it will keep master stable by not blocking the performance tests.

There will be a subsequent effort to see whether we can use histograms instead of timers.

github issue: prometheus/statsd_exporter#76

Test Plan: ran benchmark test

@matthiasr
Contributor

I believe this is done now.
