stats: native prometheus export support #1947

Closed
1 of 2 tasks
mattklein123 opened this issue Oct 26, 2017 · 71 comments
Labels
enhancement (Feature requests. Not bugs or questions.), help wanted (Needs help!)
Comments

@mattklein123
Member

mattklein123 commented Oct 26, 2017

Things that need sorting out:

@mattklein123
Member Author

Related to envoyproxy/data-plane-api#210

@mattklein123
Member Author

cc @mrice32

mattklein123 added the help wanted label on Oct 26, 2017
@emmanuel

Some broad context and a few points of discussion are here: christian-posta/envoy-microservices-patterns#2 (probably not terribly relevant).

@brancz

brancz commented Oct 27, 2017

Why would Envoy need to implement this itself? The C++ lib for Prometheus that we promote on prometheus.io seems to have histogram support.

I would encourage improving the existing libraries instead of rolling a custom implementation for native Prometheus support.

@mattklein123
Member Author

@brancz Envoy has a very large amount of plumbing already in place for stats, export, etc. It also has a very specific threading model. We also need built-in native HDR histograms in Envoy for other features. We will definitely investigate the various libraries available and figure out the right path forward before anyone starts work on this feature. (We first need to find someone who wants to build this specific feature; having HDR histogram support in Envoy is orthogonal and I will open a separate issue on that.)

@mattklein123
Member Author

I just synced up with the Prometheus team. Here is the update:

  • We will go with a pull model for Prometheus stats in Envoy.
  • We will start with a text endpoint to output counters and gauges only (see the sketch after this list for roughly what that output looks like). This will take me about 2 hours so I'm just going to do this.
  • Later we will need help with:
    • HDR histogram support in Envoy stat store (may get done orthogonally for other reasons)
    • Output histograms in stats endpoints
    • Output text+gzip Prometheus format
    • Output proto Prometheus format
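
For reference, the Prometheus text exposition format for counters and gauges is simple line-oriented output. The following is a minimal standalone sketch of what the text endpoint would need to emit; it is not Envoy code, and the metric and label names are made up:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Formats one counter in the Prometheus text exposition format.
// The "# TYPE" metadata line is optional but recommended.
std::string formatCounter(const std::string& name, const std::string& labels,
                          uint64_t value) {
  std::ostringstream out;
  out << "# TYPE " << name << " counter\n";              // type metadata line
  out << name << "{" << labels << "} " << value << "\n"; // sample line
  return out.str();
}

// Example output (hypothetical metric name):
//   # TYPE envoy_cluster_upstream_rq_total counter
//   envoy_cluster_upstream_rq_total{cluster="service1"} 1234
```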

@mattklein123
Member Author

mattklein123 commented Oct 28, 2017

@lita @jmphilli the first part of this (counter/gauge output) is a really good beginner ticket so if you want to give it a go that would be great. Basically we need to do the following:

More planning is required for histograms. We can deal with that later.

@brancz

brancz commented Oct 29, 2017

If at all possible, /metrics is the default path for Prometheus. It's configurable on the Prometheus side, but usually when a project adds native Prometheus support it adopts this convention.

Description and type information about a metric is technically optional, but it is best practice to include it, as it allows Prometheus to perform certain optimizations based on the type. A description can also be helpful for display when querying (neither of those things is implemented today, but they have been discussed and the consensus is that this information is good to have). If this is problematic to add now, it can also be done in follow-up improvements.

@SuperQ

SuperQ commented Oct 31, 2017

👍 for having /metrics be the Prometheus output path.

@mattklein123
Member Author

Sure, doing /metrics for the Prometheus admin path is easy enough.

@lita

lita commented Oct 31, 2017

I can update the path to be /metrics then. Right now I have implemented it as /status?format=prometheus.

@SuperQ

SuperQ commented Nov 1, 2017

@lita Cool, thanks.

Typically for Prometheus, we use Content-Type headers, rather than URL params to adjust transport formats. We use the text format as the fallback method.

@mattklein123
Member Author

@lita FYI you can use #1919 to optionally compress the output. I would do this as a follow up.

@mattklein123
Member Author

Update here: Once #3130 (review) lands we will be able to trivially export histogram summaries for Prometheus, and then we can consider this issue fixed. If someone wants to sign up for doing the histogram export that would be awesome!!!

@ggreenway
Contributor

I'll do this eventually if nobody else has yet, but I won't have time to start on it for awhile. If/when I start working on it, I'll comment here and assign it to myself. If that hasn't happened yet, someone else can claim it.

@JonathanO
Contributor

Is the plan to enable export of histograms as well as summaries?

Summaries are nice for looking at a single instance, but our dashboards aggregate across multiple instances, and to do that the raw histogram type is needed.

@mattklein123
Member Author

@JonathanO the way the histogram library works it should be pretty straightforward to export both summaries and the full histograms as desired. cc @ramaraochavali @postwait

@SuperQ

SuperQ commented Apr 24, 2018

There's no need to bother with summaries; they have almost no usefulness since they can't be aggregated.

@ggreenway
Contributor

Yeah, I'd argue against doing summaries at all. They can be generated by another (external) tool.

@stevesloka
Member

Is this it? #3130

@redhotpenguin

/stats/ exposes the calculated quantiles, but not the bin counts.

From my dev instance:
cluster.service1.external.upstream_rq_time: P0(nan,13) P25(nan,13.25) P50(nan,13.5) P75(nan,13.75) P90(nan,13.9) P95(nan,13.95) P99(nan,13.99) P99.5(nan,13.995) P99.9(nan,13.999) P100(nan,14)

I'm working on exposing the log-linear histogram data serialized in a stats sink. I've been trying to wrangle Bazel into building a gRPC client to hit the existing endpoint; the default metrics service usage isn't that clear to me currently.

@ramaraochavali
Contributor

The existing Metrics Service currently exposes quantiles only via the gRPC endpoint, but it is very easy to add bin support since it follows the Prometheus proto. You would change https://github.com/envoyproxy/envoy/blob/master/source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.cc#L64 to do that. ParentHistogram holds the log-linear histogram; you can expose the bins via the ParentHistogram interface.
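
To illustrate what "adding bin support" means, here is a rough standalone sketch of turning latched histogram samples into the cumulative buckets the Prometheus histogram model expects. The types below are stand-ins for Envoy's ParentHistogram statistics and the io.prometheus.client proto, not the real interfaces:

```cpp
#include <cstdint>
#include <vector>

// Stand-in for one bucket in the Prometheus histogram data model.
struct Bucket {
  double upper_bound;         // inclusive "le" bound
  uint64_t cumulative_count;  // number of samples <= upper_bound
};

// Converts raw latched samples into cumulative buckets for the given bounds.
std::vector<Bucket> toCumulativeBuckets(const std::vector<double>& samples,
                                        const std::vector<double>& bounds) {
  std::vector<Bucket> buckets;
  for (double bound : bounds) {
    uint64_t count = 0;
    for (double sample : samples) {
      if (sample <= bound) {
        ++count;  // Prometheus buckets are cumulative, not per-bin deltas
      }
    }
    buckets.push_back({bound, count});
  }
  return buckets;
}
```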

@mattklein123
Member Author

mattklein123 commented Oct 30, 2018

For whoever wants to work on this, here are some code references. I don't think this is too hard to finish, someone just needs to dig in:

  1. https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L620
  2. For an example of how histograms are written out for other things, see the gRPC code that @ramaraochavali referenced above or also here: https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L676

Basically the work here, as @postwait said, is as follows:

  1. Decide if we are going to output standard Prometheus fixed buckets or just output the entire histogram, which is also possible.
  2. If fixed buckets, we will need to degrade the internal histogram data to the format Prometheus expects. @postwait @ramaraochavali et al can help here. This will probably involve some histogram interface additions, but nothing too complicated, as we will just be operating on the already latched histograms, so there are no threading issues to contend with.
  3. Then just write them out (see the sketch below). :)
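
For step 3, the histogram text output looks roughly like what the following sketch produces. This is not Envoy's admin code; the metric name and labels are placeholders. Note that Prometheus buckets are cumulative and a final +Inf bucket equal to the total sample count is required:

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Writes one histogram in the Prometheus text exposition format, given
// fixed bucket bounds and cumulative counts plus the overall sum and count.
std::string formatHistogram(const std::string& name, const std::string& labels,
                            const std::vector<double>& bounds,
                            const std::vector<uint64_t>& cumulative_counts,
                            double sum, uint64_t count) {
  std::ostringstream out;
  out << "# TYPE " << name << " histogram\n";
  for (size_t i = 0; i < bounds.size(); ++i) {
    out << name << "_bucket{" << labels << ",le=\"" << bounds[i] << "\"} "
        << cumulative_counts[i] << "\n";
  }
  // The +Inf bucket must equal the total number of samples.
  out << name << "_bucket{" << labels << ",le=\"+Inf\"} " << count << "\n";
  out << name << "_sum{" << labels << "} " << sum << "\n";
  out << name << "_count{" << labels << "} " << count << "\n";
  return out.str();
}
```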

@dio
Member

dio commented Nov 1, 2018

Hey @stevesloka, are you working on this?

@stevesloka
Member

stevesloka commented Nov 1, 2018

No, I have not. If you have cycles, feel free to take it. If not, I can try soon; my week has gotten away from me.

@suhailpatel
Contributor

suhailpatel commented Jan 15, 2019

👋, I managed to pick this up and do most of the plumbing in #5601 to expose buckets and plumb that through into the Prometheus output. Hopefully someone can get a chance to review it. I initially wanted to get configurable buckets in as well, but the PR was getting quite complex as is and I wanted to make sure it was along the right lines.

(this is my first real foray in proper C++ code so please do review it with a fine tooth comb 😃, thanks!)

@MarcinFalkowski
Contributor

Hey,
What about support for the usedonly parameter (https://www.envoyproxy.io/docs/envoy/latest/operations/admin.html?highlight=usedonly#get--stats?usedonly)? It is supported now for the statsd and JSON formats, but not for the Prometheus format: https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L617

In my experience, it is also respected when Envoy publishes metrics to statsd. That is a huge advantage: with 1000+ clusters and connections to only a couple of them, we don't want to publish metrics for every cluster. It would be very helpful if the Prometheus format supported this too.

If you think this is easy enough, I could try to implement it (but I have very little C++ experience).

@brancz

brancz commented Jan 17, 2019

Exposing only the metrics that have been updated since the last scrape breaks Prometheus in various ways (time series are incorrectly marked as stale; two different Prometheus servers scraping the same target would interfere with each other). Printing these metrics is not a performance concern, and neither is processing them in Prometheus. Besides that, it's off topic: if you only want to collect metrics from individual Envoy servers, that's just a matter of configuring Prometheus not to scrape the others.

@pschultz

Agreed. usedonly doesn't make sense for Prometheus. See https://www.robustperception.io/existential-issues-with-metrics.

@ramaraochavali
Contributor

@brancz @pschultz just to clarify the usedonly semantics: usedonly indicates whether a metric has ever been updated since Envoy started, not whether it has been updated since the last scrape. Even if a metric has not been updated since the last scrape, it is still treated as used as long as it has ever been updated.
See https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats?usedonly for more details

@brancz

brancz commented Jan 17, 2019

Thanks for the clarification. It still wouldn't work well with Prometheus: when a process restarts and a counter resets, there is a difference between a counter being 0 and it not being there at all. The former is the continuation of a time series; the latter means the time series has ended.

@ggreenway
Contributor

I am in favor of adding usedonly. It is off by default, so it only affects people that opt in to it. I think it's useful in some cases.

@brancz

brancz commented Jan 17, 2019

For Prometheus it's a violation of the format. For statsd I absolutely see the use.

@ggreenway
Contributor

@brancz I understand your point about why this could be problematic, but the lack of usedonly doesn't prevent this case from happening anyway.

For instance, one thing being worked on in Envoy is the ability to lazy-load some configuration: when a connection comes in with an SNI value that hasn't been seen before, Envoy could contact the mgmt/config server and ask for config. This would cause new metrics to appear, and if Envoy were restarted, those metrics would disappear again until a connection caused Envoy to load that particular bit of config again.

Another related use case we hear about is configurations that are very very large, where only a small subset is expected to be used on any given Envoy instance. (Some would argue that a better control plane should be used to prevent this, but the world is complicated.) Only publishing metrics for the small portion of config that is used can hugely reduce the number of published metrics.

Given this context, do you still feel that it would always be inappropriate to have usedonly be available for prometheus metrics?

@mattklein123
Member Author

Given this context, do you still feel that it would always be inappropriate to have usedonly be available for prometheus metrics?

FWIW I think we should add it. This is a tremendous perf benefit in many cases for metrics backends.

@suhailpatel
Contributor

suhailpatel commented Jan 17, 2019

I also believe we should add usedonly (it's actually a further addition I'm wrapping up in a separate branch for after the histogram changes are merged). It drastically reduces the number of metrics (especially if you have hundreds of clusters).

FWIW: In our setup, when we scrape Envoy with Prometheus, we add a label to each metric to identify the Envoy pod it came from (so we can isolate metrics for a specific Envoy pod). If a pod restarts, it gets a new label combination. This increases the cardinality, but for this use case usedonly would be perfect.

@brancz

brancz commented Jan 17, 2019

There's a big difference between metrics being hidden because they were never measured/incremented and metrics being hidden because they haven't been updated since the last scrape. The latter would always be problematic for use with Prometheus, but the way I read your description now @ggreenway, it sounds like the former. The former I think would be OK to have, as it would be a conscious decision by a user, meaning they know that a non-existent metric may mean the same as a 0 metric.

@suhailpatel
Contributor

suhailpatel commented Jan 17, 2019

@brancz To clarify, metrics disappearing after the last scrape would be a violation and would indeed break a lot of Prometheus setups. That isn't what's happening here: metrics accumulate over the lifetime of the Envoy process.

The purpose of usedonly is only to exclude metrics which haven't been touched/emitted even once during the lifetime of the process.

@brancz

brancz commented Jan 18, 2019

Sorry for the misunderstanding. That would be OK if it is an explicit action by the user, which a query parameter would be.
