stats: native prometheus export support #1947

Closed
1 of 2 tasks
mattklein123 opened this issue Oct 26, 2017 · 71 comments
Labels
enhancement (Feature requests. Not bugs or questions.), help wanted (Needs help!)
Comments

@mattklein123
Member

mattklein123 commented Oct 26, 2017

Things that need sorting out:

@mattklein123
Member Author

Related to envoyproxy/data-plane-api#210

@mattklein123
Member Author

cc @mrice32

mattklein123 added the help wanted label on Oct 26, 2017
@emmanuel

Some broad context and a few points of discussion are here: christian-posta/envoy-microservices-patterns#2 (probably not terribly relevant).

@brancz

brancz commented Oct 27, 2017

Why would Envoy need to implement this itself? The C++ lib for Prometheus that we promote on prometheus.io seems to have histogram support.

I would encourage improving the existing libraries instead of rolling a custom implementation for native Prometheus support.

@mattklein123
Member Author

@brancz Envoy has a very large amount of plumbing already in place for stats, export, etc. It also has a very specific threading model. We also need built-in native HDR histograms in Envoy for other features. We will definitely investigate the various libraries available and figure out the right path forward before anyone starts work on this feature. (We first need to find someone who wants to build this specific feature; having HDR histogram support in Envoy is orthogonal and I will open a separate issue on that.)

@mattklein123
Member Author

I just synced up with the Prometheus team. Here is the update:

  • We will go with a pull model for Prometheus stats in Envoy.
  • We will start with a text endpoint to output counters and gauges only (see the sketch after this list for roughly what that output looks like). This will take me about 2 hours so I'm just going to do this.
  • Later we will need help with:
    • HDR histogram support in Envoy stat store (may get done orthogonally for other reasons)
    • Output histograms in stats endpoints
    • Output text+gzip Prometheus format
    • Output proto Prometheus format
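
For reference, the Prometheus text exposition format for counters and gauges is simple line-oriented output. The following is a minimal standalone sketch of what the text endpoint would need to emit; it is not Envoy code, and the metric and label names are made up:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Formats one counter in the Prometheus text exposition format.
// The "# TYPE" metadata line is optional but recommended.
std::string formatCounter(const std::string& name, const std::string& labels,
                          uint64_t value) {
  std::ostringstream out;
  out << "# TYPE " << name << " counter\n";              // type metadata line
  out << name << "{" << labels << "} " << value << "\n"; // sample line
  return out.str();
}

// Example output (hypothetical metric name):
//   # TYPE envoy_cluster_upstream_rq_total counter
//   envoy_cluster_upstream_rq_total{cluster="service1"} 1234
```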

@mattklein123
Member Author

mattklein123 commented Oct 28, 2017

@lita @jmphilli the first part of this (counter/gauge output) is a really good beginner ticket so if you want to give it a go that would be great. Basically we need to do the following:

More planning is required for histograms. We can deal with that later.

@brancz

brancz commented Oct 29, 2017

If at all possible, /metrics is the default path for Prometheus. It's configurable on the Prometheus side, but usually when a project adds native Prometheus support it adopts this convention.

Description and type information about a metric is technically optional, but it is best practice to include it, as it allows Prometheus to perform certain optimizations based on the type. A description can also be helpful for display when querying (neither of those things is implemented today, but they have been discussed and the consensus is that this information is good to have). If this is problematic to add now, it can also be done in follow-up improvements.

@SuperQ

SuperQ commented Oct 31, 2017

👍 for having /metrics be the Prometheus output path.

@mattklein123
Member Author

Sure, doing /metrics for the Prometheus admin path is easy enough.

@lita

lita commented Oct 31, 2017

I can update the path to be /metrics then. Right now I have implemented it as /status?format=prometheus.

@SuperQ

SuperQ commented Nov 1, 2017

@lita Cool, thanks.

Typically for Prometheus, we use Content-Type headers, rather than URL params to adjust transport formats. We use the text format as the fallback method.

@mattklein123
Member Author

@lita FYI you can use #1919 to optionally compress the output. I would do this as a follow up.

@mattklein123
Member Author

Update here: Once #3130 (review) lands we will be able to trivially export histogram summaries for Prometheus, and then we can consider this issue fixed. If someone wants to sign up for doing the histogram export that would be awesome!!!

@ggreenway
Contributor

I'll do this eventually if nobody else has yet, but I won't have time to start on it for awhile. If/when I start working on it, I'll comment here and assign it to myself. If that hasn't happened yet, someone else can claim it.

@JonathanO
Contributor

Is the plan to enable export of histograms as well as summaries?

Summaries are nice for looking at a single instance, but our dashboards aggregate across multiple instances, and to do that the raw histogram type is needed.

@mattklein123
Member Author

@JonathanO the way the histogram library works it should be pretty straightforward to export both summaries and the full histograms as desired. cc @ramaraochavali @postwait

@SuperQ

SuperQ commented Apr 24, 2018

There's no need to bother with summaries; they have almost no usefulness since they can't be aggregated.

@ggreenway
Contributor

Yeah, I'd argue against doing summaries at all. They can be generated by another (external) tool.

@stevesloka
Member

Is this it? #3130

@redhotpenguin

/stats/ exposes the calculated quantiles, but not the bin counts.

From my dev instance:
cluster.service1.external.upstream_rq_time: P0(nan,13) P25(nan,13.25) P50(nan,13.5) P75(nan,13.75) P90(nan,13.9) P95(nan,13.95) P99(nan,13.99) P99.5(nan,13.995) P99.9(nan,13.999) P100(nan,14)

I'm working on exposing the log-linear histogram data serialized in a stats sink. I've been trying to wrangle Bazel into building a gRPC client to hit the existing endpoint; the default metrics service usage isn't that clear to me currently.

@ramaraochavali
Contributor

The existing Metrics Service currently exposes quantiles only via the gRPC endpoint, but it is very easy to add bin support since it follows the Prometheus proto. You would change https://github.com/envoyproxy/envoy/blob/master/source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.cc#L64 to do that. ParentHistogram holds the log-linear histogram; you can expose the bins via the ParentHistogram interface.
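
To illustrate what "adding bin support" means, here is a rough standalone sketch of turning latched histogram samples into the cumulative buckets the Prometheus histogram model expects. The types below are stand-ins for Envoy's ParentHistogram statistics and the io.prometheus.client proto, not the real interfaces:

```cpp
#include <cstdint>
#include <vector>

// Stand-in for one bucket in the Prometheus histogram data model.
struct Bucket {
  double upper_bound;         // inclusive "le" bound
  uint64_t cumulative_count;  // number of samples <= upper_bound
};

// Converts raw latched samples into cumulative buckets for the given bounds.
std::vector<Bucket> toCumulativeBuckets(const std::vector<double>& samples,
                                        const std::vector<double>& bounds) {
  std::vector<Bucket> buckets;
  for (double bound : bounds) {
    uint64_t count = 0;
    for (double sample : samples) {
      if (sample <= bound) {
        ++count;  // Prometheus buckets are cumulative, not per-bin deltas
      }
    }
    buckets.push_back({bound, count});
  }
  return buckets;
}
```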

@mattklein123
Member Author

mattklein123 commented Oct 30, 2018

For whoever wants to work on this, here are some code references. I don't think this is too hard to finish, someone just needs to dig in:

  1. https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L620
  2. For an example of how histograms are written out for other things, see the gRPC code that @ramaraochavali referenced above or also here: https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L676

Basically the work here, as @postwait said, is as follows:

  1. Decide if we are going to output standard Prometheus fixed buckets or just output the entire histogram, which is also possible.
  2. If fixed buckets, we will need to degrade the internal histogram data to the format Prometheus expects. @postwait @ramaraochavali et al can help here. This will probably involve some histogram interface additions, but nothing too complicated, as we will just be operating on the already latched histograms, so there are no threading issues to contend with.
  3. Then just write them out (see the sketch below). :)
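
For step 3, the histogram text output looks roughly like what the following sketch produces. This is not Envoy's admin code; the metric name and labels are placeholders. Note that Prometheus buckets are cumulative and a final +Inf bucket equal to the total sample count is required:

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Writes one histogram in the Prometheus text exposition format, given
// fixed bucket bounds and cumulative counts plus the overall sum and count.
std::string formatHistogram(const std::string& name, const std::string& labels,
                            const std::vector<double>& bounds,
                            const std::vector<uint64_t>& cumulative_counts,
                            double sum, uint64_t count) {
  std::ostringstream out;
  out << "# TYPE " << name << " histogram\n";
  for (size_t i = 0; i < bounds.size(); ++i) {
    out << name << "_bucket{" << labels << ",le=\"" << bounds[i] << "\"} "
        << cumulative_counts[i] << "\n";
  }
  // The +Inf bucket must equal the total number of samples.
  out << name << "_bucket{" << labels << ",le=\"+Inf\"} " << count << "\n";
  out << name << "_sum{" << labels << "} " << sum << "\n";
  out << name << "_count{" << labels << "} " << count << "\n";
  return out.str();
}
```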

@dio
Member

dio commented Nov 1, 2018

Hey @stevesloka, are you working on this?

@stevesloka
Member

stevesloka commented Nov 1, 2018

No, I have not. If you have cycles, feel free to take it. If not, I can try soon; my week has gotten away from me.

@suhailpatel
Contributor

suhailpatel commented Jan 15, 2019

👋, I managed to pick this up and do most of the plumbing in #5601 to expose buckets and plumb that through into the Prometheus output. Hopefully someone can get a chance to review it. I initially wanted to get configurable buckets in as well, but the PR was getting quite complex as is and I wanted to make sure it was along the right lines.

(this is my first real foray in proper C++ code so please do review it with a fine tooth comb 😃, thanks!)

@MarcinFalkowski
Contributor

Hey,
What about support for the usedonly parameter (https://www.envoyproxy.io/docs/envoy/latest/operations/admin.html?highlight=usedonly#get--stats?usedonly)? It is supported now for the statsd and JSON formats, but not for the Prometheus format: https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L617

In my experience, it is also respected when Envoy publishes metrics to statsd. That is a huge advantage: with 1000+ clusters and connections to only a couple of them, we don't want to publish metrics for every cluster. It would be very helpful if the Prometheus format supported this too.

If you think this is easy enough, I could try to implement it (but I have very little C++ experience).

@brancz

brancz commented Jan 17, 2019

Exposing only the metrics that have been updated since the last scrape breaks Prometheus in various ways (time series are incorrectly marked as stale; two different Prometheus servers scraping the same target would interfere with each other). Printing these metrics is not a performance concern, and neither is processing them in Prometheus. Besides that, it's off topic: if you only want to collect metrics from individual Envoy servers, that's just a matter of configuring Prometheus not to scrape the others.

@pschultz

Agreed. usedonly doesn't make sense for Prometheus. See https://www.robustperception.io/existential-issues-with-metrics.

@ramaraochavali
Contributor

@brancz @pschultz just to clarify the usedonly semantics: usedonly indicates whether a metric has ever been updated since Envoy started, not whether it has been updated since the last scrape. Even if a metric has not been updated since the last scrape, it is still treated as used as long as it has ever been updated.
See https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats?usedonly for more details

@brancz

brancz commented Jan 17, 2019

Thanks for the clarification. It still wouldn't work well with Prometheus: when a process restarts and a counter resets, there is a difference between a counter being 0 and it not being there at all. The former is the continuation of a time series; the latter means the time series has ended.

@ggreenway
Contributor

I am in favor of adding usedonly. It is off by default, so it only affects people that opt in to it. I think it's useful in some cases.

@brancz

brancz commented Jan 17, 2019

For Prometheus it's a violation of the format. For statsd I absolutely see the use.

@ggreenway
Contributor

@brancz I understand your point about why this could be problematic, but the lack of usedonly doesn't prevent this case from happening anyway.

For instance, one thing being worked on in Envoy is the ability to lazy-load some configuration: when a connection comes in with an SNI value that hasn't been seen before, Envoy could contact the mgmt/config server and ask for config. This would cause new metrics to appear, and if Envoy were restarted, those metrics would disappear again until a connection caused Envoy to load that particular bit of config again.

Another related use case we hear about is configurations that are very very large, where only a small subset is expected to be used on any given Envoy instance. (Some would argue that a better control plane should be used to prevent this, but the world is complicated.) Only publishing metrics for the small portion of config that is used can hugely reduce the number of published metrics.

Given this context, do you still feel that it would always be inappropriate to have usedonly be available for prometheus metrics?

@mattklein123
Member Author

Given this context, do you still feel that it would always be inappropriate to have usedonly be available for prometheus metrics?

FWIW I think we should add it. This is a tremendous perf benefit in many cases for metrics backends.

@suhailpatel
Contributor

suhailpatel commented Jan 17, 2019

I also believe we should add usedonly (it's actually a further addition I'm wrapping up in a separate branch for after the histogram changes are merged). It drastically reduces the number of metrics (especially if you have hundreds of clusters).

FWIW: In our setup, when we scrape Envoy with Prometheus, we add a label to each metric to identify the Envoy pod it came from (so we can isolate metrics for a specific Envoy pod). If a pod restarts, it gets a new label combination. This increases the cardinality, but for this use case usedonly would be perfect.

@brancz

brancz commented Jan 17, 2019

There's a big difference between metrics being hidden because they were never measured/incremented and metrics being hidden because they haven't been updated since the last scrape. The latter would always be problematic for use with Prometheus, but the way I read your description now @ggreenway, it sounds like the former. The former I think would be OK to have, as it would be a conscious decision by a user, meaning they know that a non-existent metric may mean the same as a 0 metric.

@suhailpatel
Contributor

suhailpatel commented Jan 17, 2019

@brancz To clarify, metrics disappearing after the last scrape would be a violation and would indeed break a lot of Prometheus setups. That isn't what's happening here: metrics accumulate over the lifetime of the Envoy process.

The purpose of usedonly is only to exclude metrics which haven't been touched/emitted even once during the lifetime of the process.

@brancz

brancz commented Jan 18, 2019

Sorry for the misunderstanding. That would be OK if it is an explicit action by the user, which a query parameter would be.
