stats: native prometheus export support #1947
Comments
Related to envoyproxy/data-plane-api#210 |
cc @mrice32 |
Some broad context and a few points of discussion are here: christian-posta/envoy-microservices-patterns#2 (probably not terribly relevant). |
Why would Envoy need to implement this itself? The C++ lib for Prometheus that we promote on prometheus.io seems to have histogram support. I would encourage improving the existing libraries instead of rolling a custom implementation for native Prometheus support. |
@brancz Envoy has a very large amount of plumbing already in place for stats, export, etc. It also has a very specific threading model. We also need built-in native HDR histograms in Envoy for other features. We will definitely investigate the various libraries available and figure out the right path forward before anyone starts work on this feature. (We first need to find someone who wants to build this specific feature; having HDR histogram support in Envoy is orthogonal, and I will open a separate issue on that.) |
I just synced up with the Prometheus team. Here is the update:
|
@lita @jmphilli the first part of this (counter/gauge output) is a really good beginner ticket so if you want to give it a go that would be great. Basically we need to do the following:
More planning is required for histograms. We can deal with that later. |
Description and type information about a metric is technically optional; however, it is best practice to include it if at all possible, as it allows Prometheus to do certain optimizations based on the type. A description could be helpful for display when querying (neither of those things is implemented today, but they have been discussed and the consensus is that this information is good to have). However, if this is problematic to add, it can be done in follow-up improvements as well. |
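As an illustration of the description and type metadata discussed above, here is a minimal sketch of rendering one counter or gauge in the Prometheus text exposition format, with `# HELP` and `# TYPE` lines. The function names, the sample stat name, and the help text are hypothetical and are not taken from Envoy's code; the help text is assumed to be a single line with no characters that need escaping.

```python
import re


def sanitize_name(name: str) -> str:
    """Map a non-empty dotted stat name onto Prometheus' metric-name characters."""
    cleaned = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    # Prometheus metric names may not start with a digit.
    return "_" + cleaned if cleaned[0].isdigit() else cleaned


def render_metric(name: str, metric_type: str, help_text: str, value) -> str:
    """Render one counter or gauge preceded by its HELP and TYPE lines."""
    prom_name = sanitize_name(name)
    return (
        f"# HELP {prom_name} {help_text}\n"
        f"# TYPE {prom_name} {metric_type}\n"
        f"{prom_name} {value}\n"
    )


# Hypothetical stat name and help text, for illustration only.
print(render_metric("http.downstream_rq_total", "counter",
                    "Total downstream requests.", 42))
```

Both metadata lines are optional in the format, so an exporter could start by emitting only the sample line and add the metadata in a follow-up.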
👍 for having |
Sure doing |
I can update the path to be /metrics then. Right now, I have implemented it as /status?format=prometheus |
@lita Cool, thanks. Typically for Prometheus, we use |
Update here: Once #3130 (review) lands we will be able to trivially export histogram summaries for Prometheus, and then we can consider this issue fixed. If someone wants to sign up for doing the histogram export that would be awesome!!! |
I'll do this eventually if nobody else has yet, but I won't have time to start on it for a while. If/when I start working on it, I'll comment here and assign it to myself. If that hasn't happened yet, someone else can claim it. |
Is the plan to enable export of histograms as well as summaries? Summaries are nice for looking at a single instance, but our dashboards produce an aggregate from multiple instances together and to do that the raw histogram type is needed. |
@JonathanO the way the histogram library works it should be pretty straightforward to export both summaries and the full histograms as desired. cc @ramaraochavali @postwait |
There's no need to bother with summaries, they have almost no usefulness since they can't be aggregated. |
Yeah, I'd argue against doing summaries at all. They can be generated by another (external) tool. |
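To make the aggregation argument concrete, here is a rough sketch of why raw histogram buckets can be combined across instances while precomputed quantiles cannot. Cumulative bucket counts from instances that share the same bucket bounds are simply added, and a quantile is then estimated from the merged counts by linear interpolation. This follows the general idea behind Prometheus' `histogram_quantile`, not its exact implementation; the function names and sample data are hypothetical.

```python
from typing import Dict


def merge_buckets(*instances: Dict[float, int]) -> Dict[float, int]:
    """Sum cumulative bucket counts across instances sharing bucket bounds."""
    merged: Dict[float, int] = {}
    for buckets in instances:
        for upper_bound, count in buckets.items():
            merged[upper_bound] = merged.get(upper_bound, 0) + count
    return dict(sorted(merged.items()))


def estimate_quantile(q: float, cumulative: Dict[float, int]) -> float:
    """Estimate quantile q by linear interpolation inside the matching bucket."""
    bounds = sorted(cumulative)
    rank = q * cumulative[bounds[-1]]  # the last bucket holds the total count
    lower_bound, lower_count = 0.0, 0
    for upper_bound in bounds:
        count = cumulative[upper_bound]
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return bounds[-1]


# Two hypothetical instances with identical bucket bounds (in ms).
inf = float("inf")
instance_a = {10.0: 90, 100.0: 100, inf: 100}
instance_b = {10.0: 10, 100.0: 100, inf: 100}
merged = merge_buckets(instance_a, instance_b)
print(merged)
print(estimate_quantile(0.75, merged))
```

A summary would only expose each instance's own quantile values, and there is no general way to combine those into a fleet-wide quantile after the fact.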
Is this it? #3130 |
From my dev instance: I'm working on exposing the log linear histogram data serialized in a stats sink. Have been trying to wrangle bazel into building a grpc client to hit the existing endpoint - the default metrics service usage isn't that clear to me currently. |
The existing Metrics Service currently exposes only quantiles via the gRPC endpoint. But it is very easy to add bin support, as it follows the Prometheus proto. You should change https://github.com/envoyproxy/envoy/blob/master/source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.cc#L64 to do that. ParentHistogram has the log linear histogram. You can expose it via the ParentHistogram interface. |
For whoever wants to work on this, here are some code references. I don't think this is too hard to finish, someone just needs to dig in:
Basically the work here as @postwait said is twofold:
|
Hey @stevesloka, are you working on this? |
No I have not, if you have cycles feel free to take it. If not I can try soon, my week has gotten away from me. |
👋, I managed to pick this up and do most of the plumbing in #5601 to expose buckets and plumb that through into the Prometheus output. Hopefully someone can get a chance to review it. I did initially want to get in the configurable buckets but the PR was getting quite complex as is and I wanted to make sure it was along the right lines. (this is my first real foray in proper C++ code so please do review it with a fine tooth comb 😃, thanks!) |
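For readers unfamiliar with what exposing buckets in the Prometheus output involves, here is a small sketch, independent of the actual changes in #5601. Prometheus histogram buckets are cumulative, so per-bin counts have to be accumulated into `le` ("less than or equal") buckets, followed by a `+Inf` bucket, a `_sum`, and a `_count`. The function name, metric name, and bucket bounds are hypothetical.

```python
from typing import List, Sequence


def render_histogram(name: str, upper_bounds: Sequence[float],
                     bin_counts: Sequence[int], total_sum: float) -> str:
    """Render per-bin counts as cumulative Prometheus histogram buckets.

    bin_counts must have one more entry than upper_bounds; the extra,
    final entry is the overflow bin above the largest bound.
    """
    lines: List[str] = [f"# TYPE {name} histogram"]
    running = 0
    for bound, count in zip(upper_bounds, bin_counts):
        running += count  # buckets are cumulative, not per-bin
        lines.append(f'{name}_bucket{{le="{bound}"}} {running}')
    running += bin_counts[len(upper_bounds)]  # add the overflow bin
    lines.append(f'{name}_bucket{{le="+Inf"}} {running}')
    lines.append(f"{name}_sum {total_sum}")
    lines.append(f"{name}_count {running}")
    return "\n".join(lines) + "\n"


# Hypothetical bounds (ms) and per-bin counts, for illustration only.
print(render_histogram("rq_time_ms", [5, 10], [2, 3, 1], 40))
```

Configurable buckets would then amount to letting the user choose `upper_bounds` rather than hard-coding them.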
Hey, From my experience, it is also respected when Envoy publishes metrics to If you think this is easy enough, I could try to implement this (but I have very little C++ experience). |
Exposing only the metrics that have been updated since the last scrape breaks Prometheus in various ways (time-series are incorrectly marked as stale; two different Prometheuses scraping would interfere with each other). Printing these metrics is not a performance concern, nor is processing them in Prometheus. Besides that, it's off topic. If you want to only collect metrics from individual Envoy servers, that's just a matter of configuring Prometheus not to scrape those servers. |
Agreed. |
@brancz @pschultz just clarify the |
Thanks for the clarification. It still wouldn't work well with Prometheus, because when a process restarts and a counter resets, there is a difference between a counter being 0 and not being there at all. The former is the continuation of a time-series; the latter means the time-series has ended. |
I am in favor of adding usedonly. It is off by default, so it only affects people that opt in to it. I think it's useful in some cases. |
For Prometheus it's a violation of the format. For statsd I absolutely see the use. |
@brancz I understand your point about why this could be problematic. But the lack of For instance, something Envoy is working on is the ability to lazy-load some configuration. For example, when a connection comes in with an SNI value that hasn't been seen before, Envoy could contact the mgmt/config server and ask for config. This would cause new metrics to appear, and if Envoy were to be restarted, those metrics would disappear again until a connection caused Envoy to load that particular bit of config again. Another related use case we hear about is configurations that are very very large, where only a small subset is expected to be used on any given Envoy instance. (Some would argue that a better control plane should be used to prevent this, but the world is complicated.) Only publishing metrics for the small portion of config that is used can hugely reduce the number of published metrics. Given this context, do you still feel that it would always be inappropriate to have |
FWIW I think we should add it. This is a tremendous perf benefit in many cases for metrics backends. |
I also believe we should add FWIW: In our set up, when we scrape Envoy with Prometheus, we add a label to each metric to identify the Envoy pod it came from (so we can isolate metrics for a specific Envoy pod). If a pod restarts, it gets a new label combination. It increases the cardinality but for this use case, a |
There's a big difference between metrics being hidden because they were never measured/incremented and metrics being hidden because they haven't changed since the last scrape. The latter would always be problematic for use with Prometheus, but the way I read your description now @ggreenway, it sounds like the former. The former I think would be OK to have, as it would be a conscious decision by a user who knows that a nonexistent metric may mean the same as a 0 metric. |
@brancz To clarify, metrics disappearing after the last scrape is a violation and would indeed break a lot of Prometheus set ups. This isn't what's happening here. Metrics accumulate over the lifetime of the Envoy process. The purpose of |
Sorry for misunderstanding. That would be ok if it is an explicit action by a user, which a query parameter would be. |
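The distinction the thread settles on can be sketched as follows: an opt-in filter hides only stats that have never been updated during the process lifetime, so a stat that has appeared once stays visible on every later scrape even if its value has not changed. This is an illustrative model only, not Envoy's implementation; the `Stat` class, its `used` flag, and the `scrape` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Stat:
    name: str
    value: int = 0
    used: bool = False  # set on first update, never cleared while the process lives

    def inc(self, amount: int = 1) -> None:
        self.value += amount
        self.used = True


def scrape(stats: Iterable[Stat], used_only: bool = False) -> Dict[str, int]:
    """Return stat values; with used_only, hide stats that were never updated."""
    return {s.name: s.value for s in stats if s.used or not used_only}


# Hypothetical stats: one that gets traffic and one that never does.
active = Stat("cluster.a.upstream_rq_total")
idle = Stat("cluster.b.upstream_rq_total")
active.inc()

print(scrape([active, idle], used_only=True))  # idle stat is hidden
print(scrape([active, idle], used_only=True))  # same result on the next scrape
print(scrape([active, idle]))                  # default: everything is exposed
```

Because the filter never looks at what the previous scrape returned, two scrapers cannot interfere with each other, and a time-series does not vanish between scrapes while the process is alive.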
Description: this PR allows the EngineBuilder to configure fallback DNS servers. Risk Level: low -- new API Testing: added tests. Signed-off-by: Jose Nino <[email protected]> Signed-off-by: JP Simard <[email protected]>
Things that need sorting out: