stats: prometheus scrape failed #27173
It seems that some metrics are now rendered multiple times (effectively being treated as independent metrics rather than the same metric with differing tags). Example of a problematic metric in after.txt vs the same metric in before.txt.
Based on the attached after.txt, the metrics whose output is incorrect are
while all other stats are rendered correctly, and as you can see the dupes progressively decrease. Hm, weird: I wonder if this is due to something specific to these tests or if there's some underlying issue introduced by #24998
@ggreenway for any ideas
Scanning the issue description, it seems like the issue is a change from the previous rendering behavior. It should be easy to repro in a unit test. I'm hoping we can fix forward quickly.
I am afraid this is not a simple issue with the way a metric is rendered; something seems amiss at a more fundamental level. I can easily observe the same issue if I run Envoy locally with a few clusters configured: when I curl the prometheus stats admin endpoint, I can see that the output is completely jumbled up. Here's an excerpt of the output:
The expected result is that metrics with the same name (e.g. envoy_cluster_assignment_stale) but different tags (e.g. envoy_cluster_name) are grouped together and rendered as one "block", with one "# TYPE" line preceding each group, e.g.:
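The grouping described above can be illustrated with a minimal Python sketch (this is not Envoy's implementation; the sample metric names and values are illustrative): all samples sharing a metric name are emitted under a single "# TYPE" header, regardless of which scope they came from.

```python
from collections import OrderedDict

def render_grouped(samples):
    """samples: list of (metric_name, labels_str, value) tuples,
    possibly interleaved across scopes."""
    # Coalesce all samples for the same metric name, preserving
    # first-seen order of names.
    groups = OrderedDict()
    for name, labels, value in samples:
        groups.setdefault(name, []).append((labels, value))
    lines = []
    for name, entries in groups.items():
        lines.append(f"# TYPE {name} gauge")  # exactly one TYPE line per name
        for labels, value in entries:
            lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines)

samples = [
    ("envoy_cluster_assignment_stale", 'envoy_cluster_name="a"', 0),
    ("envoy_cluster_bind_errors", 'envoy_cluster_name="a"', 0),
    ("envoy_cluster_assignment_stale", 'envoy_cluster_name="b"', 0),
]
print(render_grouped(samples))
```

The buggy output in after.txt is what you get without the coalescing step: a fresh "# TYPE" header per scope, which Prometheus rejects as a duplicate metric family.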
I tried briefly to reproduce this issue in a test but haven't managed so far. I don't have any leads yet, so we might need to revert the changes in #24998, unless @jmarantz or @ggreenway have any ideas or pointers on what might be going on?
@zirain would you be able to provide a bootstrap config to help repro? Thanks! |
You can take configs/envoy-demo and duplicate the cluster. |
I see. I'm thinking the problem is that tag-extracted names that need to be grouped together are coming from different Stats::Scope objects, and we are streaming them out from the first scope before examining the second scope. This is going to require some thought on how to fix without compromising the speed/memory benefit offered by the new algo @rulex123 deployed. |
I'm pretty sure we can fix this by considering all the scopes named the same thing at the same time. E.g. don't stream out the results from a scope named "foo" until we encounter (in our alphabetic iteration through scopes) a scope named something other than "foo". I have an attempt at a repro in progress but need to be AFK for some time. I'm pretty sure this is fix-forwardable, but if folks want to roll back, that makes sense.
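The buffering idea above can be sketched in a few lines of Python (a hypothetical illustration, not Envoy's C++ code; the scope names and stat lines are made up): iterate scopes in sorted order, merge the stats of consecutive scopes that share a name, and only flush a block once a differently-named scope is reached.

```python
from itertools import groupby

def stream_stats(scopes):
    """scopes: list of (scope_name, [stat_line, ...]) pairs.
    Yields one merged block per distinct scope name."""
    # Sorting guarantees same-named scopes are adjacent, so groupby
    # can coalesce them before anything is streamed out.
    for name, group in groupby(sorted(scopes), key=lambda s: s[0]):
        merged = []
        for _, stats in group:
            merged.extend(stats)   # buffer while the scope name repeats
        yield name, merged         # safe to flush: next name differs

scopes = [
    ("cluster.a", ["upstream_cx_total 1"]),
    ("cluster.b", ["upstream_cx_total 2"]),
    ("cluster.a", ["upstream_rq_total 3"]),  # same scope name again
]
blocks = list(stream_stats(scopes))
```

The trade-off noted above applies: the buffer for a given name must be held until the name changes, so worst-case memory grows with the largest group of same-named scopes, partially eroding the streaming benefit.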
I thought I had a theory and a repro in a unit test, but I don't. In fact I can't repro it. Right now it feels to me like the reason I couldn't get it to fail is that the streaming behavior of stats_request.cc is not working as expected, and the buffering that's going on is enabling the mechanism to coalesce all the common tag-extracted names. It might be that with a few more clusters there's some flushing going on, but I haven't sorted it out. I think I'll write a revert PR if no one else has yet. #27224 is my attempt at a repro in a unit test. Note I used ThreadLocalStore in the unit test because I was thinking multiple distinct scopes were needed to repro this issue, and that really doesn't happen with the TestStore instances based on IsolatedStoreImpl.
It would be helpful if someone could post a complete repro; e.g. attach a file; maybe an altered version of configs/envoy-demo.yaml. |
@jmarantz I test locally with this docker-compose https://github.com/zirain/playground/blob/master/envoy/docker-compose.yml |
All I need is the yaml config file; I don't think it matters if it's in docker or not. |
Thanks @zirain; that does it:
Closing this issue as the PR was reverted. Following up in #16139 with further notes on why the algorithm didn't work as expected. |
Commit Message: Reproduces a scenario where it's difficult to use a streaming optimization for Prometheus stats based on scopes without introducing a bug.
Context:
- Issue that the /stats/prometheus endpoint takes too much memory: Prometheus stats handler used too much memory. #16139
- Solution for non-Prometheus endpoints to use less memory for /stats and run faster: admin: Streaming /stats implementation #19693, which I believe is working well.
- This solution mapped to Prometheus: Prom stats perf improvements #24998
- A case where this solution emits a duplicate # TYPE line due to two stats with the same tag-extracted stat name from two different scopes: stats: prometheus scrape failed #27173
Note that the existing unit tests did not exercise that scenario, so this PR adds a testcase.
Signed-off-by: Joshua Marantz <[email protected]>
Commit Message: Reverts envoyproxy#24998. See envoyproxy#27173.
Additional Description:
Risk Level: n/a
Testing: n/a
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Signed-off-by: Ryan Eskin <[email protected]>
The following golang code will pass with
envoyproxy/envoy:dev-01aed8aa302d707a5598dbfd063dcfe3b31655f8
and fail with
envoy:dev-6301fefe405d5c9bade12658034f17f5ad4cdd3c
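The Go snippet referenced above is not included here; the following is a Python check in the same spirit (a sketch, with made-up sample input): a Prometheus scrape body is invalid if the same metric name appears in more than one "# TYPE" line, which is what the after.txt output exhibits.

```python
def find_duplicate_type_lines(body: str):
    """Return metric names that appear in more than one '# TYPE' line."""
    seen, dupes = set(), []
    for line in body.splitlines():
        if line.startswith("# TYPE "):
            name = line.split()[2]  # "# TYPE <name> <kind>"
            if name in seen:
                dupes.append(name)
            seen.add(name)
    return dupes

# Shaped like the broken after.txt output: one TYPE header per scope
# instead of one per metric name.
bad = """# TYPE envoy_cluster_assignment_stale gauge
envoy_cluster_assignment_stale{envoy_cluster_name="a"} 0
# TYPE envoy_cluster_assignment_stale gauge
envoy_cluster_assignment_stale{envoy_cluster_name="b"} 0
"""
print(find_duplicate_type_lines(bad))
```

A real Prometheus client parser fails on such a body for the same reason, which matches the "scrape failed" symptom in the issue title.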
after.txt
before.txt
seems related to #24998
cc @kyessenov @rulex123 @jmarantz