
stats: prometheus scrape failed #27173

Closed
zirain opened this issue May 4, 2023 · 17 comments

zirain (Member) commented May 4, 2023

The following Go code:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
	"time"

	"github.com/prometheus/common/expfmt"
)

func main() {
	_, body, err := HTTPGet(fmt.Sprintf("http://127.0.0.1:%d/stats/prometheus", 9901))
	if err != nil {
		fmt.Println("get metric error: %v", err)
		return
	}
	reader := strings.NewReader(body)
	_, err = (&expfmt.TextParser{}).TextToMetricFamilies(reader)
	if err != nil {
		fmt.Println("parse metric error: %v body: %s", err, body)
	}
}

func HTTPGet(url string) (code int, respBody string, err error) {
	log.Println("HTTP GET", url)
	client := &http.Client{Timeout: time.Minute}
	resp, err := client.Get(url)
	if err != nil {
		log.Println(err)
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Println(err)
		return 0, "", err
	}
	respBody = string(body)
	code = resp.StatusCode
	return code, respBody, nil
}

will pass with envoyproxy/envoy:dev-01aed8aa302d707a5598dbfd063dcfe3b31655f8 but fail with envoyproxy/envoy:dev-6301fefe405d5c9bade12658034f17f5ad4cdd3c.

after.txt
before.txt

seems related to #24998

cc @kyessenov @rulex123 @jmarantz

zirain added the bug and triage (Issue requires triage) labels on May 4, 2023
rulex123 (Contributor) commented May 4, 2023

It seems that some metrics are now rendered multiple times (effectively being treated as independent metrics rather than the same metric with differing tags).

Example of a problematic metric in after.txt:

# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="forward_cluster"} 0
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="grafana"} 0
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="local_hugo_serve"} 0
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="thanos"} 0

vs. the same metric in before.txt:

# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="forward_cluster"} 0
envoy_cluster_assignment_stale{envoy_cluster_name="grafana"} 0
envoy_cluster_assignment_stale{envoy_cluster_name="local_hugo_serve"} 0
envoy_cluster_assignment_stale{envoy_cluster_name="thanos"} 0
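For context on why the scrape fails outright: the Prometheus text parser treats a second # TYPE line for an already-seen family as a parse error rather than a continuation. A minimal sketch, feeding the duplicated block above to the same expfmt parser used in the repro:

package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Two "# TYPE" headers for the same family, as rendered in after.txt.
	bad := `# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="forward_cluster"} 0
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="grafana"} 0
`
	_, err := (&expfmt.TextParser{}).TextToMetricFamilies(strings.NewReader(bad))
	// err reports a second TYPE line for "envoy_cluster_assignment_stale".
	fmt.Println(err)
}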

rulex123 (Contributor) commented May 4, 2023

Based on the attached after.txt, the metrics whose output is incorrect are:

grep "# TYPE"  after.txt | sort | uniq -c |sort -r
   4 # TYPE envoy_cluster_default_total_match_count counter
   4 # TYPE envoy_cluster_circuit_breakers_high_rq_retry_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_high_rq_pending_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_high_rq_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_high_cx_pool_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_high_cx_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_rq_retry_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_rq_pending_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_rq_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_cx_pool_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_cx_open gauge
   4 # TYPE envoy_cluster_bind_errors counter
   4 # TYPE envoy_cluster_assignment_timeout_received counter
   4 # TYPE envoy_cluster_assignment_stale counter
   3 # TYPE envoy_cluster_lb_zone_routing_sampled counter
   3 # TYPE envoy_cluster_lb_zone_routing_cross_zone counter
   3 # TYPE envoy_cluster_lb_zone_routing_all_directly counter
   3 # TYPE envoy_cluster_lb_zone_number_differs counter
   3 # TYPE envoy_cluster_lb_zone_no_capacity_left counter
   3 # TYPE envoy_cluster_lb_zone_cluster_too_small counter
   3 # TYPE envoy_cluster_lb_subsets_selected counter
   3 # TYPE envoy_cluster_lb_subsets_removed counter
   3 # TYPE envoy_cluster_lb_subsets_fallback_panic counter
   3 # TYPE envoy_cluster_lb_subsets_fallback counter
   3 # TYPE envoy_cluster_lb_subsets_created counter
   3 # TYPE envoy_cluster_lb_subsets_active gauge
   3 # TYPE envoy_cluster_lb_recalculate_zone_structures counter
   3 # TYPE envoy_cluster_lb_local_cluster_not_ok counter
   3 # TYPE envoy_cluster_lb_healthy_panic counter
   2 # TYPE envoy_cluster_retry_or_shadow_abandoned counter
   2 # TYPE envoy_cluster_original_dst_host_invalid counter
   2 # TYPE envoy_cluster_membership_total gauge
   2 # TYPE envoy_cluster_membership_healthy gauge
   2 # TYPE envoy_cluster_membership_excluded gauge
   2 # TYPE envoy_cluster_membership_degraded gauge
   2 # TYPE envoy_cluster_membership_change counter
   2 # TYPE envoy_cluster_max_host_weight gauge
   1 # TYPE envoy_workers_watchdog_miss counter

while all other stats are rendered correctly. As you can see, the duplicate counts progressively decrease. Hm, weird: I wonder if this is due to something specific to these tests or if there's some underlying issue introduced by #24998.

yanavlasov added the area/stats label and removed the triage (Issue requires triage) label on May 4, 2023
yanavlasov (Contributor) commented:

@ggreenway for any ideas

jmarantz (Contributor) commented May 4, 2023

Scanning the issue description, it seems like the problem is that previously # TYPE envoy_cluster_assignment_stale counter would be issued only once for all the tags under that name, and now the # TYPE line is issued for every tag.

It should be easy to repro in a unit test. I'm hoping we can fix forward quickly.

rulex123 (Contributor) commented May 5, 2023

I'm afraid this is not a simple issue with the way a metric is rendered; something seems amiss at a more fundamental level.

I can easily observe the same issue if I run Envoy locally with a few clusters configured; when I curl the Prometheus stats admin endpoint, I can see that the output is completely jumbled up. Here's an excerpt of the output:

# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="authz_cluster"} 0
# TYPE envoy_cluster_assignment_timeout_received counter
envoy_cluster_assignment_timeout_received{envoy_cluster_name="authz_cluster"} 0
# TYPE envoy_cluster_bind_errors counter
envoy_cluster_bind_errors{envoy_cluster_name="authz_cluster"} 0
# TYPE envoy_cluster_default_total_match_count counter
envoy_cluster_default_total_match_count{envoy_cluster_name="authz_cluster"} 1
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="health_cluster"} 0
# TYPE envoy_cluster_assignment_timeout_received counter
envoy_cluster_assignment_timeout_received{envoy_cluster_name="health_cluster"} 0
# TYPE envoy_cluster_bind_errors counter
envoy_cluster_bind_errors{envoy_cluster_name="health_cluster"} 0
# TYPE envoy_cluster_default_total_match_count counter
envoy_cluster_default_total_match_count{envoy_cluster_name="health_cluster"} 1

The expected result is that metrics with the same name (e.g. envoy_cluster_assignment_stale) but different tags (e.g. envoy_cluster_name) are grouped together and rendered as one "block", with one "# TYPE" line preceding each group, e.g.:

# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="authz_cluster"} 0
envoy_cluster_assignment_stale{envoy_cluster_name="health_cluster"} 0

I tried briefly to reproduce this issue in a test but haven't managed so far. I don't have any leads yet, so we might need to revert the changes in #24998, unless @jmarantz or @ggreenway have any ideas or pointers on what might be going on?

jmarantz (Contributor) commented May 5, 2023

@zirain would you be able to provide a bootstrap config to help repro? Thanks!

ggreenway (Contributor) commented:

@jmarantz it looks like @rulex123 said it was trivial to reproduce above, just not in a test.

I think unless we have a solid idea of how to fix it quickly we should revert. This is a major regression.

kyessenov (Contributor) commented:

You can take configs/envoy-demo and duplicate the cluster.

jmarantz (Contributor) commented May 5, 2023

I see. I'm thinking the problem is that tag-extracted names that need to be grouped together are coming from different Stats::Scope objects, and we are streaming them out from the first scope before examining the second scope.

This is going to require some thought on how to fix it without compromising the speed/memory benefits offered by the new algorithm @rulex123 deployed.
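The failure mode can be modeled in a few lines of Go (a toy sketch with hypothetical types; Envoy's actual implementation is C++ in the admin stats handler): when each scope is rendered and flushed independently, a family that appears under several scopes gets its # TYPE header re-emitted for every scope.

package main

import "fmt"

// Toy stand-ins for Envoy's scope/stat structures (hypothetical).
type scope struct {
	name     string         // scope name, e.g. "cluster"
	cluster  string         // tag value extracted from the stat names
	counters map[string]int // tag-extracted family name -> value
}

func renderStreaming(scopes []scope) {
	for _, s := range scopes {
		for family, v := range s.counters {
			// Each scope is flushed with no visibility into the
			// others, so the header is written again even if an
			// earlier scope already emitted it.
			fmt.Printf("# TYPE %s counter\n", family)
			fmt.Printf("%s{envoy_cluster_name=%q} %d\n", family, s.cluster, v)
		}
	}
}

func main() {
	stale := map[string]int{"envoy_cluster_assignment_stale": 0}
	renderStreaming([]scope{
		{name: "cluster", cluster: "grafana", counters: stale},
		{name: "cluster", cluster: "thanos", counters: stale},
	})
}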

jmarantz (Contributor) commented May 5, 2023

I'm pretty sure we can fix this by considering all the scopes named the same thing at the same time. E.g. don't stream out the results from a scope named "foo" until we encounter (in our alphabetic iteration through scopes) a scope named something other than "foo".

I have an attempt at a repro in progress but need to be AFK for some time. I'm pretty sure this is fix-forwardable but if folks want to roll back that makes sense.
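A minimal sketch of the grouping idea above, on the same hypothetical toy model (not Envoy's actual code): iterate scopes sorted by name, buffer rendered samples while consecutive scopes share a name, and emit a single # TYPE header per family only when the scope name changes.

package main

import (
	"fmt"
	"sort"
)

// Same toy stand-ins as the previous sketch (hypothetical).
type scope struct {
	name     string
	cluster  string
	counters map[string]int
}

func renderGrouped(scopes []scope) {
	sort.Slice(scopes, func(i, j int) bool { return scopes[i].name < scopes[j].name })
	pending := map[string][]string{} // family -> buffered sample lines
	flush := func() {
		for family, lines := range pending {
			fmt.Printf("# TYPE %s counter\n", family) // one header per family
			for _, line := range lines {
				fmt.Println(line)
			}
		}
		pending = map[string][]string{}
	}
	for i, s := range scopes {
		// Flush only after moving past every scope sharing this name.
		if i > 0 && s.name != scopes[i-1].name {
			flush()
		}
		for family, v := range s.counters {
			pending[family] = append(pending[family],
				fmt.Sprintf("%s{envoy_cluster_name=%q} %d", family, s.cluster, v))
		}
	}
	flush()
}

func main() {
	stale := map[string]int{"envoy_cluster_assignment_stale": 0}
	renderGrouped([]scope{
		{name: "cluster", cluster: "grafana", counters: stale},
		{name: "cluster", cluster: "thanos", counters: stale},
	})
}

The open question, per the previous comment, is whether this buffering can be bounded tightly enough to keep the speed/memory benefit of the streaming design.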

jmarantz (Contributor) commented May 6, 2023

I thought I had a theory and a repro in a unit test, but I don't. In fact I can't repro at all using ./bazel-bin/source/exe/envoy-static -c configs/envoy-demo.yaml and wget -q -O - 'localhost:8080/stats?format=prometheus' | grep "# TYPE " | sort | uniq -c | sort -r | head; I only get one instance of each TYPE. I was expecting it to fail due to the problem of aliasing two different counters from different scopes that wind up with the same tag-extracted name. But I couldn't get it to fail as expected; I'll have to come back to it, or maybe @rulex123 can pick it up.

Right now it feels to me like the reason I couldn't get it to fail is that the streaming behavior of stats_request.cc is not working as expected, and the buffering that's going on is enabling the mechanism to coalesce all the common tag-extracted names. It might be that with a few more clusters there's some flushing going on, but I haven't sorted it out.

I think I'll write a revert PR if no one else has yet.

#27224 is my attempt at a repro in a unit test. Note I used ThreadLocalStore in the unit test because I was thinking multiple distinct scopes were needed to repro this issue and that really doesn't happen with the TestStore instances based on IsolatedStoreImpl.

jmarantz (Contributor) commented May 6, 2023

It would be helpful if someone could post a complete repro, e.g. an attached file, maybe an altered version of configs/envoy-demo.yaml.

zirain (Member, Author) commented May 6, 2023

jmarantz (Contributor) commented May 6, 2023

All I need is the YAML config file; I don't think it matters whether it's in Docker or not.

zirain (Member, Author) commented May 6, 2023

jmarantz (Contributor) commented May 6, 2023

Thanks @zirain; that does it:

      4 # TYPE envoy_cluster_default_total_match_count counter
      4 # TYPE envoy_cluster_circuit_breakers_high_rq_retry_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_high_rq_pending_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_high_rq_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_high_cx_pool_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_high_cx_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_default_rq_retry_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_default_rq_pending_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_default_rq_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_default_cx_pool_open gauge

jmarantz added a commit that referenced this issue May 6, 2023
Commit Message: Reverts #24998

See #27173

Additional Description:
Risk Level: n/a
Testing: n/a
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
jmarantz (Contributor) commented May 6, 2023

Closing this issue as the PR was reverted. Following up in #16139 with further notes on why the algorithm didn't work as expected.

jmarantz added a commit that referenced this issue May 8, 2023
Commit Message: Reproduces a scenario where it's difficult to use a streaming optimization for Prometheus stats based on scopes without introducing a bug. Context:

Issue that the /stats/prometheus endpoint takes too much memory: Prometheus stats handler used too much memory. #16139
Solution for non-Prometheus endpoints to use less memory for /stats and run faster: admin: Streaming /stats implementation #19693, which I believe is working well.
This solution mapped to Prometheus: Prom stats perf improvements #24998
A case where this solution emits a duplicate # TYPE line due to two stats with the same tag-extracted stat name from two different scopes: stats: prometheus scrape failed #27173
Note that the existing unit tests did not exercise that scenario, so this PR adds a test case.

Signed-off-by: Joshua Marantz <[email protected]>