
stats: prometheus scrape failed #27173

Closed
zirain opened this issue May 4, 2023 · 17 comments

zirain (Member) commented May 4, 2023

The following Go code:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
	"time"

	"github.com/prometheus/common/expfmt"
)

func main() {
	_, body, err := HTTPGet(fmt.Sprintf("http://127.0.0.1:%d/stats/prometheus", 9901))
	if err != nil {
		fmt.Println("get metric error: %v", err)
		return
	}
	reader := strings.NewReader(body)
	_, err = (&expfmt.TextParser{}).TextToMetricFamilies(reader)
	if err != nil {
		fmt.Println("parse metric error: %v body: %s", err, body)
	}
}

func HTTPGet(url string) (code int, respBody string, err error) {
	log.Println("HTTP GET", url)
	client := &http.Client{Timeout: time.Minute}
	resp, err := client.Get(url)
	if err != nil {
		log.Println(err)
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Println(err)
		return 0, "", err
	}
	respBody = string(body)
	code = resp.StatusCode
	return code, respBody, nil
}

will pass with envoyproxy/envoy:dev-01aed8aa302d707a5598dbfd063dcfe3b31655f8 but fail with envoyproxy/envoy:dev-6301fefe405d5c9bade12658034f17f5ad4cdd3c.

after.txt
before.txt

seems related to #24998

cc @kyessenov @rulex123 @jmarantz

zirain added the bug and triage (Issue requires triage) labels on May 4, 2023
rulex123 (Contributor) commented May 4, 2023

It seems that some metrics are now rendered multiple times (effectively being treated as independent metrics rather than the same metric with differing tags).

Example of a problematic metric in after.txt:

# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="forward_cluster"} 0
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="grafana"} 0
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="local_hugo_serve"} 0
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="thanos"} 0

vs. the same metric in before.txt:

# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="forward_cluster"} 0
envoy_cluster_assignment_stale{envoy_cluster_name="grafana"} 0
envoy_cluster_assignment_stale{envoy_cluster_name="local_hugo_serve"} 0
envoy_cluster_assignment_stale{envoy_cluster_name="thanos"} 0
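For context on why the scrape fails outright: the Prometheus text parser treats a second # TYPE line for an already-seen family as a parse error rather than a continuation. A minimal sketch, feeding the duplicated block above to the same expfmt parser used in the repro:

package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Two "# TYPE" headers for the same family, as rendered in after.txt.
	bad := `# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="forward_cluster"} 0
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="grafana"} 0
`
	_, err := (&expfmt.TextParser{}).TextToMetricFamilies(strings.NewReader(bad))
	// err reports a second TYPE line for "envoy_cluster_assignment_stale".
	fmt.Println(err)
}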

rulex123 (Contributor) commented May 4, 2023

Based on the attached after.txt, the metrics whose output is incorrect are:

grep "# TYPE"  after.txt | sort | uniq -c |sort -r
   4 # TYPE envoy_cluster_default_total_match_count counter
   4 # TYPE envoy_cluster_circuit_breakers_high_rq_retry_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_high_rq_pending_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_high_rq_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_high_cx_pool_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_high_cx_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_rq_retry_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_rq_pending_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_rq_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_cx_pool_open gauge
   4 # TYPE envoy_cluster_circuit_breakers_default_cx_open gauge
   4 # TYPE envoy_cluster_bind_errors counter
   4 # TYPE envoy_cluster_assignment_timeout_received counter
   4 # TYPE envoy_cluster_assignment_stale counter
   3 # TYPE envoy_cluster_lb_zone_routing_sampled counter
   3 # TYPE envoy_cluster_lb_zone_routing_cross_zone counter
   3 # TYPE envoy_cluster_lb_zone_routing_all_directly counter
   3 # TYPE envoy_cluster_lb_zone_number_differs counter
   3 # TYPE envoy_cluster_lb_zone_no_capacity_left counter
   3 # TYPE envoy_cluster_lb_zone_cluster_too_small counter
   3 # TYPE envoy_cluster_lb_subsets_selected counter
   3 # TYPE envoy_cluster_lb_subsets_removed counter
   3 # TYPE envoy_cluster_lb_subsets_fallback_panic counter
   3 # TYPE envoy_cluster_lb_subsets_fallback counter
   3 # TYPE envoy_cluster_lb_subsets_created counter
   3 # TYPE envoy_cluster_lb_subsets_active gauge
   3 # TYPE envoy_cluster_lb_recalculate_zone_structures counter
   3 # TYPE envoy_cluster_lb_local_cluster_not_ok counter
   3 # TYPE envoy_cluster_lb_healthy_panic counter
   2 # TYPE envoy_cluster_retry_or_shadow_abandoned counter
   2 # TYPE envoy_cluster_original_dst_host_invalid counter
   2 # TYPE envoy_cluster_membership_total gauge
   2 # TYPE envoy_cluster_membership_healthy gauge
   2 # TYPE envoy_cluster_membership_excluded gauge
   2 # TYPE envoy_cluster_membership_degraded gauge
   2 # TYPE envoy_cluster_membership_change counter
   2 # TYPE envoy_cluster_max_host_weight gauge
   1 # TYPE envoy_workers_watchdog_miss counter

while all other stats are rendered correctly. As you can see, the duplicate counts progressively decrease. Hm, weird: I wonder if this is due to something specific to these tests or if there's some underlying issue introduced by #24998.

yanavlasov added the area/stats label and removed the triage (Issue requires triage) label on May 4, 2023
yanavlasov (Contributor) commented:

@ggreenway for any ideas

jmarantz (Contributor) commented May 4, 2023

Scanning the issue description, it seems like the problem is that previously # TYPE envoy_cluster_assignment_stale counter would be issued only once for all the tags under that name, and now the # TYPE line is issued for every tag.

It should be easy to repro in a unit test. I'm hoping we can fix forward quickly.

rulex123 (Contributor) commented May 5, 2023

I'm afraid this is not a simple issue with the way a metric is rendered; something seems amiss at a more fundamental level.

I can easily observe the same issue if I run Envoy locally with a few clusters configured; when I curl the Prometheus stats admin endpoint, I can see that the output is completely jumbled up. Here's an excerpt of the output:

# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="authz_cluster"} 0
# TYPE envoy_cluster_assignment_timeout_received counter
envoy_cluster_assignment_timeout_received{envoy_cluster_name="authz_cluster"} 0
# TYPE envoy_cluster_bind_errors counter
envoy_cluster_bind_errors{envoy_cluster_name="authz_cluster"} 0
# TYPE envoy_cluster_default_total_match_count counter
envoy_cluster_default_total_match_count{envoy_cluster_name="authz_cluster"} 1
# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="health_cluster"} 0
# TYPE envoy_cluster_assignment_timeout_received counter
envoy_cluster_assignment_timeout_received{envoy_cluster_name="health_cluster"} 0
# TYPE envoy_cluster_bind_errors counter
envoy_cluster_bind_errors{envoy_cluster_name="health_cluster"} 0
# TYPE envoy_cluster_default_total_match_count counter
envoy_cluster_default_total_match_count{envoy_cluster_name="health_cluster"} 1

The expected result is that metrics with the same name (e.g. envoy_cluster_assignment_stale) but different tags (e.g. envoy_cluster_name) are grouped together and rendered as one "block", with one "# TYPE" line preceding each group, e.g.:

# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="authz_cluster"} 0
envoy_cluster_assignment_stale{envoy_cluster_name="health_cluster"} 0

I tried briefly to reproduce this issue in a test but haven't managed so far. I don't have any leads yet, so we might need to revert the changes in #24998, unless @jmarantz or @ggreenway have any ideas or pointers on what might be going on?

jmarantz (Contributor) commented May 5, 2023

@zirain would you be able to provide a bootstrap config to help repro? Thanks!

ggreenway (Contributor) commented:

@jmarantz it looks like @rulex123 said it was trivial to reproduce above, just not in a test.

I think unless we have a solid idea of how to fix it quickly we should revert. This is a major regression.

kyessenov (Contributor) commented:

You can take configs/envoy-demo and duplicate the cluster.

jmarantz (Contributor) commented May 5, 2023

I see. I'm thinking the problem is that tag-extracted names that need to be grouped together are coming from different Stats::Scope objects, and we are streaming them out from the first scope before examining the second scope.

This is going to require some thought on how to fix it without compromising the speed/memory benefits offered by the new algorithm @rulex123 deployed.
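The failure mode can be modeled in a few lines of Go (a toy sketch with hypothetical types; Envoy's actual implementation is C++ in the admin stats handler): when each scope is rendered and flushed independently, a family that appears under several scopes gets its # TYPE header re-emitted for every scope.

package main

import "fmt"

// Toy stand-ins for Envoy's scope/stat structures (hypothetical).
type scope struct {
	name     string         // scope name, e.g. "cluster"
	cluster  string         // tag value extracted from the stat names
	counters map[string]int // tag-extracted family name -> value
}

func renderStreaming(scopes []scope) {
	for _, s := range scopes {
		for family, v := range s.counters {
			// Each scope is flushed with no visibility into the
			// others, so the header is written again even if an
			// earlier scope already emitted it.
			fmt.Printf("# TYPE %s counter\n", family)
			fmt.Printf("%s{envoy_cluster_name=%q} %d\n", family, s.cluster, v)
		}
	}
}

func main() {
	stale := map[string]int{"envoy_cluster_assignment_stale": 0}
	renderStreaming([]scope{
		{name: "cluster", cluster: "grafana", counters: stale},
		{name: "cluster", cluster: "thanos", counters: stale},
	})
}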

jmarantz (Contributor) commented May 5, 2023

I'm pretty sure we can fix this by considering all the scopes named the same thing at the same time. E.g. don't stream out the results from a scope named "foo" until we encounter (in our alphabetic iteration through scopes) a scope named something other than "foo".

I have an attempt at a repro in progress but need to be AFK for some time. I'm pretty sure this is fix-forwardable but if folks want to roll back that makes sense.
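A minimal sketch of the grouping idea above, on the same hypothetical toy model (not Envoy's actual code): iterate scopes sorted by name, buffer rendered samples while consecutive scopes share a name, and emit a single # TYPE header per family only when the scope name changes.

package main

import (
	"fmt"
	"sort"
)

// Same toy stand-ins as the previous sketch (hypothetical).
type scope struct {
	name     string
	cluster  string
	counters map[string]int
}

func renderGrouped(scopes []scope) {
	sort.Slice(scopes, func(i, j int) bool { return scopes[i].name < scopes[j].name })
	pending := map[string][]string{} // family -> buffered sample lines
	flush := func() {
		for family, lines := range pending {
			fmt.Printf("# TYPE %s counter\n", family) // one header per family
			for _, line := range lines {
				fmt.Println(line)
			}
		}
		pending = map[string][]string{}
	}
	for i, s := range scopes {
		// Flush only after moving past every scope sharing this name.
		if i > 0 && s.name != scopes[i-1].name {
			flush()
		}
		for family, v := range s.counters {
			pending[family] = append(pending[family],
				fmt.Sprintf("%s{envoy_cluster_name=%q} %d", family, s.cluster, v))
		}
	}
	flush()
}

func main() {
	stale := map[string]int{"envoy_cluster_assignment_stale": 0}
	renderGrouped([]scope{
		{name: "cluster", cluster: "grafana", counters: stale},
		{name: "cluster", cluster: "thanos", counters: stale},
	})
}

The open question, per the previous comment, is whether this buffering can be bounded tightly enough to keep the speed/memory benefit of the streaming design.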

jmarantz (Contributor) commented May 6, 2023

I thought I had a theory and a repro in a unit test, but I don't. In fact I can't repro at all using ./bazel-bin/source/exe/envoy-static -c configs/envoy-demo.yaml and wget -q -O - 'localhost:8080/stats?format=prometheus' | grep "# TYPE " | sort | uniq -c | sort -r | head; I only get one instance of each TYPE. I was expecting it to fail due to the problem of aliasing two different counters from different scopes that wind up with the same tag-extracted name. But I couldn't get it to fail as expected; I'll have to come back to it, or maybe @rulex123 can pick it up.

Right now it feels to me like the reason I couldn't get it to fail is that the streaming behavior of stats_request.cc is not working as expected, and the buffering that's going on is enabling the mechanism to coalesce all the common tag-extracted names. It might be that with a few more clusters there's some flushing going on, but I haven't sorted it out.

I think I'll write a revert PR if no one else has yet.

#27224 is my attempt at a repro in a unit test. Note I used ThreadLocalStore in the unit test because I was thinking multiple distinct scopes were needed to repro this issue and that really doesn't happen with the TestStore instances based on IsolatedStoreImpl.

jmarantz (Contributor) commented May 6, 2023

It would be helpful if someone could post a complete repro, e.g. an attached file, maybe an altered version of configs/envoy-demo.yaml.

zirain (Member, Author) commented May 6, 2023

jmarantz (Contributor) commented May 6, 2023

All I need is the YAML config file; I don't think it matters whether it's in Docker or not.

zirain (Member, Author) commented May 6, 2023

jmarantz (Contributor) commented May 6, 2023

Thanks @zirain; that does it:

      4 # TYPE envoy_cluster_default_total_match_count counter
      4 # TYPE envoy_cluster_circuit_breakers_high_rq_retry_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_high_rq_pending_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_high_rq_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_high_cx_pool_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_high_cx_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_default_rq_retry_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_default_rq_pending_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_default_rq_open gauge
      4 # TYPE envoy_cluster_circuit_breakers_default_cx_pool_open gauge

jmarantz added a commit that referenced this issue May 6, 2023
Commit Message: Reverts #24998

See #27173

Additional Description:
Risk Level: n/a
Testing: n/a
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
jmarantz (Contributor) commented May 6, 2023

Closing this issue as the PR was reverted. Following up in #16139 with further notes on why the algorithm didn't work as expected.

jmarantz added a commit that referenced this issue May 8, 2023
Commit Message: Reproduces a scenario where it's difficult to use a streaming optimization for Prometheus stats based on scopes without introducing a bug. Context:

Issue that the /stats/prometheus endpoint takes too much memory: Prometheus stats handler used too much memory. #16139
Solution for non-Prometheus endpoints to use less memory for /stats and run faster: admin: Streaming /stats implementation #19693, which I believe is working well.
This solution mapped to Prometheus: Prom stats perf improvements #24998
A case where this solution emits a duplicate # TYPE line due to two stats with the same tag-extracted stat name from two different scopes: stats: prometheus scrape failed #27173
Note that the existing unit tests did not exercise that scenario, so this PR adds a test case.

Signed-off-by: Joshua Marantz <[email protected]>