Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TestAgent] [Telemetry] cannot "instantiate" multiple agents with prometheus telemetry on #11273

Closed
acpana opened this issue Oct 11, 2021 · 4 comments
Labels
theme/testing Testing, and related enhancements

Comments

@acpana
Copy link
Contributor

acpana commented Oct 11, 2021

Overview

We leverage a TestAgent type to stand up agents in our unit tests. In init-ing a TestAgent and managing its lifecycle, we use a mix of internal APIs and testing functions.

When it comes to telemetry set up, TestAgent calls on InitTelemetry().

Multiple TestAgents with Prometheus metrics turned on cannot come up due to the underlying registration process.

Current Behavior

	hcl := `
	telemetry = {
		prometheus_retention_time = "5s",
		disable_hostname = true
	}
	`

	a1 := StartTestAgent(t, TestAgent{HCL: hcl})
	a2 := StartTestAgent(t, TestAgent{HCL: hcl})

-->

   : testagent.go:106: failed to create base deps: failed to initialize telemetry: duplicate metrics collector registration attempted

This is still a working theory but, it seems that because we configure the same gauges, counters and summaries, we essentially make the same collector hash and trigger the underlying error in the registration process:

https://github.com/prometheus/client_golang/blob/master/prometheus/registry.go#L331-L344

Or see[1] for a related possible root cause.

Desired Behavior

I want us to be able to instantiate multiple agents with telemetry (whether prometheus or not) turned on.

Proposed Fix

IMO, we need to do the following:

  • any "cleanup" needed to be done on shutdown in the consul agent/ TestAgent around telemetry management needs to be plumbed thru
  • probably for the actual cleanup, go-metrics needs to expose the functionality to "unregister" a prom sink.
  • This may need fixing first Issue ref: collector collisions go-metrics#128 -- see [1]

Misc

[1]: If the above statement is true the same gauges, counters and summaries, we essentially make the same collector, then I was expecting that prefixing the metrics with a different string to work. But this code snipped fails in the same way as well:

  	hcl1 := `
  	telemetry = {
		prometheus_retention_time = "5s"
		disable_hostname = true
	}
	`

	a1 := StartTestAgent(t, TestAgent{HCL: hcl1})

	hcl2 := `
	telemetry = {
		prometheus_retention_time = "5s"
		disable_hostname = true
                metrics_prefix = "agent_2"
	}
        `
	a2 := StartTestAgent(t, TestAgent{HCL: hcl2})

I think if we actually look at the PrometheusSink implementation in go-metrics the Describe() calls may actually cause the collisions. So it may well be that by actually describing the gauges, counters, summaries there that we could "fix" this problem.

@dnephin
Copy link
Contributor

dnephin commented Oct 13, 2021

Our PrometheusOpts has a Registerer field, and that type is the one that accepts our sink. If we were to allow InitTelemetry to set a custom Registerer instead of the default, would that fix this problem? We will also need a way to tell the AgentMetrics endpoint to use a non-default Gatherer, but that should be a small enough change as well.

@acpana
Copy link
Contributor Author

acpana commented Oct 13, 2021

If we were to allow InitTelemetry to set a custom Registerer instead of the default, would that fix this problem?

Yes!

That's certainly a great way to "own our destiny" (from the consul side). I'm not too sure if the trade offs of writing a custom Registry + Gatherer would be in our favor, but I may just be naive.

And even if it ends up being the "easier" changes, I think it's not as elegant as fixing this in go-metrics 🤔 . It feels that we'd "bloat" the consul lib module by leaking this abstraction

Open to more debate here, of course!

@acpana
Copy link
Contributor Author

acpana commented Oct 14, 2021

So I was playing around with the ergonomics of using lib.Telemetry.PrometheusOpts.Registerer.

  1. I tried at the TestAgent level which can look something like this:
result.RuntimeConfig.Telemetry.PrometheusOpts.Registerer = prometheus.NewRegistry()
prometheus.DefaultRegisterer = result.RuntimeConfig.Telemetry.PrometheusOpts.Registerer
prometheus.DefaultGatherer = result.RuntimeConfig.Telemetry.PrometheusOpts.Registerer.(prometheus.Gatherer)

Unfortunately, while we can create multiple TestAgents as such (since we only register one Collector per Registry), this seems to make our metrics disappear? I checked the flow thru consul/lib/Telemetry, go-metrics, and thru the AgentMetrucs endpoint to see what's falling thru the cracks but couldn't figure out what/ why.

  1. I tried adding it at the lib.InitTelemetry level but a similar side effect happened as described above.

IMO, even if we disregard the issue above (or pretend that I missed something obvious), I do believe that it makes for odd ergonomics to expose "prometheus things" outside of go-metrics.

So I was going to re-work hashicorp/go-metrics#129 to make the Describe() calls around Name and make the calls be idempotent.

@acpana
Copy link
Contributor Author

acpana commented Nov 1, 2021

I'm gonna close this issue for now since we can instantiate multiple TestAgents as per #11290.

There's probably still more to do regarding:

IMO, we need to do the following:
any "cleanup" needed to be done on shutdown in the consul agent/ TestAgent around telemetry management needs to be plumbed thru
probably for the actual cleanup, go-metrics needs to expose the functionality to "unregister" a prom sink.

but I'm gonna go out on a limb and say that "unregister"-ing/ cleanup should be mostly done by go-metrics (the underlying abstraction).

If there are needs to do it in the agent, we can revisit this issue.

@acpana acpana closed this as completed Nov 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/testing Testing, and related enhancements
Projects
None yet
Development

No branches or pull requests

2 participants