-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metric definitions for all metrics known at Consul start #9198
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
I mostly looked at the plumbing that reads in all the definitions, I haven't looked at the definitions themselves.
Left a couple concerns/suggestions
lib/telemetry.go
Outdated
// EmptyPrometheusDefs returns a PrometheusDefs struct where each of the slices have zero elements, but not nil. | ||
func EmptyPrometheusDefs() PrometheusDefs { | ||
return PrometheusDefs{ | ||
Gauges: []prometheus.GaugeDefinition{}, | ||
Counters: []prometheus.CounterDefinition{}, | ||
Summaries: []prometheus.SummaryDefinition{}, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The zero value for structs and slices is usable without being initialized, so I don't think this function is necessary. The one place that uses it can use lib.PrometheusDefs{}
which is equivalent, and avoids allocating memory for the slice headers.
Or, with my suggestion to remove that type, the other caller can pass in a TelemetryConfig
with the PrometheusOpts field that uses the zero value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is good to know - I was a bit concerned to pass a nil into go-metrics where I have less ability to make changes. Thankfully go does what you'd want here for once and doesn't panic w/ ranges over nil slices 😂. Seems a bit inconsistent compared to nil elsewhere, but I prefer this behavior so I'll gladly take it.
lib/telemetry.go
Outdated
// InitTelemetry configures go-metrics based on map of telemetry config | ||
// values as returned by Runtimecfg.Config(). | ||
func InitTelemetry(cfg TelemetryConfig) (*metrics.InmemSink, error) { | ||
func InitTelemetry(cfg TelemetryConfig, defs PrometheusDefs) (*metrics.InmemSink, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally if a function accepts a config or options struct all the config would be part of that struct. This convention seems appropriate here as well.
Instead of a new PrometheusDefs
struct, TelemetryConfig
can use a PrometheusOpts
field with type prometheus.PrometheusOpts
. This would remove the need for the PrometheusDefs
which closely mirrors an existing struct, and would also also use to remove the PrometheusRetentionTime
field, since that field is already in PrometheusOpts
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree w/ that - the prometheus defs were a workaround to avoid adding an extra arg for each of the definition slices. I was a bit concerned about RuntimeConfig not mirroring the config we get in from disk like it does in many places. But adding the PrometheusOpts on there and translating it in builder.go
does cut out of a bunch of boilerplate here.
agent/setup.go
Outdated
// Set Consul to each definition's namespace | ||
var withService []prometheus.GaugeDefinition | ||
for _, gauge := range g { | ||
gauge.Name = append(serviceName, gauge.Name...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there are two problems here:
- this assumes that the
TelemetryConfig.MetricsPrefix
will always beconsul
. That is the default value, but it can be changed by the user. I believe this will initialize the wrong metrics if a user specifies a different metrics prefix. append
does not modify the slice passed in as the first argument, but it can modify the underlying array used by that slice if the array is not at capacity. This is a subtle thing that can lead to very strange bugs. In this case the array backingserviceName
could be used for everygauge.Name
. So when the second iteration runs it will modify the value for the first iteration. Every gauge could end up with the same name, so only the very last one will get initialized. I believe it works currently because the array is being created with a capacity of 1, but to fix the first issue, that likely will not work anymore.
This is an example of the problem: https://play.golang.org/p/QFxtyGXrlqw
I believe the convention when using append is assign the result of append to the variable passed as the first argument. There are only a few rare cases when it can be assigned to another variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since go-metrics
is the code that adds the prefix to the name it seems like it should also handle prepending this name. Otherwise every caller will have to guess at how go-metrics
is building up the final metric name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should absolutely be using TelemetryConfig.MetricsPrefix
that is a bug.
With the slice prepends, I agree that these should be handled by go-metrics -- I think for this release it's ok if we handle them here so long as we write an issue for it. (I've added some todos as well) It would be impossible to assign the var back to the first var and still be performing a prepend operation, we'd be appending in that case. It looks like if we allocate a new literal for each prepend, we avoid any potential bugs. https://play.golang.org/p/F_zPCegmka7
Here's the prometheus page with a different metrics_prefix
supplied and using the literals in the commit following this comment.
…teral to avoid bugs from append modifying its first arg
CI's failing for real, gotta fix the runtime_config test... |
…hub.com/hashicorp/consul into mkcp/telemetry/add-all-metric-definitions
Done wrestling with the runtime and lib config tests. I worked around |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Couple very minor suggestions for follow ups, but nothing blocking
case reflect.Struct: | ||
if f.Type() == reflect.TypeOf(prometheus.PrometheusOpts{}) { | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find any callers of TelemetryConfig.MergeDefaults
in OSS or Ent. I suspect this MergeDefaults
function is leftover from a while ago. The only place I see it called is a test, which is I guess what failed and prompted this change. It looks like the only non-test call to it was removed in 65be587
Not urgent right now, but in a follow up we could delete this method I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a note comment
.changelog/9198.txt
Outdated
@@ -0,0 +1,3 @@ | |||
```release-note:improvement | |||
server: All metrics should be present and available to prometheus scrapers when Consul starts. If any non-deprecated metrics are missing please submit an issue with its name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very minor: I know we aren't always consistent with our categories, but I wonder if agent:
would be more appropriate than server:
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - client agents are affected too
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/283165. |
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/283196. |
🍒✅ Cherry pick of commit d15b6fd onto |
…-definitions Add metric definitions for all metrics known at Consul start
…-definitions Add metric definitions for all metrics known at Consul start
Followup question - does this changes renders |
No, Found a few warning in Consul when gathering the metrics however: #9471 |
@mkcp This patch did create a few issues in Consul 1.9.x as described in #9471 , but I think I have a patch in hashicorp/go-metrics#122 |
I went through to every metric written in Consul that I could find and created metric definitions for them. I also declare each raft metric that we consider a "key metric" for Consul - though we'll want to migrate those out to the raft lib in a future patch.
I then went back over the telemetry.mdx list to grab the help text and ensure I got all of our metrics. I don't guarantee that this list is complete, the search list is long and a few may have eluded me that weren't in telemetry.mdx, but it's pretty dang close. Everything that's critical for monitoring Consul is present. Any discrepancies I found between telemetry.mdx and Consul's codebase are documented here #9197. To keep the scope of this manageable: any metrics missing from telemetry.mdx I did not give a help text, nor did I delete any "stale" metrics in telemetry.mdx that no longer appear in Consul's codebase.
One important note for review: beside all of the definitions themselves, I do add functionality in the
agent/setup.go
file. There I reference all of our metrics in Consul and add the service namespace to them. Iterating over each metric to add the namespace in the agent prevents us from having to add the namespace to every definition by hand, leading to a difference between defining and using the metrics (see #9182 for the bug caused by putting the namespace in when writing the metric).Here's the result! Beautiful metrics with beautiful helptext, guaranteed to be present throughout Consul's lifecycle, even if we haven't measured them within the prometheus_retention_time.
![Screen Shot 2020-11-13 at 4 31 17 PM](https://user-images.githubusercontent.com/938395/99133316-571de600-25ce-11eb-9aa4-176520ba495f.png)
(Finally 😉) Resolves #5140