Don't expire Prometheus metrics that have been explicitly defined #123
#120 did excellent work to allow go-metrics consumers like Consul to define important metrics explicitly so they would not expire if they are not updated. The primary motivation was to ensure we always provide metric summaries and descriptions for the most important metrics even if they've not changed for a while, and that was achieved.
The `Expiration` option in go-metrics, though, was still acting on Summaries and Gauges - effectively resetting them to `NaN` after the expiry time. This was intentional per the PR and comment but has a few downsides.

The reset for summaries works by observing a `NaN` sample. While this works to force all the statistics in the summary to `NaN`, it doesn't really reset it, since the number of samples considered is still the total number of samples in the buffer - all the old samples plus the `NaN` sample. Also, as far as I can tell, the `NaN` sample remains in the buffer for the next time window, so even later legitimate samples will not really be shown correctly until the `NaN` is aged out. We hard code the summaries to have a 10 second window currently, so this was not a huge deal, but it was also not an especially useful "reset" semantic.

Revisiting the purpose of the `Expiration` option, it's really to prevent "ephemeral" metrics that are produced a few times and then never produced again from being held in memory forever. By definition, metrics that are explicitly defined by the application are never ephemeral and will always consume memory in that definition map, so expiring them at all doesn't seem to be valuable. As far as I can see, there is nothing gained from "forgetting" the value of any of these metrics and outputting `NaN` for some time until more values are recorded.

For context, I've just added a couple of gauges to Raft (hashicorp/raft#452, hashicorp/raft#454) that are critical to monitoring for a known and severe failure mode in Raft. These are not periodically emitted in a loop, as that would be less efficient, but it's hard to use them to monitor for this case effectively if you can't see constant output values, since the events that change them are often rare (restoring from snapshot typically only happens on a restart, which might have been weeks ago). I thought that by explicitly defining them in Consul I'd get constant output, but the `NaN` reset prevents that. On considering it further, I can't think of a case where the reset is actually desirable behaviour, at least for explicitly defined metrics.
@mkcp would love your thoughts though - perhaps I missed something where it is important that these reset even for explicitly defined values?
I added a test to ensure that explicitly defined metrics are not reset even after the expiry time has passed.