I noticed that our data seemed to get "stuck" on a value and flatline. After some debugging, it appears that the Tempest stopped sending UDP messages, but my alerting didn't fire because metrics were still being published, just with stale values. This is because a gauge holds onto its previous value indefinitely and never expires it (which makes sense in some situations, but not others).
This PR wraps the `prom-client` Gauge instances to keep track of a "last seen" time. If a gauge hasn't been updated within a given period, it is automatically removed. Once the gauge is set again, the "last seen" time is reset and the gauge's value is published for another `timeout` period. It doesn't solve the underlying issue of the Tempest failing to publish UDP messages, but it should make the rest of the ingestion / alerting pipeline better by correctly clearing those metrics if we haven't gotten an update.

Other metrics libraries have this feature built in, such as Rust's `metrics` (see example usage here). It doesn't seem that npm's `prom-client` supports this feature, so I may consider filing a feature request to see the feasibility. This feature should be possible to add in a backwards-compatible way (by default, don't expire metrics).
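
For readers who want a feel for the "last seen" idea, here is a minimal sketch of how such a wrapper could look. It is not the code in this PR: `ExpiringGauge`, `GaugeLike`, `InMemoryGauge`, `sweep`, and `keyOf` are hypothetical names, and the injected clock is an assumption made so the behaviour can be checked without real timers. The `GaugeLike` interface only assumes `set(labels, value)` and `remove(labels)`, which is the shape a `prom-client` Gauge offers, so a real Gauge could be passed in place of the in-memory stand-in.

```typescript
type Labels = Record<string, string>;

// The only gauge operations this sketch relies on (hypothetical interface).
interface GaugeLike {
  set(labels: Labels, value: number): void;
  remove(labels: Labels): void;
}

// Stable string key for a label set, independent of property order.
function keyOf(labels: Labels): string {
  return JSON.stringify(Object.keys(labels).sort().map((k) => [k, labels[k]]));
}

// Hypothetical wrapper: remembers when each label set was last written and
// removes label sets that have not been written for longer than timeoutMs.
class ExpiringGauge {
  private readonly gauge: GaugeLike;
  private readonly timeoutMs: number;
  private readonly now: () => number;
  private readonly lastSeen = new Map<string, { labels: Labels; at: number }>();

  constructor(gauge: GaugeLike, timeoutMs: number, now: () => number = Date.now) {
    this.gauge = gauge;
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Forward the value and reset the "last seen" time for this label set.
  set(labels: Labels, value: number): void {
    this.gauge.set(labels, value);
    this.lastSeen.set(keyOf(labels), { labels, at: this.now() });
  }

  // Remove every label set older than the timeout; returns how many were removed.
  // Intended to be called periodically, e.g. from setInterval or before a scrape.
  sweep(): number {
    const cutoff = this.now() - this.timeoutMs;
    let removed = 0;
    this.lastSeen.forEach((entry, key) => {
      if (entry.at < cutoff) {
        this.gauge.remove(entry.labels);
        this.lastSeen.delete(key);
        removed++;
      }
    });
    return removed;
  }
}

// In-memory stand-in for a real gauge, used only for this demonstration.
class InMemoryGauge implements GaugeLike {
  readonly values = new Map<string, number>();
  set(labels: Labels, value: number): void {
    this.values.set(keyOf(labels), value);
  }
  remove(labels: Labels): void {
    this.values.delete(keyOf(labels));
  }
}

// Demonstration with a fake clock (milliseconds).
let clock = 0;
const backing = new InMemoryGauge();
const temperature = new ExpiringGauge(backing, 60000, () => clock);

temperature.set({ station: "tempest" }, 21.5);
clock = 30000;
temperature.sweep(); // within the timeout, value is kept
clock = 61000;
temperature.sweep(); // past the timeout, value is removed
temperature.set({ station: "tempest" }, 22.0); // a new update publishes it again
```

Sweeping on an interval (rather than one timer per label set) keeps the bookkeeping in one place; the trade-off is that a stale value can linger for up to one sweep interval beyond the timeout.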