client: Return empty values when host stats fail #6349
Currently, there is an issue when running on Windows whereby, under some circumstances, the Windows stats APIs begin to return errors (such as internal timeouts) when a client is under high load, and potentially under other forms of resource contention or system states (and other unknown cases). When an error occurs during this collection, we short-circuit further metrics emission from the client until the next interval. This can be problematic if it happens for a sustained number of intervals, as our metrics aggregator will begin to age out older metrics, and we will eventually stop emitting various types of metrics, including `nomad.client.unallocated.*` metrics.
However, when metrics collection fails on Linux, gopsutil will in many cases (e.g. cpu.Times) silently return 0 values rather than an error.
Here, we switch to returning empty metrics in these failures, and logging the error at the source. This brings the behaviour into line with Linux/Unix platforms and, although it makes aggregation a little sadder on intermittent failures, results in the more desirable overall behaviour of keeping metrics available for further investigation if things look unusual.
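A minimal sketch of the "empty value plus log at the source" behaviour described above. This is not the code from this PR: the `collector` type, its function fields, and the simplified `CPUStats`/`MemoryStats`/`HostStats` structs are hypothetical stand-ins, and the standard library `log` package stands in for the client's logger.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Hypothetical, simplified stand-ins for the client's host stats types.
type CPUStats struct{ User, System, Idle float64 }
type MemoryStats struct{ Total, Available uint64 }

type HostStats struct {
	CPU    []CPUStats
	Memory MemoryStats
}

// errTimeout simulates the kind of error a platform stats API might return.
var errTimeout = errors.New("internal timeout")

// collector holds the platform probes as function fields so that a failure
// can be simulated (assumption: the real collector calls gopsutil directly).
type collector struct {
	cpu func() ([]CPUStats, error)
	mem func() (MemoryStats, error)
}

// Collect never short-circuits: a failing probe is logged where it fails and
// replaced with an empty value, so the remaining stats are still gathered.
func (c *collector) Collect() *HostStats {
	hs := &HostStats{}

	cpu, err := c.cpu()
	if err != nil {
		log.Printf("failed to collect cpu stats: %v", err)
		cpu = []CPUStats{}
	}
	hs.CPU = cpu

	mem, err := c.mem()
	if err != nil {
		log.Printf("failed to collect memory stats: %v", err)
		mem = MemoryStats{}
	}
	hs.Memory = mem

	return hs
}

func main() {
	c := &collector{
		cpu: func() ([]CPUStats, error) { return nil, errTimeout },
		mem: func() (MemoryStats, error) {
			return MemoryStats{Total: 8 << 30, Available: 2 << 30}, nil
		},
	}
	hs := c.Collect()
	fmt.Printf("cpus=%d mem_total=%d\n", len(hs.CPU), hs.Memory.Total)
}
```

In this sketch the CPU probe fails, yet the memory stats still reach the caller, which is the trade-off argued for above: incomplete metrics rather than none.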
Minor code stylistic points. It does seem reasonable to me to publish metrics we are able to collect as best effort; it makes sense to prioritize incomplete metrics over no metrics at all if we cannot get complete metrics and don't know how to get them.
Would love to have another review by folks with more client and metrics knowledge, in case of any promises we make about invariants in metrics.
Also, failing test seems relevant.
@@ -1335,10 +1335,14 @@ func (tr *TaskRunner) emitStats(ru *cstructs.TaskResourceUsage) {

	if ru.ResourceUsage.MemoryStats != nil {
		tr.setGaugeForMemory(ru)
	} else {
		tr.logger.Debug("Skipping memory stats for allocation", "reason", "MemoryStats is nil")
I would adjust the casing; we typically use lower-case messages. Also, "MemoryStats is nil" is a low-level implementation detail; it would be nice to have a user-facing message, maybe something like:
- tr.logger.Debug("Skipping memory stats for allocation", "reason", "MemoryStats is nil")
+ tr.logger.Debug("memory stats for alloc is unpopulated; skipping publishing them")
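As a rough illustration of the guard under discussion, the sketch below skips any stats group that is unpopulated and logs a lower-case, user-facing message instead. The `ResourceUsage` struct, the gauge names, and the `emitStats` signature are simplified assumptions, not the TaskRunner code, and the standard library `log` package stands in for `tr.logger`.

```go
package main

import (
	"fmt"
	"log"
)

// Hypothetical, simplified stand-ins for the task resource usage types.
type MemoryStats struct{ RSS uint64 }
type CPUStats struct{ Percent float64 }

type ResourceUsage struct {
	MemoryStats *MemoryStats
	CPUStats    *CPUStats
}

// emitStats returns the names of the gauges it would publish. A nil stats
// group is skipped with a log line rather than dereferenced.
func emitStats(ru *ResourceUsage) []string {
	var published []string

	if ru.MemoryStats != nil {
		published = append(published, "memory.rss")
	} else {
		log.Print("memory stats for alloc is unpopulated; skipping publishing them")
	}

	if ru.CPUStats != nil {
		published = append(published, "cpu.percent")
	} else {
		log.Print("cpu stats for alloc is unpopulated; skipping publishing them")
	}

	return published
}

func main() {
	// Only CPU stats are populated here, so only the CPU gauge is published.
	fmt.Println(emitStats(&ResourceUsage{CPUStats: &CPUStats{Percent: 12.5}}))
}
```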
@@ -117,39 +116,45 @@ func (h *HostStatsCollector) collectLocked() error {
	// Determine up-time
Should this and Collect() still return an error? It may make sense to make them void functions instead; that would simplify the client.go implementation.
Yeah - I'm planning on going back and refactoring this a little bit next week. Mostly wanted to get something together to help folks with debugging.
If it is intended to mimic gopsutil behavior, then maybe we have the collect*Stats funcs return the default values along with the error, instead of needing to check and set them during error handling.
@nickethier I think I'm ok with having the defaults set here, in the same place as handling the error, mostly because it gives us a single place to look at and change things under those conditions, but I'm not set on it.
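For comparison, the alternative raised in this thread (a collect*Stats function that hands back a default value together with the error) might look roughly like the sketch below. `collectDiskStats`, `DiskStats`, and the injected `probe` function are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// DiskStats is a hypothetical, simplified stand-in for the client's type.
type DiskStats struct {
	Device     string
	Size, Used uint64
}

// errProbe simulates a failing platform call.
var errProbe = errors.New("disk probe failed")

// collectDiskStats returns a usable empty value alongside the error, so the
// caller only has to log the error and never has to reset the result itself.
func collectDiskStats(probe func() ([]DiskStats, error)) ([]DiskStats, error) {
	stats, err := probe()
	if err != nil {
		return []DiskStats{}, fmt.Errorf("collecting disk stats: %w", err)
	}
	return stats, nil
}

func main() {
	stats, err := collectDiskStats(func() ([]DiskStats, error) { return nil, errProbe })
	if err != nil {
		fmt.Println("logged:", err)
	}
	fmt.Println("entries:", len(stats))
}
```

The trade-off is the one noted in the reply above: this spreads the defaults across each collect function, whereas setting them next to the error handling keeps them in a single place.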
Alternatives
There are a few alternative approaches, including only merging "client_stats: Always emit client stats", but this has the downside of keeping the status quo, in which a single failing host collector breaks many metrics. We could also try to do something a bit smarter, such as not emitting only the undetected metrics rather than emitting 0 values, but I'm not sure how valuable that would be.